How to implement full text search on complex nested JSONB in PostgreSQL

I have a pretty complex JSONB document stored in a single jsonb column.

The DB table looks like this:

CREATE TABLE sites (
    id text NOT NULL, 
    doc jsonb, 
    PRIMARY KEY (id) 
)

The data we store in the doc column is a complex nested JSONB document:

{ 
     "_id": "123", 
     "type": "Site", 
     "identification": "Custom ID", 
     "title": "SITE 1", 
     "address": "UK, London, Mr Tom's street, 2", 
     "buildings": [ 
      { 
       "uuid": "12312", 
       "identification": "Custom ID", 
       "name": "BUILDING 1", 
       "deposits": [ 
        { 
         "uuid": "12312", 
         "identification": "Custom ID",    
         "audits": [ 
          { 
          "uuid": "12312",   
           "sample_id": "SAMPLE ID"     
          } 
         ] 
        } 
       ] 
      } 
     ] 
    }

So the structure of my JSONB looks like this:

SITE
    -> ARRAY OF BUILDINGS
        -> ARRAY OF DEPOSITS
            -> ARRAY OF AUDITS

We need to implement full text search on some of the values in each type of entry:

SITE (identification, title, address) 
BUILDING (identification, name) 
DEPOSIT (identification) 
AUDIT (sample_id)

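The SQL query should run the full text search on these field values only.

I guess I need to use GIN indexes and something like tsvector, but I don't have enough PostgreSQL background.

So my question is: is it possible to index, and then query, such nested JSONB structures?

Source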

2017-08-14 rusllonrails

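A first shot would be to "denormalize" the storage: sacrifice some storage space for the sake of simplicity. Extract the required data into separate fields (site, building, deposit, audit), each holding a plain string concatenation of the needed fields, i.e. building.identification || ';' || building.title || ';' || building.address etc. (this can be done with postgres functions as default values, or with trigger-based functions if your data gets modified). Then create GIN indexes on those fields, and build the corresponding full text queries on them –

Thanks @IlyaDyoshin. I like your idea and will try to experiment with it. – rusllonrails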

Or you can wait until 10.0 is released, where json/jsonb FTS will be a first-class citizen https://www.postgresql.org/docs/10/static/release-10.html –
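For readers on PostgreSQL 10 or later, the built-in JSONB support mentioned in this comment can be sketched roughly as follows. This is only an illustration, not something from the thread: the index name i_sites_doc_fts is made up, and to_tsvector applied to a jsonb value collects every string value in the document, so it does not limit the search to the fields listed in the question.

```sql
-- Assumes PostgreSQL 10+, where to_tsvector accepts a jsonb argument.
-- i_sites_doc_fts is a hypothetical index name.
create index i_sites_doc_fts on sites using gin (to_tsvector('simple', doc));

-- The query must repeat the indexed expression for the index to be usable.
-- Note: this matches ANY string value in doc, including "type" and "uuid".
select *
from sites
where to_tsvector('simple', doc) @@ to_tsquery('simple', 'london');
```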

Let's add a new column of tsvector type:

alter table sites add column tsvector tsvector;

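Now let's create a trigger that will collect the lexemes, organize them, and put them into our tsvector. We will use 4 groups (A, B, C, D); this is a special tsvector feature that lets you distinguish lexemes later, at search time (see the examples in the manual https://www.postgresql.org/docs/current/static/textsearch-controls.html; unfortunately, this feature supports at most 4 groups because the developers reserved only 2 bits for it, but we are lucky here, since we need exactly 4 groups):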

create or replace function t_sites_tsvector() returns trigger as $$
declare
    dic regconfig;
    part_a text;
    part_b text;
    part_c text;
    part_d text;
begin
    dic := 'simple'; -- change if you need more advanced word processing (stemming, etc)

    -- group A: site-level fields
    part_a := coalesce(new.doc->>'identification', '') || ' ' || coalesce(new.doc->>'title', '') || ' ' || coalesce(new.doc->>'address', '');

    -- group B: identification and name of every building
    select into part_b string_agg(coalesce(a, ''), ' ') || ' ' || string_agg(coalesce(b, ''), ' ')
    from (
        select
            jsonb_array_elements((new.doc->'buildings'))->>'identification',
            jsonb_array_elements((new.doc->'buildings'))->>'name'
    ) _(a, b);

    -- group C: identification of every deposit in every building
    select into part_c string_agg(coalesce(c, ''), ' ')
    from (
        select jsonb_array_elements(b)->>'identification' from (
            select jsonb_array_elements((new.doc->'buildings'))->'deposits'
        ) _(b)
    ) __(c);

    -- group D: sample_id of every audit in every deposit
    select into part_d string_agg(coalesce(d, ''), ' ')
    from (
        select jsonb_array_elements(c)->>'sample_id'
        from (
            select jsonb_array_elements(b)->'audits' from (
                select jsonb_array_elements((new.doc->'buildings'))->'deposits'
            ) _(b)
        ) __(c)
    ) ___(d);

    -- string_agg yields null when there is nothing to aggregate, and one null
    -- part would make the whole concatenation null, hence the coalesce calls
    new.tsvector := setweight(to_tsvector(dic, part_a), 'A')
        || setweight(to_tsvector(dic, coalesce(part_b, '')), 'B')
        || setweight(to_tsvector(dic, coalesce(part_c, '')), 'C')
        || setweight(to_tsvector(dic, coalesce(part_d, '')), 'D');
    return new;
end;
$$ language plpgsql;

create trigger t_sites_tsvector 
    before insert or update on sites for each row execute procedure t_sites_tsvector();

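^^ - scroll it, this snippet is bigger than it looks (especially if you are on MacOS w/o scrollbars...)

Now let's create a GIN index to speed up search queries (this makes sense if you have many rows, say, more than a few hundred or a few thousand):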

create index i_sites_fulltext on sites using gin(tsvector);

Now let's insert something to check:

insert into sites select 1, '{ 
     "_id": "123", 
     "type": "Site", 
     "identification": "Custom ID", 
     "title": "SITE 1", 
     "address": "UK, London, Mr Tom''s street, 2", 
     "buildings": [ 
      { 
       "uuid": "12312", 
       "identification": "Custom ID", 
       "name": "BUILDING 1", 
       "deposits": [ 
        { 
         "uuid": "12312", 
         "identification": "Custom ID", 
         "audits": [ 
          { 
          "uuid": "12312", 
           "sample_id": "SAMPLE ID" 
          } 
         ] 
        } 
       ] 
      } 
     ] 
    }'::jsonb;

Check with select * from sites; you should see that the tsvector column is filled with some data.

Now let's query it:

select * from sites where tsvector @@ to_tsquery('simple', 'sample');

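- it must return our record. In this case we search for the word 'sample', and we don't care which group it is found in.

Let's change it and try to search only in group A ("SITE (identification, title, address)" as you described it):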

select * from sites where tsvector @@ to_tsquery('simple', 'sample:A');

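- this must return nothing, because the word 'sample' sits only in group D ("AUDIT (sample_id)"). Indeed, restricting the same query to group D, i.e. using 'sample:D' as the query string, will again return our record.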

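Note that you need to use to_tsquery(..) rather than plainto_tsquery(..) to be able to address the 4 groups. That means you have to sanitize the input yourself (escape or remove special characters such as & and |, because they have a special meaning in tsquery values).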
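One possible way to do that sanitizing on the database side is sketched below. The function name safe_weighted_tsquery and its parameters are hypothetical and not part of the answer; the sketch simply keeps alphanumeric runs from the user input, attaches the requested weight labels to each word, and requires all words to match.

```sql
-- Hypothetical helper: turns raw user input into a weight-restricted tsquery.
-- weights must be a combination of the letters A, B, C, D (e.g. 'A' or 'CD').
create or replace function safe_weighted_tsquery(user_input text, weights text)
returns tsquery as $$
    select to_tsquery('simple', string_agg(word || ':' || weights, ' & '))
    from regexp_split_to_table(lower(user_input), '[^[:alnum:]]+') as word
    where word <> ''                -- drop empty pieces at the edges
      and weights ~ '^[ABCD]+$';    -- refuse anything but valid weight labels
$$ language sql stable;

-- Usage: operators and punctuation in the input are split away, not parsed.
select *
from sites
where tsvector @@ safe_weighted_tsquery('Tom''s street & | !', 'A');
```

If the input contains no usable words, or the weights argument is invalid, the function returns null and the query matches no rows.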

And the good news is that you can combine different groups in a single query, like this:

select * from sites where tsvector @@ to_tsquery('simple', 'sample:D & london:A');
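The same labels also influence ranking, which is what the manual page linked earlier mainly uses them for. A minimal sketch, assuming the table and trigger above: ts_rank is a built-in function, and with its default weights a match in group A counts more than a match in group D.

```sql
-- Order matching rows by relevance; "query" and "rank" are just local aliases.
select id, ts_rank(tsvector, query) as rank
from sites, to_tsquery('simple', 'custom | sample') as query
where tsvector @@ query
order by rank desc;
```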

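Another way to go (e.g. if you have to work with more than 4 groups) is to have multiple tsvectors, each sitting in a separate column, build them with a single query, create an index (you can create a single index on multiple tsvector columns), and write queries that address the individual columns. This is similar to what I explained above, but possibly less efficient.

Hope this helps.

Source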
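A rough sketch of that multi-column variant is shown here. The column names ts_site and ts_audit and the index name are made up for illustration, and the columns would still have to be filled by a trigger along the lines of the one above.

```sql
-- Hypothetical columns: one tsvector per entry type instead of weight labels.
alter table sites
    add column ts_site tsvector,
    add column ts_audit tsvector;

-- A single multicolumn GIN index covering both tsvector columns.
create index i_sites_ts_multi on sites using gin (ts_site, ts_audit);

-- Each group is addressed through its own column.
select *
from sites
where ts_audit @@ to_tsquery('simple', 'sample')
  and ts_site  @@ to_tsquery('simple', 'london');
```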

2017-08-22 03:09:44 Nick

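Thank you very much @Nick. I'll take a look at your suggestion soon. – rusllonrails

Sure. Let me know if smth is unclear. – Nick

Hey @Nick, I forgot to say a big thank you :) I tested your approach and it works brilliantly! Thank you very much, my friend – rusllonrails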
