0

Elasticsearch cross_fields, edge ngram analyzer

I have 999 documents and I am experimenting with Elasticsearch.

In my type mapping, field f4 is analyzed with the following analyzer settings:

"myNGramAnalyzer" => [ 
     "type" => "custom", 
     "char_filter" => ["html_strip"], 
     "tokenizer" => "standard", 
     "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"] 
    ] 

My filter is as follows:

"filter" => [ 
     "ngram_filter" => [ 
      "type" => "edgeNGram", 
      "min_gram" => "2", 
      "max_gram" => "20" 
     ] 
    ] 

Field f4 holds values such as "Proj1", "Proj2", "Proj3", and so on.
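To make the effect of the edge n-gram filter concrete, here is a minimal Python sketch of what it emits for a single token. This is not Elasticsearch code: the function name edge_ngrams is hypothetical, and the sketch assumes the token has already been lowercased and is left unchanged by the stop and snowball filters.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with length min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("proj1"))   # ['pr', 'pro', 'proj', 'proj1']
print(edge_ngrams("proj42"))  # ['pr', 'pro', 'proj', 'proj4', 'proj42']
```

Under these assumptions, "Proj1" and "Proj42" share the terms pr, pro and proj, so any query that contains one of those prefixes can match both values.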

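Now, when I run a cross_fields search for the string "proj1", I expect the document containing "Proj1" to come back at the top of the response with the highest score. That is not what happens. Apart from f4, the documents are almost identical in content.

I also don't understand why the query matches all 999 documents.

Here is my search: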

{ 
    "index": "myindex", 
    "type": "mytype", 
    "body": { 
     "query": { 
      "multi_match": { 
       "query": "proj1", 
       "type": "cross_fields", 
       "operator": "and", 
       "fields": "f*" 
      } 
     }, 
     "filter": { 
      "term": { 
       "deleted": "0" 
      } 
     } 
    } 
} 

The response to my search is:

{ 
    "took": 12, 
    "timed_out": false, 
    "_shards": { 
     "total": 5, 
     "successful": 5, 
     "failed": 0 
    }, 
    "hits": { 
     "total": 999, 
     "max_score": 1, 
     "hits": [{ 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "42", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "125650","f3": "BH.1511AI.001", 
       "f4": "Proj42", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     }, { 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "47", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "137946","f3": "BH.152096.001", 
       "f4": "Proj47", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     }, 
     //....... 
     //....... 
     //MANY RECORDS IN BETWEEN HERE 
     //....... 
     //....... 
     { 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "1", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "142095","f3": "BH.705215.001", 
       "f4": "Proj1", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     //....... 
     //....... 
     //MANY RECORDS IN BETWEEN HERE 
     //....... 
     //....... 
     }] 
    } 
} 

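Is there anything I am doing wrong or missing? (Apologies for the lengthy question, but I wanted to give all the relevant information while leaving out the unrelated code.)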

EDITED:

Term vector response:

{ 
    "_index": "myindex", 
    "_type": "mytype", 
    "_id": "10", 
    "_version": 1, 
    "found": true, 
    "took": 9, 
    "term_vectors": { 
     "f4": { 
      "field_statistics": { 
       "sum_doc_freq": 5886, 
       "doc_count": 999, 
       "sum_ttf": 5886 
      }, 
      "terms": { 
       "pr": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "pro": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj1": { 
        "doc_freq": 111, 
        "ttf": 111, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj10": { 
        "doc_freq": 11, 
        "ttf": 11, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       } 
      } 
     } 
    } 
} 
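The field statistics above can be reproduced offline. The following is a rough Python sketch, assuming that f4 holds exactly the values "Proj1" through "Proj999" (one per document) and that the analyzer reduces to lowercasing followed by edge n-grams of length 2 to 20. The helper edge_ngrams is hypothetical and not part of any Elasticsearch client.

```python
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with length min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Assumed corpus: one f4 value per document, "Proj1" .. "Proj999".
values = ["Proj%d" % i for i in range(1, 1000)]

doc_freq = Counter()
for value in values:
    # set() so that each term is counted once per document
    doc_freq.update(set(edge_ngrams(value.lower())))

sum_doc_freq = sum(doc_freq.values())
print(doc_freq["pr"], doc_freq["proj1"], doc_freq["proj10"], sum_doc_freq)
# 999 111 11 5886
```

The numbers agree with the term vector output: the short prefixes occur in every document, while proj1 occurs in the 111 documents whose value starts with "Proj1".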

EDITED 2:

Mapping for field f4:

"f4" : { 
    "type" : "string", 
    "index_analyzer" : "myNGramAnalyzer", 
    "search_analyzer" : "standard" 
} 

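I have updated the mapping to use the standard analyzer at query time. This has improved the results, but they are still not what I expect.

Instead of 999 (all documents), the query now returns 111 documents, such as "Proj1", "Proj11", "Proj111", ... "Proj1", "Proj181", and so on.

"Proj1" still appears somewhere in the middle of the results rather than at the top.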
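The jump from 999 to 111 hits can be illustrated with a simplified matching model in Python. It assumes that f4 holds "Proj1" through "Proj999", that a document matches when it contains at least one of the query terms (the grams all share position 0 in the term vector output, so they are treated here as alternatives), and it ignores scoring and every field other than f4. The helper names are hypothetical.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with length min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def search(values, query_terms):
    """Return the values whose indexed terms contain any query term."""
    wanted = set(query_terms)
    return [v for v in values if wanted & set(edge_ngrams(v.lower()))]


# Assumed corpus: one f4 value per document, "Proj1" .. "Proj999".
values = ["Proj%d" % i for i in range(1, 1000)]

# Query text run through the n-gram analyzer: pr, pro, proj, proj1
ngram_hits = search(values, edge_ngrams("proj1"))

# Query text run through the standard analyzer: the single term proj1
standard_hits = search(values, ["proj1"])

print(len(ngram_hits), len(standard_hits))  # 999 111
```

In this model each of the 111 remaining documents contains the term proj1 exactly once, so the term match by itself gives no reason to rank "Proj1" above "Proj11" or "Proj181".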

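You can check the term vectors of one of the documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html – alpert

@alpert I have updated the question with the term vector response. – Abubakkar

Could you change the 'type' of the multi_match search query from 'cross_fields' to 'best_fields' and check again whether the results are what you want? –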

Answers

0

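After spending hours looking for a solution to this problem, I finally got it to work.

I kept everything as described in my question, including the n-gram analyzer used while indexing the data. The only thing I had to change was the search query: I added the all field (_all) in a bool query alongside my existing multi_match query.

Now a search for the text Proj1 returns results in an order such as Proj1, Proj121, Proj11.

This is not the exact order Proj1, Proj11, Proj121, etc., but it is still very close to the result I want.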
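The exact query is not shown in this answer, so the following Python sketch is only one guess at its shape. It assumes the original multi_match clause is kept as a must clause and a match on _all is added as a should clause, so that documents containing the whole token proj1 in _all can score higher. The function build_query and the must/should split are assumptions, not the author's confirmed query.

```python
import json


def build_query(text):
    """Build a search body: the original multi_match plus a clause on _all."""
    return {
        "query": {
            "bool": {
                # Original query from the question, unchanged.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "type": "cross_fields",
                            "operator": "and",
                            "fields": "f*",
                        }
                    }
                ],
                # Assumed addition: optional clause on the _all field.
                "should": [{"match": {"_all": text}}],
            }
        },
        "filter": {"term": {"deleted": "0"}},
    }


print(json.dumps(build_query("proj1"), indent=2))
```

The printed JSON can be sent as the body of the same search request used in the question.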

1

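There is no index_analyzer (at least not from Elasticsearch version 1.7). For mapping parameters, you can use analyzer and search_analyzer. Please try the following steps to get it working.

Create myindex with the analyzer settings: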

PUT /myindex 
{ 
    "settings": { 
    "analysis": { 
     "filter": { 
      "ngram_filter": { 
       "type": "edge_ngram", 
       "min_gram": 2, 
       "max_gram": 20 
      } 
     }, 
     "analyzer": { 
      "myNGramAnalyzer": { 
       "type": "custom", 
       "tokenizer": "standard", 
       "char_filter": "html_strip", 
       "filter": [ 
        "lowercase", 
        "standard", 
        "asciifolding", 
        "stop", 
        "snowball", 
        "ngram_filter" 
       ] 
      } 
     } 
     } 
    } 
} 

Add the mapping to mytype (to keep it short, I only mapped the relevant fields):

PUT /myindex/_mapping/mytype 
{ 
    "properties": { 
     "f1": { 
     "type": "string" 
     }, 
     "f4": { 
     "type": "string", 
     "analyzer": "myNGramAnalyzer", 
     "search_analyzer": "standard" 
     }, 
     "deleted": { 
     "type": "string" 
     } 
    } 
} 

Index some data:

PUT myindex/mytype/1 
{ 
    "f1":"396", 
    "f4":"Proj12" , 
    "deleted": "0" 
} 

PUT myindex/mytype/2 
{ 
    "f1":"42", 
    "f4":"Proj22" , 
    "deleted": "1" 
} 

Now try your query:

GET myindex/mytype/_search 
{ 
    "query": { 
     "multi_match": { 
     "query": "proj1", 
     "type": "cross_fields", 
     "operator": "and", 
     "fields": "f*" 
     } 
    }, 
    "filter": { 
     "term": { 
     "deleted": "0" 
     } 
    } 
} 

It should return document #1. It worked for me in Sense. I am using Elasticsearch version 2.x.

I hope this helps :)

Did you try this with documents whose f4 values are Proj1, Proj11, Proj12, Proj13, Proj121, Proj111? I think it does not work for that case. It already works for the documents you used in your example. – Abubakkar

Also, I know about 'index_analyzer'; I am using an older version that supports it. – Abubakkar

When I index: 'PUT myindex/mytype/_bulk {"index":{"_id":"1"}} {"f1":"396","f4":"Proj1","deleted":"0"} {"index":{"_id":"2"}} {"f1":"396","f4":"Proj11","deleted":"0"} {"index":{"_id":"3"}} {"f1":"396","f4":"Proj13","deleted":"1"} {"index":{"_id":"4"}} {"f1":"396","f4":"Proj121","deleted":"1"} {"index":{"_id":"5"}} {"f1":"396","f4":"Proj111","deleted":"1"}' the documents I get are '#1' and '#2'. Isn't that what you want? –