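Elasticsearch cross_fields, edge ngram analyzer
I have 999 documents that I am using to experiment with Elasticsearch.
There is a field f4 in my type mapping that is analyzed, with the following analyzer settings: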
"myNGramAnalyzer" => [
"type" => "custom",
"char_filter" => ["html_strip"],
"tokenizer" => "standard",
"filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
]
My filter is as follows:
"filter" => [
"ngram_filter" => [
"type" => "edgeNGram",
"min_gram" => "2",
"max_gram" => "20"
]
]
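As a rough illustration of what this filter should produce at index time, here is a minimal Python sketch (not Elasticsearch code). It models only the lowercase and edge n-gram steps and ignores html_strip, asciifolding, stop and snowball, which I assume leave values like "Proj1" unchanged. The function names are made up for the example.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(value):
    # Simplification: only the lowercase and edge n-gram filters are modelled.
    return edge_ngrams(value.lower())


if __name__ == "__main__":
    print(index_terms("Proj1"))
    print(index_terms("Proj42"))
```

Under these assumptions every value of the form "ProjN" shares the terms "pr", "pro" and "proj".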
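The field f4 has values such as "Proj1", "Proj2", "Proj3", and so on.
Now, when I run a cross_fields search for the string "proj1", I expect the document with "Proj1" to come back at the top of the response with the highest score. But that is not the case. The rest of the data is almost identical across all documents.
Also, I don't understand why it matches all 999 documents.
Here is my search: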
{
"index": "myindex",
"type": "mytype",
"body": {
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
}
The response to my search is:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 999,
"max_score": 1,
"hits": [{
"_index": "myindex",
"_type": "mytype",
"_id": "42",
"_score": 1,
"_source": {
"f1": "396","f2": "125650","f3": "BH.1511AI.001",
"f4": "Proj42",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
}, {
"_index": "myindex",
"_type": "mytype",
"_id": "47",
"_score": 1,
"_source": {
"f1": "396","f2": "137946","f3": "BH.152096.001",
"f4": "Proj47",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
},
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
{
"_index": myindex,
"_type": "mytype",
"_id": "1",
"_score": 1,
"_source": {
"f1": "396","f2": "142095","f3": "BH.705215.001",
"f4": "Proj1",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
}]
}
}
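Is there anything I am doing wrong or missing? (Apologies for the lengthy question, but I wanted to give all the relevant information while leaving out unnecessary code.)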
EDITED:
Term vector response:
{
"_index": "myindex",
"_type": "mytype",
"_id": "10",
"_version": 1,
"found": true,
"took": 9,
"term_vectors": {
"f4": {
"field_statistics": {
"sum_doc_freq": 5886,
"doc_count": 999,
"sum_ttf": 5886
},
"terms": {
"pr": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"pro": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj1": {
"doc_freq": 111,
"ttf": 111,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj10": {
"doc_freq": 11,
"ttf": 11,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
}
}
}
}
}
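One possible reading of these numbers, sketched in Python rather than Elasticsearch: if the query string "proj1" is itself run through the edge n-gram analyzer, it expands to several terms at the same position. The sketch assumes (this is not confirmed) that a document matches as soon as any one of those terms matches, and it reuses the doc_freq values from the response above.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of `token` with lengths min_gram..max_gram."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


# doc_freq values copied from the term vector response above
doc_freq = {"pr": 999, "pro": 999, "proj": 999, "proj1": 111, "proj10": 11}

# Terms the query would expand to if analyzed with the edge n-gram analyzer
query_terms = edge_ngrams("proj1")

# Assumption: any single query term is enough for a hit,
# so the most common term decides how many documents match.
widest = max(doc_freq[t] for t in query_terms)

if __name__ == "__main__":
    print(query_terms, widest)
```

If that assumption holds, the term "pr" alone would account for all 999 hits.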
EDITED 2:
Mapping for field f4:
"f4" : {
"type" : "string",
"index_analyzer" : "myNGramAnalyzer",
"search_analyzer" : "standard"
}
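I have updated the mapping to use the standard analyzer at query time. This has improved the results, but they are still not what I expect.
Instead of 999 (all documents), it now returns 111 documents, such as "Proj1", "Proj11", "Proj111", ... "Proj1", "Proj181", and so on.
"Proj1" is still somewhere in the middle of the results and not at the top.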
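The count of 111 is at least consistent with plain prefix matching. Below is a minimal Python sketch, assuming the 999 values are exactly "Proj1" through "Proj999" (the question only shows a few of them) and that the query term "proj1" is matched as-is against the indexed edge n-grams. With min_gram 2 and max_gram 20, a value has the term "proj1" among its n-grams exactly when its lowercased form starts with "proj1". The helper name is hypothetical.

```python
# Assumed data set: Proj1 .. Proj999 (not confirmed by the question)
values = ["Proj%d" % i for i in range(1, 1000)]


def matches(query, value, min_gram=2, max_gram=20):
    """True if the lowercased query equals one of the value's edge n-grams."""
    q = query.lower()
    return min_gram <= len(q) <= max_gram and value.lower().startswith(q)


hits = [v for v in values if matches("proj1", v)]

if __name__ == "__main__":
    print(len(hits))   # Proj1, Proj10..Proj19, Proj100..Proj199
    print(hits[:5])
```

The sketch says nothing about scoring. It only shows that all of these documents match on the same single term.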
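You can check the term vectors for one of the documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html – alpert
@alpert Updated the question with the term vector response – Abubakkar
Could you just change the 'type' of the **multi_match** search query from 'cross_fields' to 'best_fields' and check again whether the results are as desired? –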