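Elasticsearch: different queryNorm across shards

I'm new to ES and have been looking into how it scores documents, in an attempt to improve the quality of my search results. I've run into a case where queryNorm differs widely between shards (about 7x in the example below, which leaves the final score about 5x larger on one shard). I can see that queryNorm depends on the idf of the query terms, which can differ from shard to shard. In my case, however, I have a single search term, and its idf values on the two shards are close to each other (nowhere near far enough apart to explain a gap of that size). I'll briefly describe my setup, including my query and the output of the explain endpoint.

Setup

I have an index of about 6,500 documents spread across 5 shards. None of the fields that appear in the query below has an index-time boost. I'm using ES 2.4 with "query_then_fetch". My query: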
{
"query" : {
"bool" : {
"must" : [ {
"bool" : {
"must" : [ ],
"must_not" : [ ],
"should" : [ {
"multi_match" : {
"query" : "pds",
"fields" : [ "field1" ],
"lenient" : true,
"fuzziness" : "0"
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field2" ],
"lenient" : true,
"fuzziness" : "0",
"boost" : 1000.0
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field3" ],
"lenient" : true,
"fuzziness" : "0",
"boost" : 500.0
}
}, {
"multi_match" : {
"query" : "pds",
"fields" : [ "field4" ],
"lenient" : true,
"fuzziness" : "0",
"boost": 100.0
}
} ]
}
} ],
"must_not" : [ ],
"should" : [ ],
"filter" : [ ]
}
},
"size" : 1000,
"min_score" : 0.0
}
Explain output for 2 documents (one of them has a much larger queryNorm than the other):
{
"_shard" : 4,
"_explanation" : {
"value" : 2.046937,
"description" : "product of:",
"details" : [ {
"value" : 4.093874,
"description" : "sum of:",
"details" : [ {
"value" : 0.112607226,
"description" : "weight(field1:pds in 93) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.112607226,
"description" : "score(doc=93,freq=1.0), product of:",
"details" : [ {
"value" : 0.019996,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 2.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.0017753748,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "fieldWeight in 93, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=93)",
"details" : [ ]
} ]
} ]
} ]
}, {
"value" : 3.9812667,
"description" : "weight(field4:pds in 93) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 3.9812667,
"description" : "score(doc=93,freq=2.0), product of:",
"details" : [ {
"value" : 0.9998001,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 100.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.0017753748,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 3.9820628,
"description" : "fieldWeight in 93, product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(freq=2.0), with freq of:",
"details" : [ {
"value" : 2.0,
"description" : "termFreq=2.0",
"details" : [ ]
} ]
}, {
"value" : 5.6314874,
"description" : "idf(docFreq=11, maxDocs=1232)",
"details" : [ ]
}, {
"value" : 0.5,
"description" : "fieldNorm(doc=93)",
"details" : [ ]
} ]
} ]
} ]
} ]
}, {
"value" : 0.5,
"description" : "coord(2/4)",
"details" : [ ]
} ]
}
},
{
"_shard" : 2,
"_explanation" : {
"value" : 0.4143453,
"description" : "product of:",
"details" : [ {
"value" : 0.8286906,
"description" : "sum of:",
"details" : [ {
"value" : 0.018336227,
"description" : "weight(field1:pds in 58) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.018336227,
"description" : "score(doc=58,freq=1.0), product of:",
"details" : [ {
"value" : 0.0030464241,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 2.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 2.5307006E-4,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "fieldWeight in 58, product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(freq=1.0), with freq of:",
"details" : [ {
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 1.0,
"description" : "fieldNorm(doc=58)",
"details" : [ ]
} ]
} ]
} ]
}, {
"value" : 0.81035435,
"description" : "weight(field4:pds in 58) [PerFieldSimilarity], result of:",
"details" : [ {
"value" : 0.81035435,
"description" : "score(doc=58,freq=2.0), product of:",
"details" : [ {
"value" : 0.1523212,
"description" : "queryWeight, product of:",
"details" : [ {
"value" : 100.0,
"description" : "boost",
"details" : [ ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 2.5307006E-4,
"description" : "queryNorm",
"details" : [ ]
} ]
}, {
"value" : 5.3200364,
"description" : "fieldWeight in 58, product of:",
"details" : [ {
"value" : 1.4142135,
"description" : "tf(freq=2.0), with freq of:",
"details" : [ {
"value" : 2.0,
"description" : "termFreq=2.0",
"details" : [ ]
} ]
}, {
"value" : 6.0189342,
"description" : "idf(docFreq=11, maxDocs=1815)",
"details" : [ ]
}, {
"value" : 0.625,
"description" : "fieldNorm(doc=58)",
"details" : [ ]
} ]
} ]
} ]
} ]
}, {
"value" : 0.5,
"description" : "coord(2/4)",
"details" : [ ]
} ]
}
}
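Note how the queryNorm on field1 for the document on shard 4 is 0.0017753748 (with an idf of 5.6314874), while the queryNorm on the same field for the document on shard 2 is 0.00025307006 (with an idf of 6.0189342), roughly a 7x gap. I tried to compute queryNorm by hand using the formula at http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html, but could not reproduce these values.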
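To see how the numbers in the explain output fit together, here is a minimal Python sketch that rebuilds both scores from the leaf values. It assumes each clause score is queryWeight (boost * idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), and that the document score is coord times the sum of the clause scores, which is how the explain tree above is laid out. The function names are made up for illustration.

```python
def clause_score(boost, idf, query_norm, tf, field_norm):
    """Score of one matching clause: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


def doc_score(idf, query_norm, field4_norm, coord=0.5):
    """coord * (field1 clause + field4 clause), with values copied from explain."""
    # Boosts 2.0 and 100.0 and the tf values are the ones explain reports.
    field1 = clause_score(2.0, idf, query_norm, tf=1.0, field_norm=1.0)
    field4 = clause_score(100.0, idf, query_norm, tf=1.4142135, field_norm=field4_norm)
    return coord * (field1 + field4)


shard4_score = doc_score(idf=5.6314874, query_norm=0.0017753748, field4_norm=0.5)
shard2_score = doc_score(idf=6.0189342, query_norm=2.5307006e-4, field4_norm=0.625)

print("shard 4 score:", shard4_score)
print("shard 2 score:", shard2_score)
print("score ratio:", shard4_score / shard2_score)
print("queryNorm ratio:", 0.0017753748 / 2.5307006e-4)
```

Under these assumptions the two results come out close to the reported 2.046937 and 0.4143453. The score gap is a bit under 5x while the queryNorm gap is about 7x; the higher idf and fieldNorm on shard 2 offset part of it.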
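I haven't found many threads or posts on how queryNorm is computed. One that I found useful is http://www.openjems.com/tag/querynorm/ (it is actually about Solr, but since the search type is "query_then_fetch", the Lucene calculation should be the only thing that matters, so I'd expect the two to behave similarly). However, I couldn't arrive at the correct queryNorm value with that approach either (as far as I can tell, t.getBoost() should be 1, since there are no index-time field boosts and no special field boosts in the query above).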
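A minimal sketch of that manual calculation, assuming the classic TF-IDF definition queryNorm = 1 / sqrt(sumOfSquaredWeights), where sumOfSquaredWeights adds up (boost * idf)^2 over the query's clauses and the top-level query boost is 1. It uses the boosts as explain reports them (2.0 and 100.0) rather than t.getBoost() = 1, and it first assumes that only the two clauses visible in the explain output contribute. The helper names are hypothetical, and the final step is a guess, not something the explain output shows.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def sum_of_squared_weights(clauses):
    """Sum of (boost * idf)^2 over (boost, idf) pairs."""
    return sum((boost * idf) ** 2 for boost, idf in clauses)


def query_norm(clauses):
    return 1.0 / math.sqrt(sum_of_squared_weights(clauses))


# (boost, idf) pairs for the two clauses that appear in the explain output.
visible = {
    "shard 4": [(2.0, 5.6314874), (100.0, 5.6314874)],
    "shard 2": [(2.0, 6.0189342), (100.0, 6.0189342)],
}
reported = {"shard 4": 0.0017753748, "shard 2": 2.5307006e-4}


def unexplained(shard):
    """sumOfSquaredWeights implied by the reported queryNorm, minus the visible part."""
    return 1.0 / reported[shard] ** 2 - sum_of_squared_weights(visible[shard])


for shard in visible:
    print(shard, "computed:", query_norm(visible[shard]),
          "reported:", reported[shard], "unexplained:", unexplained(shard))

# Guess (not shown by explain): one extra clause with boost 500 whose term
# occurs in a single document on shard 2 (docFreq=1, maxDocs=1815).
guess = (500.0 * classic_idf(1, 1815)) ** 2
print("guessed extra clause:", guess, "shard 2 leftover:", unexplained("shard 2"))
```

With the two-clause assumption the shard 4 value lands very close to the reported queryNorm, while the shard 2 value does not. The leftover on shard 2 is about what a single extra clause with boost 500 and docFreq=1 would add, which would be consistent with one of the boosted clauses finding a term on shard 2 only. That is a hypothesis to check against the index, not a confirmed explanation.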
Does anyone have any suggestions as to what might be going on here?
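I tried the "dfs_query_then_fetch" option and the final scores didn't change much. Unfortunately, because of a bug in the explain endpoint, https://github.com/elastic/elasticsearch/issues/15369 (which, as far as I can see, was fixed in August 2016; my version is older than that), I can't seem to see the updated explanation. My gut feeling is that something else is affecting the score as well. –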
Could you post your request and response with the 'dfs_query_then_fetch' option? And which ES version are you on? – Random