2013-03-21 51 views
3

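MySQL fulltext - unexpected results

I have a database with 80,000 rows, and when I tested some FULLTEXT queries I got results I did not expect. I have removed the stopwords from MySQL and set the minimum word length to 3.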

When I run this query:

SELECT `sentence`, MATCH (`sentence`) AGAINST ('CAN YOU FLY') AS `relevance` 
FROM `sentences` 
WHERE MATCH (`sentence`) AGAINST ('CAN YOU FLY') 
ORDER BY `relevance` DESC 

It gives these results:

NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS | 10.623517036438
I CAN FLY                                         | 7.61278629302979
I CAN FLY :)                                      | 7.61278629302979
CAN YOU FLY?                                      | 7.61278629302979
THEY CAN FLY                                      | 7.61278629302979
YOU AM NOT FLY                                    | 7.61278629302979
CAN YOU FLY                                       | 7.61278629302979
HAVE YOU EVER SWALLOWED A FLY?                    | 7.52720737457275
I JUST WANNA FLY                                  | 7.52720737457275

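Why does "NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS" get the highest relevance when it contains only one of the words? And why is "CAN YOU FLY", which matches exactly, not at the top?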

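I want the results sorted by the most matched keywords first, then by the most keywords in the right order, then by the fewest words. That would give the logical result: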

CAN YOU FLY 
CAN YOU FLY? 
I CAN FLY 
THEY CAN FLY 
I CAN FLY :) 
YOU AM NOT FLY 
HAVE YOU EVER SWALLOWED A FLY? 
I JUST WANNA FLY 
NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS 
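
To pin that ordering down, here is a minimal Python sketch of re-ranking the rows after MySQL has returned them. It is only an illustration under a few assumptions: the names `tokenize`, `lcs_length`, `rank_key` and `rerank` are made up, "most in order" is read as the longest run of query words that appear in the same order (a longest common subsequence), and sentence length in characters is added as a last tie-breaker that is not part of the three criteria above.

```python
import re

def tokenize(text):
    # Upper-case and keep only alphanumeric runs, so "FLY?" becomes "FLY"
    # and ":)" is dropped. This is an assumption, not MySQL's own tokenizer.
    return re.findall(r"[A-Z0-9]+", text.upper())

def lcs_length(a, b):
    # Length of the longest common subsequence of two word lists.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rank_key(query, sentence):
    q = tokenize(query)
    w = tokenize(sentence)
    matched = len(set(q) & set(w))   # 1) most matched keywords
    in_order = lcs_length(q, w)      # 2) most keywords in query order
    # 3) fewest words, then (assumed tie-breaker) fewest characters
    return (-matched, -in_order, len(w), len(sentence))

def rerank(query, sentences):
    return sorted(sentences, key=lambda s: rank_key(query, s))

if __name__ == "__main__":
    rows = [
        "NO A FLY WITHOUT WINGS WOULD BE CALLED A WINGLESS",
        "I CAN FLY",
        "CAN YOU FLY?",
        "CAN YOU FLY",
        "I JUST WANNA FLY",
    ]
    for s in rerank("CAN YOU FLY", rows):
        print(s)
```

Sentences that tie on all four parts of the key keep the order in which they were passed in, because Python's `sorted` is stable.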

Answer

1

The formula used for the calculation is given in the MySQL Internals Manual:

w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf) 

where

dtf  is the number of times the term appears in the document 
sumdtf is the sum of (log(dtf)+1)'s for all terms in the same document 
U  is the number of Unique terms in the document 
N  is the total number of documents 
nf  is the number of documents that contain the term 

The first text obviously has more content than the others. The formula depends heavily on U, the number of unique terms in the document.
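
To get a feel for how each factor moves the weight, here is a minimal Python sketch of the formula above. It assumes `log` is the natural logarithm; the function name `term_weight` and the sample numbers are made up for illustration, and it does not try to reproduce the exact scores MySQL returned for your table.

```python
import math

def term_weight(dtf, sumdtf, U, N, nf):
    """Weight of one term in one document, following the formula above.

    Assumes natural logarithms; argument names mirror the symbols
    dtf, sumdtf, U, N and nf from the definition list.
    """
    local = (math.log(dtf) + 1) / sumdtf   # term frequency part
    norm = U / (1 + 0.0115 * U)            # unique-term normalization
    idf = math.log((N - nf) / nf)          # rarity of the term in the table
    return local * norm * idf

if __name__ == "__main__":
    N = 80_000  # total rows, as in the question
    # Hypothetical document: 3 unique terms, each appearing once,
    # so dtf = 1 and sumdtf = 3.
    for nf in (100, 10_000, 40_000, 60_000):
        print(nf, term_weight(dtf=1, sumdtf=3, U=3, N=N, nf=nf))
```

With everything else fixed, the weight shrinks as `nf` grows. It reaches zero when the term is in exactly half of the rows and turns negative beyond that, which is why a very common word adds little or nothing to the relevance.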

Based on your comments, I would suggest using Boolean Fulltext Search:

SELECT `sentence`, MATCH (`sentence`) AGAINST ('CAN YOU FLY' IN BOOLEAN MODE) AS `relevance` 
FROM `sentences` 
WHERE MATCH (`sentence`) AGAINST ('CAN YOU FLY' IN BOOLEAN MODE) 
ORDER BY `relevance` DESC 
+0

Wow, they seriously need to rethink their formula if even the exact phrase is not at the top of the results... – Lenton 2013-03-21 23:33:06

+0

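@user1970772 This is a full-text search; it is not designed for 3-word documents. For example, 'FLY' appears in all of the documents, so it is not relevant: it increases the value of `nf`. – Tchoupi 2013-03-21 23:35:20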

+0

Do you know of any alternative to FULLTEXT that would give the results I want? – Lenton 2013-03-21 23:53:11