Finding the similarity between two documents

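Is there a built-in algorithm in Lucene for finding the similarity between two documents? When I went through the default similarity class, I saw that it produces a score by comparing a query against a document.

I have already indexed my documents using the Snowball analyzer; the next step is to find the similarity between two of those documents.

Can anyone suggest a solution?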

2012-01-13 CTsiddharth

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene – Mikos 2012-02-16 21:07:04

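There does not seem to be a built-in algorithm. I believe there are three ways to approach this:

a) Run a MoreLikeThis query on one of the documents. Iterate over the results, check the doc IDs, and read off the score. This is perhaps not very pretty, and you may have to retrieve a lot of documents before the one you are interested in shows up.

b) Cosine similarity: the answer Mikos linked in his comment explains how to compute the cosine similarity between two documents.

c) Compute your own Lucene similarity score. The Lucene score adds a few factors to cosine similarity (http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).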
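To make option b) concrete, here is a minimal sketch of cosine similarity in plain Java. It assumes the term frequencies of each document have already been read (for example from its term vector) into a map from term to frequency; the class DocCosine and its methods are hypothetical helpers, not part of Lucene, and raw frequencies are used without any IDF weighting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Lucene class.
final class DocCosine {

    /** Cosine similarity between two term -> frequency maps (raw counts, no IDF). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        // Only terms present in both documents contribute to the dot product.
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // An empty document has no direction; treat its similarity as 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 2);
        d1.put("index", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("lucene", 1);
        d2.put("search", 1);
        System.out.println(cosine(d1, d2));
    }
}
```

Two documents with proportional term counts score 1, and documents sharing no terms score 0.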

For option c), the score itself can be obtained with

DefaultSimilarity ds = new DefaultSimilarity(); 
SimScorer scorer = ds.simScorer(stats , arc); 
scorer.score(otherDocId, freq);

You can obtain the stats and arc parameters with, for example,

// reader is your IndexReader instance; leaves() is an instance method, not a static one
AtomicReaderContext arc = reader.leaves().get(0);
SimWeight stats = ds.computeWeight(1, collectionStats, termStats); 
stats.normalize(1, 1);

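Here, in turn, the term statistics can be obtained from the term vector of the first of your two documents, and the collection statistics from your IndexReader. To get the freq parameter, use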

DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, null, field, term);

iterate over the documents until you reach the doc ID of your first document, and then call

freq = docsEnum.freq();

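Note that you need to call scorer.score for every term in your first document (or for every term you want to take into account) and sum up the results.

Finally, to multiply the summed score by the queryNorm and coord factors, you can use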

//sumWeights was computed while iterating over the first termvector 
//in the main loop by summing up "stats.getValueForNormalization();" 
float queryNorm = ds.queryNorm(sumWeights); 
//thisTV and otherTV are termvectors for the two documents. 
//overlap can be easily calculated 
float coord = ds.coord(overlap, (int) Math.min(thisTV.size(), otherTV.size())); 
return coord * queryNorm * score;
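The overlap used above is simply the number of terms that appear in both term vectors. A minimal sketch in plain Java, assuming the terms of each document have already been collected into a set of strings; the class CoordFactor and its methods are hypothetical names, and the coord method here only mirrors, to my understanding, what DefaultSimilarity.coord computes (overlap divided by maxOverlap):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene class.
final class CoordFactor {

    /** Number of distinct terms that occur in both documents. */
    static int overlap(Set<String> thisTerms, Set<String> otherTerms) {
        Set<String> shared = new HashSet<>(thisTerms); // copy so the inputs stay untouched
        shared.retainAll(otherTerms);
        return shared.size();
    }

    /** Fraction of the possible overlap that was actually matched. */
    static float coord(int overlap, int maxOverlap) {
        return maxOverlap == 0 ? 0f : overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        Set<String> thisTerms = new HashSet<>(Arrays.asList("lucene", "index", "score"));
        Set<String> otherTerms = new HashSet<>(Arrays.asList("lucene", "score", "query", "term"));
        int overlap = overlap(thisTerms, otherTerms);
        // Same maxOverlap choice as in the snippet above: the smaller term vector.
        int maxOverlap = Math.min(thisTerms.size(), otherTerms.size());
        System.out.println(overlap + " / " + maxOverlap + " = " + coord(overlap, maxOverlap));
    }
}
```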

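So this is an approach that should work. It is not elegant, and because term frequencies are awkward to obtain (you have to iterate over a DocsEnum for every term), it is not very efficient either. I still hope it helps someone :)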


2015-01-22 01:39:04
