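I have a file containing some phrases. Using Jaro-Winkler through Lucene, I want to get the phrases from that file that are most similar to my input. However, JaroWinklerDistance in Lucene returns strange results.
Here is an example of my problem.
We have a file that contains: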
//phrases.txt
this is goodd
this is good
this is god
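If my input is "this is good", it should give me "this is good" from the file first, because the similarity score is the highest there (1). But for some reason, it returns only "this is goodd" and "this is god"!
Here is my code: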
try {
    // Spell checker backed by an in-memory index, scoring candidates with Jaro-Winkler
    SpellChecker spellChecker = new SpellChecker(new RAMDirectory(), new JaroWinklerDistance());
    // One dictionary entry per line of the text file
    Dictionary dictionary = new PlainTextDictionary(new File("src/main/resources/words.txt").toPath());
    IndexWriterConfig iwc = new IndexWriterConfig(new ShingleAnalyzerWrapper());
    spellChecker.indexDictionary(dictionary, iwc, false);

    String wordForSuggestions = "this is good";
    int suggestionsNumber = 5;
    // Ask for up to 5 suggestions with a minimum similarity of 0.8
    String[] suggestions = spellChecker.suggestSimilar(wordForSuggestions, suggestionsNumber, 0.8f);
    if (suggestions != null && suggestions.length > 0) {
        for (String word : suggestions) {
            System.out.println("Did you mean:" + word);
        }
    } else {
        System.out.println("No suggestions found for word:" + wordForSuggestions);
    }
} catch (IOException e) {
    e.printStackTrace();
}
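
To show which scores I expect, here is a minimal standalone sketch of the textbook Jaro-Winkler similarity. It is not Lucene's implementation, and Lucene's JaroWinklerDistance may differ in details; the class name JaroWinklerSketch, its method names, and the fixed prefix scale of 0.1 are my own assumptions for illustration. With it, an identical phrase scores exactly 1, while "this is goodd" and "this is god" score a little below 1.

```java
public class JaroWinklerSketch {

    // Plain Jaro similarity in [0, 1]; hypothetical helper, not a Lucene API.
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a common prefix of up to 4 characters.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        String input = "this is good";
        for (String phrase : new String[] {"this is goodd", "this is good", "this is god"}) {
            System.out.println(phrase + " -> " + jaroWinkler(input, phrase));
        }
    }
}
```

So by the metric alone, "this is good" should rank first, and all three phrases are above the 0.8 threshold I pass to suggestSimilar.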