2011-04-14 115 views

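Adjusting Lucene search result scores by weighting specific fields that share the same name

I'm currently using Lucene as our full-text search engine, but we need to order the search results according to a specific field.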

For example, suppose our index contains the following three documents, which are identical except for the id field.

import org.apache.lucene.document.{Document, Field}

// Document 1: two fields that share the name "contents"
val document01 = new Document()
val field0100 = new Field("id", "1", Field.Store.YES, Field.Index.ANALYZED)
val field0101 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0102 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document01.add(field0100)
document01.add(field0101)
document01.add(field0102)

// Document 2: same contents, different id
val document02 = new Document()
val field0200 = new Field("id", "2", Field.Store.YES, Field.Index.ANALYZED)
val field0201 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0202 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document02.add(field0200)
document02.add(field0201)
document02.add(field0202)

// Document 3: same contents, different id
val document03 = new Document()
val field0300 = new Field("id", "3", Field.Store.YES, Field.Index.ANALYZED)
val field0301 = new Field("contents", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
val field0302 = new Field("contents", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED)
document03.add(field0300)
document03.add(field0301)
document03.add(field0302)

Now, when I search for Linux with IndexSearcher, I get the following results:

Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 

When I search for Windows, I get the same results in the same order.

Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 

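The question is: is it possible to weight a specific field when building the index? For example, I would like field0201 to score higher when it matches the search.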

In other words, when I search for Linux, I want the results in the following order:

Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 

When I search for Windows, the results should stay in the original order, as shown below:

Document<stored,indexed,tokenized<id:1> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:2> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 
Document<stored,indexed,tokenized<id:3> stored,indexed,tokenized<contents:This is a test: Linux> stored,indexed,tokenized<contents:This is a test: Windows>> 

I tried field0201.setBoost(), but that changes the order of the results both when I search for Linux and when I search for Windows.

It looks like the documents all contain the same data apart from the id. Why would you expect the scores to differ? – huynhjl 2011-04-14 02:38:11

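@huynhjil Because the contents come from different sources. If a field from a particular source matches the search term, I want it to score higher. In other words, the result should be the same as sorting by the pair (score computed by Lucene, source of the field). – 2011-04-14 02:44:22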

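Could you sort with a Sort instance passed to a TopFieldCollector? ... Or do you explicitly want to do this through the scores of your fields (which will only work if the contents are not identical)? – csupnig 2011-04-14 07:03:43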

Answer

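I think it should be possible if you put the data from different sources into fields with different names. You can set a boost at index time, but if you use the same name, I think the boost will apply to all fields with that name, based on the setBoost javadoc. So if you do this instead: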

val field0201 = new Field("content-high", "This is a test: Linux", ...) 
field0201.setBoost(1.5f) 
val field0202 = new Field("content-low", "This is a test: Windows", ...) 

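Then query with content-high:Linux content-low:Linux (using a BooleanQuery with two clauses, both set to the term Linux). The boost on content-high should then raise the document's score if the match is in that field. Use explain to check whether it works.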
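
For reference, here is a minimal sketch of how that could look end to end. It is not tested against the asker's index and rests on several assumptions: Lucene 3.1 to 3.6 (IndexWriterConfig, Version.LUCENE_31), a StandardAnalyzer, an in-memory RAMDirectory, and the hypothetical names SourceBoostSketch, makeDoc, bothFields and rankedIds. The boost of 1.5f is just the value used above. Because TermQuery does not analyze its text and StandardAnalyzer lowercases tokens, the sketch searches for "linux" and "windows" in lower case. As far as I understand the 3.x setBoost javadoc, boosts of fields with the same name in one document are multiplied into a single norm for that field name, which would explain why boosting field0201 also affected the Windows search.

```scala
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field}
import org.apache.lucene.index.{IndexReader, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.{BooleanClause, BooleanQuery, IndexSearcher, TermQuery}
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

object SourceBoostSketch {

  // Hypothetical helper: one document whose two sources live in differently named fields.
  // Only the "content-high" field carries the index-time boost.
  def makeDoc(id: String, boostHigh: Float): Document = {
    val doc = new Document()
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED))
    val high = new Field("content-high", "This is a test: Linux", Field.Store.YES, Field.Index.ANALYZED)
    high.setBoost(boostHigh)
    doc.add(high)
    doc.add(new Field("content-low", "This is a test: Windows", Field.Store.YES, Field.Index.ANALYZED))
    doc
  }

  // The same word is searched in both fields; either clause may match (SHOULD).
  def bothFields(word: String): BooleanQuery = {
    val query = new BooleanQuery()
    query.add(new TermQuery(new Term("content-high", word)), BooleanClause.Occur.SHOULD)
    query.add(new TermQuery(new Term("content-low", word)), BooleanClause.Occur.SHOULD)
    query
  }

  // Builds a throwaway in-memory index and returns the ids of the hits in ranked order.
  def rankedIds(word: String): Seq[String] = {
    val dir = new RAMDirectory()
    val analyzer = new StandardAnalyzer(Version.LUCENE_31)
    val writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_31, analyzer))
    writer.addDocument(makeDoc("1", 1.0f))
    writer.addDocument(makeDoc("2", 1.5f)) // only document 2 gets the boosted field
    writer.addDocument(makeDoc("3", 1.0f))
    writer.close()

    val reader = IndexReader.open(dir)
    val searcher = new IndexSearcher(reader)
    try {
      val query = bothFields(word)
      val hits = searcher.search(query, 10).scoreDocs
      // Print the score breakdown so the effect of the boost on the field norm is visible.
      hits.foreach(hit => println(searcher.explain(query, hit.doc)))
      hits.map(hit => searcher.doc(hit.doc).get("id")).toSeq
    } finally {
      searcher.close()
      reader.close()
    }
  }

  def main(args: Array[String]): Unit = {
    println(rankedIds("linux"))
    println(rankedIds("windows"))
  }
}
```

With this layout the boost only enters the norm of content-high, so a Windows query, which matches only content-low, should score all three documents equally and leave them in index order. The explain output is the place to confirm that on a real index.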

Thanks, this seems to work fine! – 2011-04-15 02:05:25