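Solr custom filter TokenStream problem

For indexing and querying I need to apply the transformations listed below, so I wrote a custom filter. How can I concatenate the tokens and pass the result on to the NGramFilterFactory filter? Please tell me what needs to be improved in the code.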

Here is the configuration from schema.xml:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="custom_stop_words.txt"/>
<filter class="intuit.ripple.solr.ConcatFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>

Here is an example of the case I am trying to solve:

1. Input value: "foo Bar Baz qux" 
2. WhitespaceTokenizerFactory: "foo", "Bar", "Baz", "qux" 
3. LowerCaseFilterFactory: "foo", "bar", "baz", "qux" 
4. The next three filters (trim, pattern-based cleanup, and StopFilterFactory) have nothing to do in this case.
5. ConcatFilterFactory: "foobarbazqux". It should concatenate the tokens into one.
6. NGramFilterFactory: this will generate the 3-character n-grams of that token.
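
To make the expected result of steps 2 to 6 concrete, here is a minimal plain-Java sketch that mimics the chain on the example input. It does not use Solr or Lucene at all; the class and method names (`ConcatNGramDemo`, `concatTokens`, `nGrams`) are made up for illustration, and the stop-word and pattern steps are left out because they do nothing for this input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java walk-through of the intended analysis chain (no Solr/Lucene involved).
class ConcatNGramDemo {

    // Steps 2, 3 and 5: split on whitespace, lowercase each token, join without a separator.
    static String concatTokens(String input) {
        StringBuilder joined = new StringBuilder();
        for (String token : input.trim().split("\\s+")) {
            joined.append(token.toLowerCase(Locale.ROOT));
        }
        return joined.toString();
    }

    // Step 6: every substring of exactly `size` characters, in order of position.
    static List<String> nGrams(String term, int size) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start + size <= term.length(); start++) {
            grams.add(term.substring(start, start + size));
        }
        return grams;
    }

    public static void main(String[] args) {
        String term = concatTokens("foo Bar Baz qux");
        System.out.println(term);            // foobarbazqux
        System.out.println(nGrams(term, 3)); // [foo, oob, oba, bar, arb, rba, baz, azq, zqu, qux]
    }
}
```

With minGramSize and maxGramSize both set to 3, the twelve-character token yields ten trigrams.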

Here is the incrementToken() method of ConcatFilter:

@Override
public boolean incrementToken() throws IOException {

    StringBuilder builder = new StringBuilder();

    while (input.incrementToken()) {
        int len = charTermAtt.length();
        char[] buffer = charTermAtt.buffer();
        builder.append(buffer, 0, len);
        System.out.println("Tokens: " + new String(buffer, 0, len));
        clearAttributes();
        charTermAtt.setEmpty();
    }
    System.out.println("Concat tokens: " + builder.toString());

    charTermAtt.copyBuffer(builder.toString().toCharArray(), 0, builder.length());
    charTermAtt.setLength(builder.length());
    posIncAtt.setPositionIncrement(1);
    setOffsetAttr.setOffset(0, builder.length());

    input.end();
    input.close();
    return false;
}

Here I use a while loop to fetch all the tokens and join them together. Is there a way to get all the tokens at once, without a loop?

2014-12-05 YoungHobbit

Possible duplicate of (http://stackoverflow.com/questions/27560110/solr-custom-filter-for-cancatnating-tokens) – YoungHobbit 2015-08-15 12:31:50

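I think you want to do something different from what you have implemented :D

Your incrementToken method simply iterates over the entire input (in this case, the output of the StopFilter). On each call to incrementToken you should take only a single token (or more, if needed) from the input and produce a single output token.

So I think you do not want a "while" loop here, nor a call to "clearAttributes()" on every iteration.

I also guess that your output looks like this: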

Tokens: foo 
Tokens: bar 
Concat tokens: foo bar

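But in fact, from the two tokens "foo" and "bar" you are producing the single token "foobar", which I guess is not your intention. Please describe what your ConcatFilterFactory is supposed to do. At the moment it only merges multiple tokens into a single token.

There is a discussion with a TokenFilter example here: http://search-lucene.com/m/ukJmjphJte/tokenfilter&subj=custom+TokenFilter. You can use this search box to find more information about Solr/Lucene: http://search-lucene.com/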

2014-12-08 09:55:33

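No, I do want to concatenate the tokens coming from the StopFilter. For example, if the tokens from the StopFilter are "foo" and "bar", I want to concatenate them into the single token "foobar" and pass it on to "NGramFilterFactory". – YoungHobbit 2014-12-08 10:32:42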

So return 'true' on the first call to incrementToken, and 'false' on the second call. – 2014-12-08 18:13:57

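But how do I retrieve all the tokens (e.g. foo and bar) in the first call and concatenate them? I am not very familiar with TokenStream and filters. Could you provide me with some (pseudo)code? – YoungHobbit 2014-12-09 04:07:42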
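
A rough sketch of the "true on the first call, false on the second" idea, in case it helps. To keep it runnable without Lucene on the classpath, it uses a made-up stand-in interface (`TermSource`) instead of the real TokenStream and CharTermAttribute, so treat every name here as hypothetical. The shape is what matters: drain the input inside the first call, emit one token, and remember that you are done.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for an upstream token stream; NOT a Lucene type.
interface TermSource {
    boolean incrementToken(); // advance to the next token, false when exhausted
    String term();            // text of the current token
}

// Drains the whole input on the first call and emits one concatenated token.
class ConcatSketch implements TermSource {
    private final TermSource input;
    private final StringBuilder joined = new StringBuilder();
    private boolean done = false; // in a real TokenFilter, clear this flag in reset()

    ConcatSketch(TermSource input) {
        this.input = input;
    }

    @Override
    public boolean incrementToken() {
        if (done) {
            return false; // second and later calls: nothing left to emit
        }
        done = true;
        joined.setLength(0);
        while (input.incrementToken()) {
            joined.append(input.term());
        }
        // Emit a token only if the input produced any text.
        return joined.length() > 0;
    }

    @Override
    public String term() {
        return joined.toString();
    }

    // Demo helper: wraps fixed strings as a TermSource.
    static TermSource of(String... terms) {
        Iterator<String> it = Arrays.asList(terms).iterator();
        return new TermSource() {
            private String current;

            public boolean incrementToken() {
                if (!it.hasNext()) {
                    return false;
                }
                current = it.next();
                return true;
            }

            public String term() {
                return current;
            }
        };
    }

    public static void main(String[] args) {
        ConcatSketch filter = new ConcatSketch(of("foo", "bar"));
        System.out.println(filter.incrementToken() + " " + filter.term()); // true foobar
        System.out.println(filter.incrementToken());                       // false
    }
}
```

In an actual Lucene TokenFilter the same structure should carry over: read each term from the CharTermAttribute inside the loop, then after the loop call clearAttributes() once, set the concatenated term, offsets and position increment, and return true. The calls to input.end() and input.close() do not belong in incrementToken(); the consumer of the stream triggers those through end() and close().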

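I don't think you can get all the tokens at once; you need a loop, as in your code. But you can take a different approach. Instead of solr.WhitespaceTokenizerFactory, use solr.KeywordTokenizerFactory. KeywordTokenizerFactory puts only one token into the token stream, which is the exact input value. Then, in your ConcatFilter, you just take the first and only token from the stream and replace all the whitespace in it with the empty string. In that case you need to put the StopFilter after the NGramFilter. Using your example, you would have: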

1. Input value: "foo Bar Baz qux" 
2. KeywordTokenizerFactory: "foo Bar Baz qux" 
3. LowerCaseFilterFactory: "foo bar baz qux" 
4. ConcatFilterFactory: "foobarbazqux". 
5. NGramFilterFactory: This will generate the token. 
6. StopFilterFactory cuts all unwanted tokens.
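
As an illustration of step 4 in this variant, here is a small sketch of stripping whitespace in place from a character buffer. It is plain Java and the class and method names are hypothetical. In a real filter, the buffer and length would presumably come from CharTermAttribute.buffer() and length(), and the returned value would be applied with setLength().

```java
// Hypothetical helper, not part of Solr or Lucene.
class WhitespaceStripper {

    // Removes whitespace from the first `length` chars of `buffer` in place
    // and returns the new logical length.
    static int stripWhitespace(char[] buffer, int length) {
        int write = 0;
        for (int read = 0; read < length; read++) {
            if (!Character.isWhitespace(buffer[read])) {
                buffer[write++] = buffer[read];
            }
        }
        return write;
    }

    public static void main(String[] args) {
        char[] buffer = "foo bar baz qux".toCharArray();
        int newLength = stripWhitespace(buffer, buffer.length);
        System.out.println(new String(buffer, 0, newLength)); // foobarbazqux
    }
}
```

Working on the existing buffer avoids allocating a new String for every token, which is the usual reason analysis code operates on the char array directly.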

2016-05-05 11:03:12
