Lucene 6.1 custom Tokenizer and Analyzer

I'm looking for some help with the Lucene 6.1 API.

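I'm trying to extend Lucene's Tokenizer and Analyzer, but I can't make sense of the guides. In every tutorial, the custom Tokenizer overrides incrementToken and takes a Reader in its constructor, and the custom Analyzer overrides createComponents. But in Lucene 6.1 createComponents takes only a single String parameter, so how do I pass a Reader to the tokenizer from my Analyzer?

My code: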

public class ChemTokenizer extends Tokenizer{ 
    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class); 
    protected String stringToTokenize; 
    protected int position = 0; 
    protected List<int[]> chemicals = new ArrayList<>(); 

    @Override 
    public boolean incrementToken() throws IOException { 
     // Clear anything that is already saved in this.charTermAttribute 
     this.charTermAttribute.setEmpty(); 

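     // Get the position of the next separator (any non-letter character) 
     int nextIndex = -1; 
     Pattern p = Pattern.compile("[^A-Za-zА-Яа-я]"); 
     Matcher m = p.matcher(stringToTokenize); 
     // find() must succeed before start() may be called; search from the current position 
     if (m.find(position)) { 
      nextIndex = m.start(); 
     } 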
     // Did we lose chemicals? 
     for (int[] pair: chemicals) { 
      if (pair[0] < nextIndex && pair[1] > nextIndex) { 
       //We are in the chemical name 
       if (position == pair[0]) { 
        nextIndex = pair[1]; 
       } 
       else { 
        nextIndex = pair[0]; 
       } 
      } 
     } 
     // Next separator was found 
     if (nextIndex != -1) { 
      String nextToken = stringToTokenize.substring(position, nextIndex); 
      charTermAttribute.append(nextToken); 
      position = nextIndex + 1; 
      return true; 
     } 
     // Last part of text 
     else if (position < stringToTokenize.length()) { 
      String nextToken = stringToTokenize.substring(position); 
      charTermAttribute.append(nextToken); 
      position = stringToTokenize.length(); 
      return true; 
     } 
     else { 
      return false; 
     } 
    } 
    public ChemTokenizer(Reader reader,List<String> additionalKeywords) { 
     int numChars; 
     char[] buffer = new char[1024]; 
     StringBuilder stringBuilder = new StringBuilder(); 
     try { 
      while ((numChars = 
        reader.read(buffer, 0, buffer.length)) != -1) { 
       stringBuilder.append(buffer, 0, numChars); 
      } 
     } 
     catch (IOException e) { 
      throw new RuntimeException(e); 
     } 
     stringToTokenize = stringBuilder.toString(); 
     //Checking for keywords 
     //Doesn't work properly if text has chemical synonyms 
     for (String keyword: additionalKeywords) { 
      int[] tmp = new int[2]; 
      //Start of keyword 
      tmp[0] = stringToTokenize.indexOf(keyword); 
      tmp[1] = tmp[0] + keyword.length() - 1; 
      chemicals.add(tmp); 
     } 
    } 

    /* Reset the stored position for this object when reset() is called. 
    */ 
    @Override 
    public void reset() throws IOException { 
     super.reset(); 
     position = 0; 
     chemicals = new ArrayList<>(); 

    } 
}

And the Analyzer code:

public class ChemAnalyzer extends Analyzer{ 

    List<String> additionalKeywords; 
    public ChemAnalyzer(List<String> ad) { 
     additionalKeywords = ad; 
    } 
    @Override 
    protected TokenStreamComponents createComponents(String s, Reader reader) { 
     Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords); 
     TokenStream filter = new LowerCaseFilter(tokenizer); 
     return new TokenStreamComponents(tokenizer, filter); 
    } 

}

The problem is that this code doesn't work with Lucene 6.

2016-07-30 01ghost13

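What do you mean, it doesn't work with Lucene 6? A compile error? A runtime error? Unwanted behavior? – Mysterion

In Lucene 6, createComponents has a different signature. – 01ghost13

This works; it's what I found through a GitHub search. I guess you have to create the tokenizer without a Reader.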

@Override 
protected TokenStreamComponents createComponents(String fieldName) { 
    return new TokenStreamComponents(new WhitespaceTokenizer()); 
}
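To carry the question's keyword handling over, the scanning logic can be kept separate from where the Reader comes from. The following is only a minimal sketch in plain Java, with no Lucene classes involved: `ChemScanner`, `readAll`, `tokenPattern` and `tokenize` are hypothetical names made up for illustration. It also swaps the question's separator search for a single regular expression that matches either a whole keyword or a run of letters (`\p{L}+`), which is an assumption about the intended behavior rather than a copy of the original logic.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: it only shows the scanning logic.
final class ChemScanner {

    // Drain a Reader into a String, as the question's constructor does.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    // One pattern: any keyword (longest first), otherwise a run of letters.
    public static Pattern tokenPattern(List<String> keywords) {
        String alternatives = keywords.stream()
                .filter(k -> !k.isEmpty())
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote) // keywords are matched literally
                .collect(Collectors.joining("|"));
        String letters = "\\p{L}+";
        return Pattern.compile(alternatives.isEmpty() ? letters : alternatives + "|" + letters);
    }

    // Return the tokens of the text, keeping each keyword as a single token.
    public static List<String> tokenize(String text, List<String> keywords) {
        List<String> tokens = new ArrayList<>();
        Matcher m = tokenPattern(keywords).matcher(text);
        // find() must succeed before start(), end() or group() may be called.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = readAll(new StringReader("Mix 2-propanol with water, then heat."));
        // prints [Mix, 2-propanol, with, water, then, heat]
        System.out.println(tokenize(text, Arrays.asList("2-propanol")));
    }
}
```

In a Lucene 6 Tokenizer subclass, the constructor would take only the keyword list. Lucene hands over the text later through setReader, and the inherited `input` field can be read once `super.reset()` has run, so something like `readAll(input)` could be called from reset() and the matches emitted one at a time from incrementToken(). The Analyzer would then build that tokenizer inside createComponents(String fieldName) from its own stored keyword list, with the LowerCaseFilter wrapped around it as in the question.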

2016-10-31 09:11:39 Marku
