2016-07-30 74 views

Lucene 6.1 custom Tokenizer and Analyzer

I am looking for some help with the Lucene 6.1 API.

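I am trying to extend Lucene's Tokenizer and Analyzer, but I cannot follow the guides I have found. In every tutorial, the custom Tokenizer overrides incrementToken and takes a Reader in its constructor, and the custom Analyzer overrides a createComponents method that receives that Reader. In Lucene 6, however, createComponents takes only a single String parameter, so how do I get a Reader into my Analyzer?

My code: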

public class ChemTokenizer extends Tokenizer{ 
    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class); 
    protected String stringToTokenize; 
    protected int position = 0; 
    protected List<int[]> chemicals = new ArrayList<>(); 

    @Override 
    public boolean incrementToken() throws IOException { 
     // Clear anything that is already saved in this.charTermAttribute 
     this.charTermAttribute.setEmpty(); 

     // Get the position of the next symbol 
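     int nextIndex = -1; 
     Pattern p = Pattern.compile("[^A-zА-я]"); 
     // find(position) searches from the current position and reports an absolute index; 
     // calling start() without a successful find() would throw IllegalStateException 
     Matcher m = p.matcher(stringToTokenize); 
     if (m.find(position)) { 
      nextIndex = m.start(); 
     } 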
     // Did we lose chemicals? 
     for (int[] pair: chemicals) { 
      if (pair[0] < nextIndex && pair[1] > nextIndex) { 
       //We are in the chemical name 
       if (position == pair[0]) { 
        nextIndex = pair[1]; 
       } 
       else { 
        nextIndex = pair[0]; 
       } 
      } 
     } 
     // Next separator was found 
     if (nextIndex != -1) { 
      String nextToken = stringToTokenize.substring(position, nextIndex); 
      charTermAttribute.append(nextToken); 
      position = nextIndex + 1; 
      return true; 
     } 
     // Last part of text 
     else if (position < stringToTokenize.length()) { 
      String nextToken = stringToTokenize.substring(position); 
      charTermAttribute.append(nextToken); 
      position = stringToTokenize.length(); 
      return true; 
     } 
     else { 
      return false; 
     } 
    } 
    public ChemTokenizer(Reader reader,List<String> additionalKeywords) { 
     int numChars; 
     char[] buffer = new char[1024]; 
     StringBuilder stringBuilder = new StringBuilder(); 
     try { 
      while ((numChars = 
        reader.read(buffer, 0, buffer.length)) != -1) { 
       stringBuilder.append(buffer, 0, numChars); 
      } 
     } 
     catch (IOException e) { 
      throw new RuntimeException(e); 
     } 
     stringToTokenize = stringBuilder.toString(); 
     //Checking for keywords 
     //Doesnt work properly if text has chemical synonyms 
     for (String keyword: additionalKeywords) { 
      int[] tmp = new int[2]; 
      //Start of keyword 
      tmp[0] = stringToTokenize.indexOf(keyword); 
      tmp[1] = tmp[0] + keyword.length() - 1; 
      chemicals.add(tmp); 
     } 
    } 

    /* Reset the stored position for this object when reset() is called. 
    */ 
    @Override 
    public void reset() throws IOException { 
     super.reset(); 
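     position = 0; 
     // Keep `chemicals`: the ranges are computed once in the constructor, 
     // so clearing them here would discard the keyword positions before tokenizing 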

    } 
} 

And the Analyzer code:

public class ChemAnalyzer extends Analyzer{ 

    List<String> additionalKeywords; 
    public ChemAnalyzer(List<String> ad) { 
     additionalKeywords = ad; 
    } 
    @Override 
    protected TokenStreamComponents createComponents(String s, Reader reader) { 
     Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords); 
     TokenStream filter = new LowerCaseFilter(tokenizer); 
     return new TokenStreamComponents(tokenizer, filter); 
    } 

} 

The problem is that this code does not work with Lucene 6.

What do you mean by "does not work with Lucene 6"? A compile error? A runtime error? Unwanted behavior? – Mysterion

In Lucene 6, createComponents is declared differently. – 01ghost13

Answer

The following works; I found it through a GitHub search. I guess you have to create the new tokenizer without a Reader.

@Override 
protected TokenStreamComponents createComponents(String fieldName) { 
    return new TokenStreamComponents(new WhitespaceTokenizer());
}
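
To apply the same idea to a custom tokenizer, here is a minimal sketch under a few assumptions. The class names KeywordAwareTokenizer, KeywordAwareAnalyzer and KeywordAwareDemo are hypothetical, and the keyword handling is a simplified stand-in for the chemical-name logic in the question (a keyword is emitted whole whenever it starts at the current position). The constructor takes only the extra arguments. The text is read from the inherited `input` field inside reset(), after super.reset(), because Lucene supplies the Reader later through setReader(). Offsets are not set here, so this would need more work before it could be used with highlighting.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical tokenizer: the constructor takes only the keywords, never a Reader.
final class KeywordAwareTokenizer extends Tokenizer {
    // Any single character that is not a Latin or Cyrillic (U+0410..U+044F) letter
    // separates tokens.
    private static final Pattern SEPARATOR = Pattern.compile("[^A-Za-z\u0410-\u044F]");

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final List<String> keywords;
    private String text = "";
    private int position = 0;

    KeywordAwareTokenizer(List<String> keywords) {
        this.keywords = new ArrayList<>(keywords);
    }

    @Override
    public void reset() throws IOException {
        super.reset(); // after this call, `input` is the Reader given to setReader()
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = input.read(buffer, 0, buffer.length)) != -1) {
            sb.append(buffer, 0, n);
        }
        text = sb.toString();
        position = 0;
    }

    @Override
    public boolean incrementToken() {
        clearAttributes();
        Matcher m = SEPARATOR.matcher(text);
        while (position < text.length()) {
            // A keyword starting here is emitted whole, even if it contains separators.
            for (String keyword : keywords) {
                if (!keyword.isEmpty() && text.startsWith(keyword, position)) {
                    termAtt.append(keyword);
                    position += keyword.length();
                    return true;
                }
            }
            int end = m.find(position) ? m.start() : text.length();
            if (end > position) {
                termAtt.append(text, position, end);
                position = end;
                return true;
            }
            position = m.end(); // the current character is a separator: skip it
        }
        return false;
    }
}

// Hypothetical analyzer using the one-argument createComponents of Lucene 6.
class KeywordAwareAnalyzer extends Analyzer {
    private final List<String> keywords;

    KeywordAwareAnalyzer(List<String> keywords) {
        this.keywords = keywords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new KeywordAwareTokenizer(keywords));
    }
}

// Hypothetical helper that collects the terms produced for a piece of text.
final class KeywordAwareDemo {
    static List<String> tokens(Analyzer analyzer, String text) {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With this shape, calling KeywordAwareDemo.tokens(new KeywordAwareAnalyzer(keywords), someText) drives the usual reset / incrementToken / end / close cycle. The LowerCaseFilter from the question can still wrap the tokenizer by returning new TokenStreamComponents(tokenizer, filter) from createComponents.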