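Lucene 6.1: custom Tokenizer and Analyzer
I am looking for some help with the Lucene 6.1 API.
I am trying to extend Lucene's Tokenizer and Analyzer, but I cannot make sense of the guides. In every tutorial, the custom Tokenizer overrides incrementToken and takes a Reader in its constructor, and the custom Analyzer overrides createComponents with a Reader argument. In Lucene 6, however, createComponents takes only a single String parameter, so how do I pass the Reader to my Analyzer?
My code: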
public class ChemTokenizer extends Tokenizer {
    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
    protected String stringToTokenize;
    protected int position = 0;
    protected List<int[]> chemicals = new ArrayList<>();

    @Override
    public boolean incrementToken() throws IOException {
        // Clear anything that is already saved in this.charTermAttribute
        this.charTermAttribute.setEmpty();
        // Get the position of the next symbol
        int nextIndex = -1;
        Pattern p = Pattern.compile("[^A-zА-я]");
        Matcher m = p.matcher(stringToTokenize.substring(position));
        nextIndex = m.start();
        // Did we lose chemicals?
        for (int[] pair : chemicals) {
            if (pair[0] < nextIndex && pair[1] > nextIndex) {
                // We are in the chemical name
                if (position == pair[0]) {
                    nextIndex = pair[1];
                } else {
                    nextIndex = pair[0];
                }
            }
        }
        // Next separator was found
        if (nextIndex != -1) {
            String nextToken = stringToTokenize.substring(position, nextIndex);
            charTermAttribute.append(nextToken);
            position = nextIndex + 1;
            return true;
        }
        // Last part of text
        else if (position < stringToTokenize.length()) {
            String nextToken = stringToTokenize.substring(position);
            charTermAttribute.append(nextToken);
            position = stringToTokenize.length();
            return true;
        } else {
            return false;
        }
    }

    public ChemTokenizer(Reader reader, List<String> additionalKeywords) {
        int numChars;
        char[] buffer = new char[1024];
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars = reader.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        stringToTokenize = stringBuilder.toString();
        // Checking for keywords
        // Doesnt work properly if text has chemical synonyms
        for (String keyword : additionalKeywords) {
            int[] tmp = new int[2];
            // Start of keyword
            tmp[0] = stringToTokenize.indexOf(keyword);
            tmp[1] = tmp[0] + keyword.length() - 1;
            chemicals.add(tmp);
        }
    }

    /* Reset the stored position for this object when reset() is called. */
    @Override
    public void reset() throws IOException {
        super.reset();
        position = 0;
        chemicals = new ArrayList<>();
    }
}
And the Analyzer code:
public class ChemAnalyzer extends Analyzer {
    List<String> additionalKeywords;

    public ChemAnalyzer(List<String> ad) {
        additionalKeywords = ad;
    }

    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        Tokenizer tokenizer = new ChemTokenizer(reader, additionalKeywords);
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }
}
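The problem is that this code does not work with Lucene 6.
What do you mean, it does not work with Lucene 6? A compilation error? An exception? Unwanted behavior? – Mysterion
In Lucene 6, createComponents is declared differently. – 01ghost13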
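Separately from the createComponents signature, the separator search in the question's incrementToken looks like it would fail at run time: Matcher.start() is called without a preceding find(), which throws IllegalStateException, and an index taken from a substring would be relative to that substring rather than to the whole text. The range A-z also matches the ASCII punctuation between 'Z' and 'a'. The following is a minimal sketch in plain Java (no Lucene), assuming the intent is "split on anything that is not a Latin or Cyrillic letter"; the class name SeparatorScan and its methods are hypothetical, and the chemical-keyword handling is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SeparatorScan {
    // Assumption: a separator is any character that is not a Latin or Cyrillic letter.
    // \u0410-\u044F is the range written as А-я in the question; A-Za-z replaces A-z
    // so that [ \ ] ^ _ and ` are no longer treated as letters.
    private static final Pattern SEPARATOR = Pattern.compile("[^A-Za-z\u0410-\u044F]");

    /** Returns the index in text of the first separator at or after from, or -1 if none. */
    static int nextSeparator(String text, int from) {
        Matcher m = SEPARATOR.matcher(text);
        // find(int) has to succeed before start() may be called.
        // Searching the full text keeps the returned index absolute.
        return m.find(from) ? m.start() : -1;
    }

    /** Splits text on separators and skips empty tokens. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        int position = 0;
        while (position < text.length()) {
            int next = nextSeparator(text, position);
            int end = (next == -1) ? text.length() : next;
            if (end > position) {
                out.add(text.substring(position, end));
            }
            // Step over the separator (or past the end of the text).
            position = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("NaCl, H2O"));
    }
}
```

Digits count as separators under this assumption, so a formula such as H2O is split; keeping chemical names intact would still need the keyword ranges from the question's constructor.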