2012-01-13 52 views

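Lucene: how to preserve whitespace etc. when tokenizing a stream?

I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stop words, etc. from the input, so that the output is formatted the same way as the input instead of ending up as a bare stream of translations. So if my input is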

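Term1: Term2 Stopword! Term3 Term4

then I want the output to look like

Term1': Term2' Stopword! Term3' Term4'

(where Termi' is the translation of Termi) instead of simply

Term1' Term2' Term3' Term4'

Currently I am doing the following: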

PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
        PatternAnalyzer.WHITESPACE_PATTERN,
        false,
        WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);

while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}

But this, of course, loses all the whitespace etc. How can I modify this so that I can re-insert it into the output? Thanks a lot!

============ UPDATE!

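I tried splitting the original stream into "words" and "non-words". It seems to work fine. I am not sure whether it is the most efficient way, though: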

// Token and TokenType here are custom classes, not Lucene's Token.
public ArrayList<Token> splitToWords(String sIn) {
    if (sIn == null || sIn.length() == 0) {
        return null;
    }

    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue; // still inside the same run
        }
        // The character class changed: emit the run that just ended.
        TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    // Emit the final run.
    TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));

    return list;
}
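
For illustration, here is a minimal, self-contained sketch of how such a word/non-word split could drive the translation step while keeping everything else verbatim. It assumes a plain in-memory Map as a stand-in for the Lucene-indexed dictionary, and the names SplitTranslateSketch, splitRuns and translate are hypothetical (they come neither from Lucene nor from the code above). Like the splitter above, it relies on Character.isLetter, so digits count as non-word characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map stands in for the real dictionary lookup.
class SplitTranslateSketch {

    // Splits text into alternating runs of letters and non-letters.
    // Concatenating the runs reproduces the input exactly.
    static List<String> splitRuns(String text) {
        List<String> runs = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return runs;
        }
        int start = 0;
        boolean curIsLetter = Character.isLetter(text.charAt(0));
        for (int pos = 1; pos < text.length(); pos++) {
            boolean isLetter = Character.isLetter(text.charAt(pos));
            if (isLetter != curIsLetter) {
                runs.add(text.substring(start, pos));
                start = pos;
                curIsLetter = isLetter;
            }
        }
        runs.add(text.substring(start));
        return runs;
    }

    // Translates letter runs found in the dictionary; unknown words and
    // all non-word runs (whitespace, punctuation) are copied unchanged.
    static String translate(String text, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        for (String run : splitRuns(text)) {
            boolean isWord = Character.isLetter(run.charAt(0));
            out.append(isWord ? dictionary.getOrDefault(run, run) : run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("cat", "chat");
        dictionary.put("dog", "chien");
        // Prints: The chat:  and the chien!
        System.out.println(translate("The cat:  and the dog!", dictionary));
    }
}
```

Because every run is appended in order, the double space and the punctuation survive untouched; only the runs that hit the dictionary change.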

You are translating a piece of text, but what does this have to do with Lucene? – milan 2012-01-14 10:26:38

@milan The actual translation is done by searching a dictionary that is indexed by Lucene. – 2012-01-16 17:18:29

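I made a wrapper that wraps any tokenizer and produces a token stream that includes the "missing tokens". It is part of a bigger project that is not open source yet, so let me know if you need it. – fulmicoton 2015-04-24 01:24:32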

Answers

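Well, it doesn't really lose the whitespace: you still have your original text :)

So I think you should make use of OffsetAttribute, which gives each term's startOffset() and endOffset() into your original text. This is what Lucene uses, for example, to highlight snippets of search results from the original text.

I wrote up a quick test (using EnglishAnalyzer) to demonstrate. The input is: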

Just a test of some ideas. Let's see if it works. 

The output is:

just a test of some idea. let see if it work. 

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
    String input = "Just a test of some ideas. Let's see if it works.";
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
    StringBuilder output = new StringBuilder(input);
    // in some cases, the analyzer will make terms longer or shorter.
    // because of this we must track how much we have adjusted the text so far
    // so that the offsets returned will still work for us via replace()
    int delta = 0;

    TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        // offsets refer to the original input, so shift them by delta
        output.replace(delta + start, delta + end, term);
        delta += (term.length() - (end - start));
    }
    ts.close();

    System.out.println(output.toString());
}
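
To connect this back to the dictionary use case in the question, here is a rough sketch of the same delta bookkeeping with no Lucene dependency. It makes two assumptions that are not in the original: a java.util.regex Matcher stands in for the token stream (its start() and end() play the role of startOffset() and endOffset()), and a plain Map stands in for the dictionary. The class name OffsetReplaceSketch and its translate method are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex matches play the role of analyzer tokens.
class OffsetReplaceSketch {

    // A "term" here is simply a run of Unicode letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static String translate(String input, Map<String, String> dictionary) {
        StringBuilder output = new StringBuilder(input);
        // Net change in length caused by the replacements made so far.
        int delta = 0;

        // Match against the original input so offsets stay stable.
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            String replacement = dictionary.get(m.group());
            if (replacement == null) {
                continue; // not in the dictionary: leave the text as written
            }
            int start = m.start();
            int end = m.end();
            // Offsets refer to the original input, so shift them by delta.
            output.replace(delta + start, delta + end, replacement);
            delta += replacement.length() - (end - start);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("red", "rouge");
        dictionary.put("green", "vert");
        dictionary.put("blue", "bleu");
        // Prints: rouge: vert the  bleu!
        System.out.println(translate("red: green the  blue!", dictionary));
    }
}
```

Terms that are skipped (here, anything missing from the map; with a real analyzer, removed stop words) are never touched, so they stay in the output exactly as they appeared in the input.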

Thanks. Before reading your reply, I tried this: – 2012-01-16 17:13:43
