2017-10-05

Why must the input to myCategorizer.categorize() be a String[] in Apache OpenNLP 1.8, rather than a String as in Apache OpenNLP 1.5? I want to classify a single string, not an array. Document categorizer OpenNLP - categorize method

public void trainModel() 
    { 
     InputStream dataIn = null; 
     try 
     { 
      dataIn = new FileInputStream("D:/training.txt"); 
      ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8"); 
      ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream); 
      // Specifies the minimum number of times a feature must be seen 
      int cutoff = 2; 
      int trainingIterations = 30; 
      model = DocumentCategorizerME.train("NL", sampleStream, cutoff, trainingIterations); 


     } 

     catch (IOException e) 
     { 
      e.printStackTrace(); 
     } 

     finally 
     { 
      if (dataIn != null) 
      { 
       try 
       { 
        dataIn.close(); 
       } 
       catch (IOException e) 
       { 
        e.printStackTrace(); 
       } 
      } 
     } 
    } 


public void classifyNewTweet(String tweet) 
{ 
    DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model); 
    double[] outcomes = myCategorizer.categorize(tweet); 
    String category = myCategorizer.getBestCategory(outcomes); 

    if (category.equalsIgnoreCase("1")) 
    { 
     System.out.println("The tweet is positive :) "); 
    } 
    else 
    { 
     System.out.println("The tweet is negative :("); 
    } 
} 

Answer


Back in the OpenNLP 1.5 days, the first thing the DocumentCategorizer did was tokenize the string into words. At first glance that may seem easy, but you might prefer a maximum-entropy tokenizer over the default WhitespaceTokenizer, and the choice of tokenizer can have a big effect on classification. Changing the API so that users can supply a tokenizer of their own choosing alleviates that problem. Just add:

Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE; 
... 
String[] tokens = tokenizer.tokenize(tweet); 
double[] outcomes = myCategorizer.categorize(tokens); 
... 

That should fix your problem. You can also use a statistical tokenizer (see TokenizerME) or the SimpleTokenizer.
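For illustration only, here is a plain-Java sketch (not the OpenNLP API) of what the whitespace strategy amounts to: a hypothetical tokenize helper that splits the tweet on runs of whitespace, producing the String[] that categorize() expects in 1.8. In real code, use the OpenNLP Tokenizer shown above.

```java
// Stand-in sketch: mimics a whitespace tokenizer by splitting on whitespace.
// This only illustrates the String -> String[] step that OpenNLP 1.8 requires
// before calling categorize(); it is not a replacement for a real Tokenizer.
public class WhitespaceSplitDemo {

    // Hypothetical helper: trims the text, then splits on one or more
    // whitespace characters.
    static String[] tokenize(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("This phone is  great");
        System.out.println(tokens.length); // 4
        System.out.println(tokens[3]);     // great
    }
}
```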

Hope it helps...