How to get the most frequently occurring words from text extracted with Tika

I have extracted text from several file formats (PDF, HTML, DOC) with Tika, using the code below.

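// Tika classes assumed: org.apache.tika.parser.AutoDetectParser,
// org.apache.tika.metadata.Metadata, org.apache.tika.sax.BodyContentHandler
File file1 = new File("c://sample.pdf");
InputStream input = new FileInputStream(file1);
// Write limit of 10*1024*1024 characters for the extracted text
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
// The parse step was not shown in the original snippet; AutoDetectParser is an assumption
new AutoDetectParser().parse(input, handler, new Metadata());
JSONObject obj = new JSONObject();
obj.put("Content",handler.toString());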

Now my requirement is to get the frequently occurring words from the extracted content. Can you tell me how to do this?

Thanks

Source

2013-07-03 user2545106

Is the content JSON? – vidit

Yes, the content is stored in a JSON object – user2545106

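Here is a function that returns the most frequent word.

Pass the extracted content to it as a string and it returns the word that occurs most often.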

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

String getMostFrequentWord(String input) {
    // Split on any run of whitespace, since extracted text contains newlines and tabs as well as spaces
    String[] words = input.trim().split("\\s+");
    // Create a dictionary using word as key, and frequency as value
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (String word : words) {
        if (dictionary.containsKey(word)) {
            int frequency = dictionary.get(word);
            dictionary.put(word, frequency + 1);
        } else {
            dictionary.put(word, 1);
        }
    }

    // Scan the dictionary once, keeping the entry with the highest count
    int max = 0;
    String mostFrequentWord = "";
    Set<Entry<String, Integer>> set = dictionary.entrySet();
    for (Entry<String, Integer> entry : set) {
        if (entry.getValue() > max) {
            max = entry.getValue();
            mostFrequentWord = entry.getKey();
        }
    }

    return mostFrequentWord;
}

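The algorithm is O(n) in the number of words (one pass to count, one pass over the distinct words), so performance should be fine.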
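
The question asks for frequently occurring words in the plural, while the function above returns a single word. Below is a minimal sketch of one way to return the top few words instead. The class name `WordFrequency`, the method `topWords`, and the choices to lowercase the text and split on anything that is not a letter or digit are assumptions made for illustration, not part of the original answer. It does no stop-word filtering, so common words such as "the" will usually come out on top.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of Tika or of the original answer
class WordFrequency {

    // Returns up to `limit` words (limit assumed >= 0), most frequent first;
    // words with equal counts are ordered alphabetically.
    public static List<String> topWords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a "word" is a run of letters or digits, compared case-insensitively
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        // Sort the distinct words by descending count, then by the word itself
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(limit, entries.size()))) {
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [the, and]
        System.out.println(topWords("The cat and the dog and the bird", 2));
    }
}
```

Counting is still a single pass over the words; the extra cost is the sort, which is O(k log k) for k distinct words. With the code from the question, the input would be the string returned by `handler.toString()`.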

Source

2013-07-03 06:03:30 Mingyu
