Finding the frequency of every word in a file

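I am trying to find the frequency of every word in a file.

Not just how many instances there are of one particular word, but the frequency of every single word.

For example, if the file contained this sentence: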

"Super awesome super cool people are awesome!"

It would output this:

Super - 2 
Awesome - 2 
Cool - 1 
People - 1 
Are - 1

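That is, it shows the frequency of every word.

How can I do this in Java, counting across the whole file, without knowing in advance which words I might be testing for?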

Source

2013-04-22 N01zii

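There are two separate problems here. Split them up. Use a 'Map' and check whether it already contains an entry for the String token. If it does, add 1 to the count; otherwise set the count to 1. – 2013-04-22 16:51:29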
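The check-then-increment step from the comment above could look roughly like this. It is only a sketch: the class name `TokenCounter` and the helper `countToken` are made up for illustration, and the map is assumed to be a `Map<String, Integer>`.

```java
import java.util.HashMap;
import java.util.Map;

class TokenCounter {
    // Hypothetical helper: records one more occurrence of `token` in `counts`.
    static void countToken(Map<String, Integer> counts, String token) {
        if (counts.containsKey(token)) {
            // Seen before: add 1 to the stored count.
            counts.put(token, counts.get(token) + 1);
        } else {
            // First occurrence: start the count at 1.
            counts.put(token, 1);
        }
    }
}
```

On Java 8 or later, `counts.merge(token, 1, Integer::sum)` should do the same thing in one call.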

To get the text of an HTML page without the markup, use [HtmlUnit](http://htmlunit.sourceforge.net/). The HtmlPage class has an [asText()](http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/DomNode.html#asText%28%29) method. – 2013-04-22 16:54:17

Try the following:

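// Requires: import java.util.LinkedHashMap; import java.util.Map;

// This will match all non-word characters, i.e. characters that are
// not in [a-zA-Z_0-9]. This should match whitespace and punctuation.
// The backslash has to be doubled inside a Java string literal.
String nonWordDelimiter = "[\\W]+";

// 'text' is assumed to already hold the contents of the file.
String[] words = text.split(nonWordDelimiter);

// LinkedHashMap keeps the words in the order they were first seen.
Map<String, Integer> frequencies = new LinkedHashMap<String, Integer>();
for (String word : words) {
    // split() can return an empty first element if the text starts with a delimiter.
    if (!word.isEmpty()) {
        Integer frequency = frequencies.get(word);

        if (frequency == null) {
            frequency = 0;
        }

        ++frequency;
        frequencies.put(word, frequency);
    }
}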

At the end, the map frequencies will contain the frequency of every word.
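To tie this back to the question (a whole file, with "Super" and "super" counted as the same word), here is one possible self-contained version. Treat it as a sketch under a few assumptions: the class and method names (`WordFrequencySketch`, `countWords`, `countWordsInFile`) are invented for this example, the file is assumed to be UTF-8 and small enough to read into memory, and words are lowercased before counting, which the snippet above does not do.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class WordFrequencySketch {
    // Counts the words in `text`, ignoring case; the map keys are lowercased.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            // Skip the empty string split() may produce at the start.
            if (!word.isEmpty()) {
                frequencies.merge(word, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    // Reads the whole file as UTF-8 (an assumption) and counts its words.
    static Map<String, Integer> countWordsInFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return countWords(text);
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, count that file; otherwise use a sample sentence.
        Map<String, Integer> frequencies = args.length > 0
                ? countWordsInFile(Paths.get(args[0]))
                : countWords("Super awesome super cool people are awesome!");
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            System.out.println(entry.getKey() + " - " + entry.getValue());
        }
    }
}
```

Run without arguments, this prints `super - 2`, `awesome - 2`, `cool - 1`, `people - 1`, `are - 1`, in that order, because `LinkedHashMap` preserves first-seen order. Note that `\W` only treats `[a-zA-Z_0-9]` as word characters, so words containing accented or non-Latin letters would be split apart; for a very large file, reading line by line instead of all at once would be the safer choice.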

Source

2013-04-22 17:15:33
