Efficiently counting word frequencies in a string

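I am parsing a long string of text in Python and counting how many times each word occurs. I have a function that works, but I am looking for advice on whether it can be made more efficient (in terms of speed), and whether there is a Python library function that already does this, so that I am not reinventing the wheel.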

Can you suggest a more efficient way to find the most common words in a long string (usually more than 1000 words)?

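Also, what is the best way to sort the dictionary into a list whose first element is the most common word, whose second element is the second most common word, and so on?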

test = """abc def-ghi jkl abc 
abc""" 

def calculate_word_frequency(s): 
    # Post: return a list of words ordered from the most 
    # frequent to the least frequent 

    words = s.split() 
    freq = {} 
    for word in words:
        if freq.has_key(word):
            freq[word] += 1
        else:
            freq[word] = 1
    return sort(freq) 

def sort(d): 
    # Post: sort dictionary d into list of words ordered 
    # from highest freq to lowest freq 
    # eg: For {"the": 3, "a": 9, "abc": 2} should be 
    # sorted into the following list ["a","the","abc"] 

    #I have never used lambda's so I'm not sure this is correct 
    return d.sort(cmp = lambda x,y: cmp(d[x],d[y])) 

print calculate_word_frequency(test)


2012-03-29 Jake M

'has_key' is deprecated; use 'key in d' instead. Also, your sort function is quite wrong. 'return sorted(d, key=d.__getitem__, reverse=True)' would sort by frequency in descending order and return the keys. – agf 2012-03-29 06:02:06
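A minimal sketch of the fix described in the comment above, assuming Python 3. The function name comes from the question; the membership test with `in` and the `sorted` call with `key=freq.__getitem__` follow the comment's suggestion, and the rest is illustrative only.

```python
def calculate_word_frequency(s):
    """Return the distinct words of s, ordered from most to least frequent."""
    freq = {}
    for word in s.split():
        # `word in freq` replaces the deprecated freq.has_key(word)
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    # Sort the keys by their counts, highest count first
    return sorted(freq, key=freq.__getitem__, reverse=True)


test = """abc def-ghi jkl abc
abc"""
print(calculate_word_frequency(test))  # 'abc' comes first; it occurs 3 times
```

Words with equal counts are not ordered in any particular way by this sketch, so only the position of words with distinct counts should be relied on.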

Use collections.Counter:

>>> from collections import Counter 
>>> test = 'abc def abc def zzz zzz' 
>>> Counter(test.split()).most_common() 
[('abc', 2), ('zzz', 2), ('def', 2)]


2012-03-29 05:39:27

>>> test = """abc def-ghi jkl abc
... abc"""
>>> from collections import Counter 
>>> words = Counter() 
>>> words.update(test.split()) # Update counter with words 
>>> words.most_common()  # Print list with most common to least common 
[('abc', 3), ('jkl', 1), ('def-ghi', 1)]
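If only the words are needed, as in the question, the `(word, count)` pairs returned by `most_common` can be reduced to a plain list. The following is a rough sketch; `most_frequent_words` and its `n` parameter are hypothetical names introduced here for illustration, not part of `collections`.

```python
from collections import Counter


def most_frequent_words(text, n=None):
    """Return words ordered from most to least frequent (hypothetical helper).

    If n is given, only the n most frequent words are returned.
    """
    counts = Counter(text.split())
    # most_common(n) yields (word, count) pairs; keep only the words
    return [word for word, _ in counts.most_common(n)]


print(most_frequent_words("abc def-ghi jkl abc abc", 1))  # ['abc']
```

Passing `n` avoids building the full ranking when only the top few words matter.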


2012-03-29 05:38:36 jamylak

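You can also use NLTK (the Natural Language Toolkit). It provides very good libraries for studying and processing text. For this example, you can use: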

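from nltk import FreqDist

text = "aa bb cc aa bb"
# FreqDist counts the items of the sequence it is given, so split the
# string into words first; passing the raw string would count characters
fdist1 = FreqDist(text.split())

# show the 10 most frequent words in the text
print(fdist1.most_common(10))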

The result will be:

[('aa', 2), ('bb', 2), ('cc', 1)]


2014-10-06 09:11:20
