2012-03-29

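Efficiently calculate word frequency in a string

I am parsing a long string of text and counting the number of times each word occurs, in Python. I have a function that works, but I'm looking for advice on whether there are ways to make it more efficient (in terms of speed), and whether there are Python library functions that could do this for me so I'm not reinventing the wheel.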

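Can you suggest a more efficient way to find the most common words in a long string (usually over 1000 words)?

Also, what is the best way to sort the dictionary into a list where the first element is the most common word, the second element is the second most common word, and so on?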

test = """abc def-ghi jkl abc 
abc""" 

def calculate_word_frequency(s): 
    # Post: return a list of words ordered from the most 
    # frequent to the least frequent 

    words = s.split() 
    freq = {} 
    for word in words:
        if freq.has_key(word):
            freq[word] += 1
        else:
            freq[word] = 1
    return sort(freq) 

def sort(d): 
    # Post: sort dictionary d into list of words ordered 
    # from highest freq to lowest freq 
    # eg: For {"the": 3, "a": 9, "abc": 2} should be 
    # sorted into the following list ["a","the","abc"] 

    #I have never used lambda's so I'm not sure this is correct 
    return d.sort(cmp = lambda x,y: cmp(d[x],d[y])) 

print calculate_word_frequency(test) 

'has_key' is deprecated. Use 'key in d' instead. Also, your sort function is very wrong. 'return sorted(d, key=d.__getitem__, reverse=True)' would sort by descending frequency and return the keys. – agf 2012-03-29 06:02:06
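
Putting the comment's two suggestions together, a corrected version of the question's function might look like the sketch below. It keeps the original function name and is written for Python 3. Using dict.get in place of has_key is a choice made here for brevity, not something the comment prescribes; a 'word in freq' test works just as well.

```python
def calculate_word_frequency(s):
    """Return the words of s ordered from most to least frequent."""
    freq = {}
    for word in s.split():
        # dict.get returns 0 for a word not seen yet, so no has_key test is needed
        freq[word] = freq.get(word, 0) + 1
    # Sort the keys by their counts, highest count first
    return sorted(freq, key=freq.__getitem__, reverse=True)


test = """abc def-ghi jkl abc
abc"""
print(calculate_word_frequency(test))
```

Because sorted is stable, words with equal counts stay in the order in which they were first inserted into the dictionary (Python 3.7 and later).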

Answers


Use collections.Counter:

>>> from collections import Counter 
>>> test = 'abc def abc def zzz zzz' 
>>> Counter(test.split()).most_common() 
[('abc', 2), ('zzz', 2), ('def', 2)] 
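
If only the ordered list of words is wanted, as in the question, the counts can be dropped from the pairs that most_common() returns. A minimal sketch, assuming Python 3; top_words is a hypothetical helper name, and its optional n argument limits the result to the n most frequent words.

```python
from collections import Counter


def top_words(text, n=None):
    """Return words ordered by frequency; n=None returns all of them."""
    counts = Counter(text.split())
    # most_common yields (word, count) pairs; keep only the word
    return [word for word, _ in counts.most_common(n)]


print(top_words("the a a abc the a"))
print(top_words("the a a abc the a", 1))
```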
>>> test = """abc def-ghi jkl abc
... abc"""
>>> from collections import Counter 
>>> words = Counter() 
>>> words.update(test.split()) # Update counter with words 
>>> words.most_common()  # Print list with most common to least common 
[('abc', 3), ('jkl', 1), ('def-ghi', 1)] 
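
Because update() adds to the existing counts, a long text can be counted piece by piece rather than split in one go. The sketch below feeds the counter one line at a time. The function name count_words_by_line is made up for illustration, and the code assumes Python 3.

```python
from collections import Counter


def count_words_by_line(lines):
    """Accumulate word counts over an iterable of lines."""
    words = Counter()
    for line in lines:
        # update() increments existing counts instead of replacing them
        words.update(line.split())
    return words


test = """abc def-ghi jkl abc
abc"""
counts = count_words_by_line(test.splitlines())
print(counts.most_common(1))
```

The same function should accept an open file object, since iterating over a file yields its lines.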

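You can also use NLTK (the Natural Language Toolkit). It provides very good libraries for studying and processing text. For this example you could use: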

from nltk import FreqDist 

text = "aa bb cc aa bb" 
# split into words first; passing the bare string would count characters
fdist1 = FreqDist(text.split())

# show the 10 most frequent words in the text
print(fdist1.most_common(10))

The result will be:

[('aa', 2), ('bb', 2), ('cc', 1)]