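Efficiently counting word frequencies in a string

I'm parsing a long string of text in Python and counting how many times each word occurs. I have a function that works, but I'm looking for advice on whether it can be made more efficient (in terms of speed), and whether there is a Python library function that already does this for me, so that I'm not reinventing the wheel.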
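Can you suggest a more efficient way to find the most common words in a long string (usually more than 1000 words)?
Also, what is the best way to sort the dictionary into a list in which the first element is the most common word, the second element is the second most common word, and so on?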
test = """abc def-ghi jkl abc
abc"""

def calculate_word_frequency(s):
    # Post: return a list of words ordered from the most
    # frequent to the least frequent
    words = s.split()
    freq = {}
    for word in words:
        if freq.has_key(word):
            freq[word] += 1
        else:
            freq[word] = 1
    return sort(freq)

def sort(d):
    # Post: sort dictionary d into list of words ordered
    # from highest freq to lowest freq
    # eg: For {"the": 3, "a": 9, "abc": 2} should be
    # sorted into the following list ["a","the","abc"]
    #I have never used lambda's so I'm not sure this is correct
    return d.sort(cmp = lambda x,y: cmp(d[x],d[y]))

print calculate_word_frequency(test)
'has_key' is deprecated. Use 'key in d' instead. Also, your sort function is very wrong. 'return sorted(d, key=d.__getitem__, reverse=True)' will sort by frequency in descending order and return the keys. – agf 2012-03-29 06:02:06
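
For reference, here is a minimal sketch of what the question's function might look like with the two fixes from the comment applied (the `in` test instead of `has_key`, and `sorted` with `key=d.__getitem__`). The second function, `calculate_word_frequency_counter`, is a hypothetical name added here for illustration; it uses `collections.Counter` from the standard library, which the comment does not mention, so treat it as one possible alternative rather than the recommended answer. Words are still split on whitespace only, as in the question, so "def-ghi" counts as a single word.

```python
from collections import Counter


def calculate_word_frequency(s):
    # Count each whitespace-separated word with a plain dict.
    freq = {}
    for word in s.split():
        if word in freq:  # replaces the deprecated has_key()
            freq[word] += 1
        else:
            freq[word] = 1
    # Sort the keys by their counts, highest first, as the comment suggests.
    return sorted(freq, key=freq.__getitem__, reverse=True)


def calculate_word_frequency_counter(s):
    # Alternative sketch (an assumption, not from the comment):
    # Counter does the counting, most_common() orders by count.
    return [word for word, _ in Counter(s.split()).most_common()]


test = """abc def-ghi jkl abc
abc"""
print(calculate_word_frequency(test))
print(calculate_word_frequency_counter(test))
```

Both functions put "abc" first for the sample string. The relative order of the words that tie at one occurrence is not something either version is meant to guarantee.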