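Python NLTK: counting word and phrase frequencies

I am using NLTK and trying to get counts of word phrases up to a certain length for a particular document, along with the frequency of each phrase. I tokenize the string to get a list of tokens.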

from nltk.util import ngrams

# Tokenized document
data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"]

# All consecutive pairs of tokens
bigrams = ngrams(data, 2)

# Count how often each bigram occurs
bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1

The code above gives output like this:

(('is', 'this'), 1) 
(('test', 'this'), 2) 
(('a', 'test'), 3) 
(('this', 'is'), 4) 
(('is', 'not'), 1) 
(('real', 'not'), 2) 
(('is', 'real'), 2) 
(('not', 'a'), 3)

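This is partly what I'm looking for.

My question is: is there a more convenient way to do this for phrases up to, say, 4 or 5 words long, without duplicating this code just to change the count variable?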

2016-11-18 user1610950

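Since you tagged this nltk, here is how to do it with NLTK's own tools, which have a few more features than the ones in the standard Python collections module.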

from nltk import ngrams, FreqDist 
all_counts = dict() 
for size in 2, 3, 4, 5: 
    all_counts[size] = FreqDist(ngrams(data, size))

Each element of the dictionary all_counts is a dictionary of n-gram frequencies. For example, you can get the five most common trigrams like this:

all_counts[3].most_common(5)
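
As a rough illustration of the extra features mentioned above, here is a minimal, self-contained sketch that rebuilds all_counts from the question's token list and then looks up a raw count, the total number of bigrams, and a relative frequency. The variable name bigram_counts is introduced here only for illustration, and the sketch assumes nltk is installed.

```python
from nltk import FreqDist, ngrams

# Token list copied from the question
data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

# One FreqDist per n-gram size, keyed by the size
all_counts = {size: FreqDist(ngrams(data, size)) for size in (2, 3, 4, 5)}

bigram_counts = all_counts[2]              # illustrative name, not from the answer
print(bigram_counts[("this", "is")])       # raw count of one bigram
print(bigram_counts.N())                   # total number of bigrams counted
print(bigram_counts.freq(("this", "is")))  # relative frequency: count / N
```

FreqDist is a subclass of collections.Counter, so ordinary Counter operations such as most_common() also work on each entry.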

2016-11-19 13:22:26 alexis

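Holy smokes, this works so much better than what I had written before. Thanks a lot, masterful answer! – user1610950

Yes, don't run that loop; use collections.Counter(bigrams) or pandas.Series(bigrams).value_counts() to compute the counts in a single line.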
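
The Counter suggestion can be extended to several phrase lengths at once. The following is a minimal sketch using only the standard library; the helper name count_ngrams and its sizes parameter are hypothetical, introduced here for illustration, and the n-grams are built with zip rather than nltk's ngrams.

```python
from collections import Counter


def count_ngrams(tokens, sizes=(2, 3, 4, 5)):
    """Return {size: Counter of n-gram tuples} for each requested size.

    Hypothetical helper, not part of NLTK or of the original answers.
    """
    counts = {}
    for n in sizes:
        # Zipping n staggered slices of the list yields consecutive n-tuples
        counts[n] = Counter(zip(*(tokens[i:] for i in range(n))))
    return counts


# Token list copied from the question
data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]

all_counts = count_ngrams(data)
print(all_counts[2].most_common(1))  # [(('this', 'is'), 4)]
```

Because zip stops at the shortest slice, a list of 19 tokens yields 18 bigrams, 17 trigrams, and so on, with no partial n-grams at the end.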

2016-11-18 04:14:01 maxymoo
