2017-04-05 52 views

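How do I count n-gram occurrences across many lists?

Does anyone know whether it is possible to count how many times the n-grams in a vocabulary occur in several different token lists? The vocabulary is built from the n-grams found in the lists, with each unique n-gram listed once. If I have: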

Lists:

['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay'] //1 

['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay'] //2 

<type = list> 

N-gram vocabulary:

('hello','I') 
('I', 'am') 
('am', 'doing') 
('doing', 'okay') 
('okay','are') 
('hello', 'how') 
('how', 'are') 
('are','you') 
('you', 'doing') 
('doing', 'today') 
('today', 'are') 
('you', 'okay') 
<type = tuples>

Then the output I want looks like this:

List 1:

('hello', 'how')1 
('how', 'are')1 
('are','you')2 
('you', 'doing')1 
('doing', 'today')1 
('today', 'are')1 
('you', 'okay')1 

List 2:

('hello','I')1 
('I', 'am')1 
('am', 'doing')1 
('doing', 'okay')1 
('okay','are')1 
('are','you')1 
('you', 'okay')1 

I have the following code:

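import nltk
from nltk import bigrams, word_tokenize
from nltk.corpus import stopwords

# test_lower is defined elsewhere: the lowercased input strings
test_tokenized = [word_tokenize(i) for i in test_lower]
stop_words = set(stopwords.words('english'))  # build the stopword set once

for test_toke in test_tokenized:
    filtered_words = [word for word in test_toke if word not in stop_words]
    bigram = bigrams(filtered_words)
    fdist = nltk.FreqDist(bigram)  # counts how often each bigram occurs
    for k, v in fdist.items():
        # print(k, v)
        occur = (k, v)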

Answers


Use a list comprehension to generate the n-grams and collections.Counter to count the duplicates:

from collections import Counter 
l = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay'] 
ngrams = [(l[i],l[i+1]) for i in range(len(l)-1)] 
print(Counter(ngrams))
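
The same idea extends beyond bigrams. As a minimal sketch (the helper name `count_ngrams` and its `n` parameter are illustrative, not from the answer above), zipping shifted slices of the token list yields n-grams of any length:

```python
from collections import Counter

def count_ngrams(tokens, n=2):
    """Count every n-gram (as a tuple) in a list of tokens."""
    # zip over n shifted views of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest view, so only complete n-grams are produced.
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay']
bigram_counts = count_ngrams(tokens, 2)
trigram_counts = count_ngrams(tokens, 3)

print(bigram_counts[('are', 'you')])
print(trigram_counts[('are', 'you', 'okay')])
```

Looking up an n-gram that never occurred returns 0 rather than raising, because Counter treats missing keys as a zero count.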

I would suggest using a for loop with range:

from collections import Counter 
list1 = ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay'] 
list2 = ['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay'] 

def ngram(li): 
    result = [] 
    for i in range(len(li)-1): 
        result.append((li[i], li[i+1]))
    return Counter(result) 

print(ngram(list1)) 
print(ngram(list2)) 
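
To get the per-list layout the question asks for, one possible approach is to count each list's bigrams once and then look up every vocabulary entry, keeping only those that occur. This is a sketch under assumptions: the names `vocabulary_counts`, `lists`, and `per_list` are hypothetical, and the output follows the vocabulary's order.

```python
from collections import Counter

lists = [
    ['hello', 'how', 'are', 'you', 'doing', 'today', 'are', 'you', 'okay'],
    ['hello', 'I', 'am', 'doing', 'okay', 'are', 'you', 'okay'],
]
vocabulary = [
    ('hello', 'I'), ('I', 'am'), ('am', 'doing'), ('doing', 'okay'),
    ('okay', 'are'), ('hello', 'how'), ('how', 'are'), ('are', 'you'),
    ('you', 'doing'), ('doing', 'today'), ('today', 'are'), ('you', 'okay'),
]

def vocabulary_counts(tokens, vocab):
    """Return (bigram, count) pairs for vocabulary bigrams found in tokens."""
    counts = Counter(zip(tokens, tokens[1:]))  # all bigrams of this list
    # Counter returns 0 for missing keys, so absent bigrams are skipped.
    return [(gram, counts[gram]) for gram in vocab if counts[gram] > 0]

per_list = [vocabulary_counts(tokens, vocabulary) for tokens in lists]

for number, rows in enumerate(per_list, start=1):
    print('List {}:'.format(number))
    for gram, count in rows:
        print(gram, count)
```

Counting once per list and then doing dictionary lookups avoids rescanning the tokens for every vocabulary entry.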