2017-10-04 92 views
1

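Forming bigrams of words in a list of sentences and counting them using Python

I need to:

1. Form bigram pairs and store them in a list.
2. Find the IDs of the sentences that contain the top 3 bigrams with the highest frequency.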

I have a list of sentences:

[['22574999', 'your message communication sent'] 
, ['22582857', 'your message be delivered'] 
, ['22585166', 'message has be delivered'] 
, ['22585424', 'message originated communication sent']] 

Here is what I have done:

for row in messages: 
    sstrm = list(row) 
    bigrams=[b for l in sstrm for b in zip(l.split(" ")[:1], l.split(" ")[1:])] 
    print(sstrm[0],bigrams) 

This produces:

22574999 [('your', 'message')] 
22582857 [('your', 'message')] 
22585166 [('message', 'has')] 
22585424 [('message', 'originated')] 

What I want is:

22574999 [('your', 'message'),('communication','sent')] 
22582857 [('your', 'message'),('be','delivered')] 
22585166 [('message', 'has'),('be','delivered')] 
22585424 [('message', 'originated'),('communication','sent')] 

I would like to get the following results:

Top 3 bigrams with the highest frequency:

('your', 'message') :2 
('communication','sent'):2  
('be','delivered'):2 

IDs of the sentences that contain the top 3 most frequent bigrams:

('your', 'message'):2   Is included (22574999,22582857)  
('communication','sent'):2  Is included(22574999,22585424) 
('be','delivered'):2   Is included (22582857,22585166) 

Thanks for your help!

Answers

1

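The first thing I want to point out is that a bigram is a sequence of two adjacent elements.

For example, the bigrams of "the fox jumped over the lazy dog" are: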

[("the", "fox"),("fox", "jumped"),("jumped", "over"),("over", "the"),("the", "lazy"),("lazy", "dog")]
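A minimal sketch of that definition, pairing each token with its successor via `zip`. The helper name `adjacent_bigrams` is an illustrative choice, not something taken from the question:

```python
def adjacent_bigrams(sentence):
    # Split on whitespace and pair each token with the one that follows it.
    tokens = sentence.split()
    return list(zip(tokens, tokens[1:]))


example = adjacent_bigrams("the fox jumped over the lazy dog")
print(example)
```

Because `zip` stops at the shorter input, a sentence of n words yields n - 1 bigrams, and a one-word sentence yields none.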

This problem can be modelled with an inverted index, where the bigrams are the terms (the keys) and the lists of IDs are the posting lists.

import operator


def bigrams(line):
    # Pair every token with its right-hand neighbour.
    tokens = line.split(" ")
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]


if __name__ == "__main__":
    messages = [['22574999', 'your message communication sent'], ['22582857', 'your message be delivered'], ['22585166', 'message has be delivered'], ['22585424', 'message originated communication sent']]

    # Collect every distinct bigram across all messages.
    bigrams_set = set()
    for row in messages:
        for bigram in bigrams(row[1]):
            bigrams_set.add(bigram)

    # Inverted index: bigram -> list of message IDs that contain it.
    inverted_idx = dict((b, []) for b in bigrams_set)
    for row in messages:
        for bigram in bigrams(row[1]):
            inverted_idx[bigram].append(row[0])

    # Frequency is the length of each posting list; keep the three largest.
    # .items() replaces the Python 2-only .iteritems().
    freq_bigrams = dict((b, len(ids)) for b, ids in inverted_idx.items())
    top3_bigrams = sorted(freq_bigrams.items(), key=operator.itemgetter(1), reverse=True)[:3]
    print(top3_bigrams)

Output (bigrams with equal counts may appear in any order):

[(('communication', 'sent'), 2), (('your', 'message'), 2), (('be', 'delivered'), 2)] 

Although this code could be optimized considerably, it gives you the idea.
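As one possible tightening, the same inverted index can be built in a single pass with `collections.defaultdict`, which also keeps the IDs per bigram that the question asks for. This is only a sketch, under the assumption that a bigram should be counted once per message; `build_index` and `top_bigrams` are hypothetical helper names:

```python
from collections import defaultdict


def bigrams(line):
    tokens = line.split()
    return list(zip(tokens, tokens[1:]))


def build_index(messages):
    # Map each bigram to the IDs of the messages that contain it.
    index = defaultdict(list)
    for msg_id, text in messages:
        # set() so a bigram repeated inside one message is recorded once.
        for bigram in set(bigrams(text)):
            index[bigram].append(msg_id)
    return index


def top_bigrams(index, n=3):
    # Rank by posting-list length; tied bigrams come out in arbitrary order.
    return sorted(index.items(), key=lambda item: len(item[1]), reverse=True)[:n]


messages = [
    ['22574999', 'your message communication sent'],
    ['22582857', 'your message be delivered'],
    ['22585166', 'message has be delivered'],
    ['22585424', 'message originated communication sent'],
]

index = build_index(messages)
for bigram, ids in top_bigrams(index):
    print(bigram, len(ids), ids)
```

Keeping the posting lists around, rather than reducing them to counts straight away, is what makes it possible to report both the frequency and the IDs for each of the top bigrams.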

0

You have an error on this line:

bigrams=[b for l in sstrm for b in zip(l.split(" ")[:1], l.split(" ")[1:])] 

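With [:1], the first argument to zip stops after the first element of the list. You want every element except the last one, which corresponds to [:-1].

So the line should look like this: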

bigrams=[b for l in sstrm for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
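For reference, here is a sketch of what the corrected slice yields for one sentence from the question. It returns every adjacent pair. If only the non-overlapping pairs shown in the desired output are wanted, stepping through the tokens two at a time is one option (that reading of the question is an assumption):

```python
sentence = "your message communication sent"
tokens = sentence.split(" ")

# Corrected slice: every adjacent pair.
all_pairs = list(zip(tokens[:-1], tokens[1:]))
print(all_pairs)
# [('your', 'message'), ('message', 'communication'), ('communication', 'sent')]

# Non-overlapping pairs: tokens 0-1, 2-3, and so on.
disjoint_pairs = list(zip(tokens[::2], tokens[1::2]))
print(disjoint_pairs)
# [('your', 'message'), ('communication', 'sent')]
```

With an odd number of tokens, the stepped version silently drops the last word, since `zip` stops at the shorter slice.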