2015-11-04 92 views
0

Iterating over multiple values in a dictionary?

I have a list of words:

word_list = ["it's","they're","there's","he's"] 

and a dictionary with information on how frequently the words in word_list appear in several documents:

dict = [('document1',{"it's": 0,"they're": 2,"there's": 5,"he's": 1}), 
('document2',{"it's": 4,"they're": 2,"there's": 3,"he's": 0}), 
('document3',{"it's": 7,"they're": 0,"there's": 4,"he's": 1})] 

I'd like to build a data structure (a DataFrame, perhaps?) that looks like the following:

file  word  count 
document1 it's  0 
document1 they're  2 
document1 there's  5 
document1 he's  1 
document2 it's  4 
document2 they're  2 
document2 there's  3 
document2 he's  0 
document3 it's  7 
document3 they're  0 
document3 there's  4 
document3 he's  1 

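I'm trying to find which of these words is used most often across the documents. I have more than 900 documents.

I was thinking of something like the following: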

res = {} 
for i in words_list: 
    count = 0 
    for j in dict.items(): 
     if i == j: 
       count = count + 1 
       res[i,j] = count 

Where do I go from here?

+0

That's not a dictionary by any stretch. – user2357112

+0

You should use the Python pandas library to create the kind of DataFrame you show in your post. –

+0

Where do I start? Are there any methods I should look at? – blacksite

Answers

2

Well, first things first: your dictionary is not a dictionary. It should be built like this

d = {'document1':{"it's": 0,"they're": 2,"there's": 5,"he's": 1}, 
    'document2':{"it's": 4,"they're": 2,"there's": 3,"he's": 0}, 
    'document3':{"it's": 7,"they're": 0,"there's": 4,"he's": 1}} 
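If the data already exists as the list of (name, counts) tuples shown in the question, one way to get this shape is to pass that list to dict(), which accepts an iterable of key/value pairs. A minimal sketch; the names pairs and d are only illustrative:

```python
# The question's data: a list of (document name, word counts) tuples.
pairs = [
    ('document1', {"it's": 0, "they're": 2, "there's": 5, "he's": 1}),
    ('document2', {"it's": 4, "they're": 2, "there's": 3, "he's": 0}),
    ('document3', {"it's": 7, "they're": 0, "there's": 4, "he's": 1}),
]

# dict() accepts any iterable of (key, value) pairs,
# so each tuple becomes one entry keyed by document name.
d = dict(pairs)

print(d['document2']["there's"])
```

This only works cleanly if every document name is unique; with duplicate names, the later tuple would overwrite the earlier one.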

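With an actual dictionary we can build a DataFrame with pandas, but to get it in the shape you want we have to build a list of lists from the dictionary first. Then we create the DataFrame, label the columns, and sort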

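import pandas as pd 

d = {'document1':{"it's": 0,"they're": 2,"there's": 5,"he's": 1}, 
    'document2':{"it's": 4,"they're": 2,"there's": 3,"he's": 0}, 
    'document3':{"it's": 7,"they're": 0,"there's": 4,"he's": 1}} 

# Flatten the nested dict into one [file, word, count] row per pair
d = pd.DataFrame([[k,k1,v1] for k,v in d.items() for k1,v1 in v.items()], columns = ['File','Words','Count']) 
# DataFrame.sort() was removed from pandas; sort_values() replaces it.
# The index labels in the output depend on the dict's iteration order.
print(d.sort_values(['File','Count'], ascending=[True,True])) 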

         File    Words  Count
1   document1     it's      0
0   document1     he's      1
3   document1  they're      2
2   document1  there's      5
4   document2     he's      0
7   document2  they're      2
6   document2  there's      3
5   document2     it's      4
11  document3  they're      0
8   document3     he's      1
10  document3  there's      4
9   document3     it's      7
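Since the stated goal is to find the most frequently used word across all documents, one possible next step (not part of the original answer) is to sum the counts per word. The sketch below rebuilds the same three-column frame under the illustrative names counts, df, and totals:

```python
import pandas as pd

counts = {
    'document1': {"it's": 0, "they're": 2, "there's": 5, "he's": 1},
    'document2': {"it's": 4, "they're": 2, "there's": 3, "he's": 0},
    'document3': {"it's": 7, "they're": 0, "there's": 4, "he's": 1},
}

# One row per (file, word) pair, in the same shape as the frame above.
df = pd.DataFrame(
    [[name, word, n] for name, words in counts.items() for word, n in words.items()],
    columns=['File', 'Words', 'Count'],
)

# Sum each word's count over all files, largest total first.
totals = df.groupby('Words')['Count'].sum().sort_values(ascending=False)
print(totals)

# Label of the word with the largest total.
print(totals.idxmax())
```

The same groupby should scale to the 900+ files mentioned in the question, since the long format has only one row per file and word.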

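If you want the top n occurrences, you can use groupby() and then either head() or tail() on the sorted frame. With the ascending sort used here, head(2) keeps the two lowest counts in each file, as in the output below, and tail(2) keeps the two highest.

d = d.sort_values(['File','Count'], ascending=[True,True]).groupby('File').head(2) 

         File    Words  Count
1   document1     it's      0
0   document1     he's      1
4   document2     he's      0
7   document2  they're      2
11  document3  they're      0
8   document3     he's      1

The list comprehension returns a list of lists that looks like this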

d = [['document1', "he's", 1], ['document1', "it's", 0], ['document1', "there's", 5], ['document1', "they're", 2], 
    ['document2', "he's", 0], ['document2', "it's", 4], ['document2', "there's", 3], ['document2', "they're", 2], 
    ['document3', "he's", 1], ['document3', "it's", 7], ['document3', "there's", 4], ['document3', "they're", 0]] 

To build the dictionary properly, you would just use something like

d['document1']['it\'s'] = 1 

If for some reason you are stuck with the list of (str, dict) tuples, you can use this list comprehension instead

[[i[0],k1,v1] for i in d for k1,v1 in i[1].items()] 
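
Putting the pieces together for data that is still a list of (str, dict) tuples, the following is a rough sketch rather than the answer's exact code. It sorts counts in descending order within each file so that head(2) returns the two most frequent words per file. The names pairs, rows, df, and top2 are illustrative:

```python
import pandas as pd

# Same shape as the question's data: a list of (str, dict) tuples.
pairs = [
    ('document1', {"it's": 0, "they're": 2, "there's": 5, "he's": 1}),
    ('document2', {"it's": 4, "they're": 2, "there's": 3, "he's": 0}),
    ('document3', {"it's": 7, "they're": 0, "there's": 4, "he's": 1}),
]

# Tuple unpacking replaces the i[0] / i[1] indexing of the one-liner above.
rows = [[name, word, n] for name, words in pairs for word, n in words.items()]
df = pd.DataFrame(rows, columns=['File', 'Words', 'Count'])

# Files ascending, counts descending, so head(2) keeps
# each file's two most frequent words.
top2 = (
    df.sort_values(['File', 'Count'], ascending=[True, False])
      .groupby('File')
      .head(2)
)
print(top2)
```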
+0

Nice answer. One question: 'd.sort(['File','Count'], ascending=[1,1])' also changes the order of the index. Is there any particular reason you do that? –

+0

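@JoeR I just changed it so the files are ordered from low to high, and then the counts likewise. It isn't necessary, but I think it looks a little better. – SirParselot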

1

How about something like this?

word_list = ["it's","they're","there's","he's"] 

frequencies = [('document1',{"it's": 0,"they're": 2,"there's": 5,"he's": 1}), 
               ('document2',{"it's": 4,"they're": 2,"there's": 3,"he's": 0}), 
               ('document3',{"it's": 7,"they're": 0,"there's": 4,"he's": 1})] 

# Build one record per (document, word) pair.
# document[0] is the file name, document[1] is its dict of counts.
result = [] 
for document in frequencies: 
    for word in word_list: 
        result.append({"file":document[0], "word":word,"count":document[1][word]}) 

print(result) 
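
If a DataFrame like the one in the question is the end goal, a list of records in this shape can be passed straight to pandas. A minimal sketch with a shortened, hand-written result list; the name table is illustrative:

```python
import pandas as pd

# A few records in the shape produced by the loop above.
result = [
    {"file": "document1", "word": "it's", "count": 0},
    {"file": "document1", "word": "they're", "count": 2},
    {"file": "document2", "word": "it's", "count": 4},
]

# A list of dicts becomes one row per dict;
# `columns` fixes the column order.
table = pd.DataFrame(result, columns=["file", "word", "count"])
print(table)
```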
+0

I get the following error: 'TypeError: string indices must be integers, not str'. I can't use the word itself as an index – blacksite

+0

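Are you running the code with the same data as I am? The only place it could fail is 'document[1][word]', and all the keys in 'document[1]' are strings in the data provided, so it shouldn't fail. Edit: on second thought, the error means you are trying to index a string with another string. Does your frequencies list contain any raw strings? – Jephron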

+0

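I don't think so. Admittedly, this is much simpler than the actual data I'm using. It follows exactly the same structure, but "frequencies" is just much easier to talk about – blacksite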
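
One way to check Jephron's suggestion against the real data is to scan for entries that are not (str, dict) tuples. The error is consistent with such an entry: if an element of the list is a bare string, 'document[1]' is a single character, and indexing that character with a word raises a TypeError. The sample data and the helper name find_bad_entries below are hypothetical, made up only to illustrate the check:

```python
# Hypothetical sample: one well-formed entry and two malformed ones.
sample = [
    ('document1', {"it's": 0, "they're": 2}),
    'document2',                  # a bare string instead of a tuple
    ('document3', "it's: 7"),     # counts stored as a string, not a dict
]

def find_bad_entries(frequencies):
    """Return (index, entry) pairs that are not (name, dict) tuples."""
    bad = []
    for i, entry in enumerate(frequencies):
        is_pair = isinstance(entry, tuple) and len(entry) == 2
        if not (is_pair and isinstance(entry[1], dict)):
            bad.append((i, entry))
    return bad

for index, entry in find_bad_entries(sample):
    print(index, repr(entry))
```

An empty result on the real data would suggest the problem lies elsewhere, for example in how the loop variables are unpacked.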