2015-10-04 112 views
2

Count the frequency of words in a list and remove unpopular words

I have the following list of data:

data = [['Biz_Innovations', '#socialmedia'], 
['ChantalGrange', '#aws'], 
['beyonddevops', '#aws'], 
['beyonddevops', '#socialmedia'], 
['IBMNetezza', '#ibm'], 
['IBMNetezza', '#analytics'], 
['SandraFeinsmith', '#ibm'], 
['SandraFeinsmith', '#analytics'], 
['fleejack', '#healhcare'], 
['bigdataweek', '#socialmedia'], 
['sabumjung', '#aws']] 

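I want to count how often each word in the second column occurs (e.g., #socialmedia, #aws) and then select rows based on that frequency. If a word appears three or more times in the dataset, I want to keep the corresponding rows (and delete the others). So the result would look like this: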

data = [['Biz_Innovations', '#socialmedia'], 
['ChantalGrange', '#aws'], 
['beyonddevops', '#aws'], 
['beyonddevops', '#socialmedia'], 
['bigdataweek', '#socialmedia'], 
['sabumjung', '#aws']] 

Any suggestions?

+0

'collections.Counter(map(operator.itemgetter(1), data))' will help you a lot. – ozgur

+0

@RobWatts Updated. – ozgur

Answers

2
>>> import collections, operator 
>>> words = collections.Counter(map(operator.itemgetter(1), data)) 
>>> populars = [p for p in data if words[p[1]] >= 3] 
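
To see why this works, it may help to look at the intermediate counts. The following is a minimal standalone sketch of the same idea, assuming `data` is the list from the question; `Counter` maps each hashtag to the number of rows it appears in, and the list comprehension keeps only rows whose hashtag reaches the threshold of 3.

```python
import collections
import operator

# Sample data copied from the question.
data = [['Biz_Innovations', '#socialmedia'],
        ['ChantalGrange', '#aws'],
        ['beyonddevops', '#aws'],
        ['beyonddevops', '#socialmedia'],
        ['IBMNetezza', '#ibm'],
        ['IBMNetezza', '#analytics'],
        ['SandraFeinsmith', '#ibm'],
        ['SandraFeinsmith', '#analytics'],
        ['fleejack', '#healhcare'],
        ['bigdataweek', '#socialmedia'],
        ['sabumjung', '#aws']]

# itemgetter(1) extracts the second element (the hashtag) of each row;
# Counter tallies how many times each hashtag occurs.
words = collections.Counter(map(operator.itemgetter(1), data))

# Keep only the rows whose hashtag occurs three or more times.
populars = [p for p in data if words[p[1]] >= 3]

print(dict(words))
print(populars)
```

Note that looking up a key that is absent from a `Counter` returns 0 rather than raising `KeyError`, so the comparison is safe for any hashtag.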
+0

Thanks for the great suggestion! – kevin

1
In [16]: from collections import Counter 

In [17]: keepers = [a[0] for a in Counter(d[1] for d in data).items() if a[1]>=3] 

In [18]: [d for d in data if d[1] in keepers] 
Out[18]: 
[['Biz_Innovations', '#socialmedia'], 
['ChantalGrange', '#aws'], 
['beyonddevops', '#aws'], 
['beyonddevops', '#socialmedia'], 
['bigdataweek', '#socialmedia'], 
['sabumjung', '#aws']] 
+0

Thanks for the excellent suggestion! – kevin

1

You can use collections.Counter for this:

import collections 
counts = collections.Counter(tag for (_, tag) in data) 
data = [[val, tag] for (val, tag) in data if counts[tag] >= 3] 
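
If the threshold is likely to change, the approach above could be wrapped in a small helper. This is only a sketch: the function name `keep_frequent_tags` and the `min_count` parameter are made up for illustration and are not part of `collections`.

```python
from collections import Counter


def keep_frequent_tags(rows, min_count=3):
    """Return the rows whose tag (second element) occurs at least min_count times.

    Hypothetical helper for illustration; the input list is left unmodified.
    """
    counts = Counter(tag for _, tag in rows)
    return [[val, tag] for val, tag in rows if counts[tag] >= min_count]


# Sample data copied from the question.
data = [['Biz_Innovations', '#socialmedia'],
        ['ChantalGrange', '#aws'],
        ['beyonddevops', '#aws'],
        ['beyonddevops', '#socialmedia'],
        ['IBMNetezza', '#ibm'],
        ['IBMNetezza', '#analytics'],
        ['SandraFeinsmith', '#ibm'],
        ['SandraFeinsmith', '#analytics'],
        ['fleejack', '#healhcare'],
        ['bigdataweek', '#socialmedia'],
        ['sabumjung', '#aws']]

filtered = keep_frequent_tags(data)
print(filtered)
```

Returning a new list instead of rebinding `data` keeps the original rows available, which can be handy when trying several thresholds.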
+0

Thanks for the great suggestion! – kevin