Count the frequency of words in a pandas DataFrame

I have a table like the one below:

      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd

I want to count the frequency of every word in the Firm_Name column, to get output like this:

I have tried the following code:

import pandas as pd
import nltk

data = pd.read_csv(r"X:\Firm_Data.csv")  # raw string so the backslash is kept literally
top_N = 20
word_dist = nltk.FreqDist(data['Firm_Name'])
print('All frequencies')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])

print(rslt)
print('=' * 60)

However, this code does not produce counts for the individual words.


2017-10-17 J Reza

IIUC, use value_counts()

In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64
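For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the sample rows in the question (an assumption, since the original data comes from a CSV file), and the name word_counts is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; the real data is read from a CSV file.
df = pd.DataFrame({
    "URN": [104472, 104873, 109986, 114058, 113438],
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

# Split each name on whitespace into one column per token position,
# stack those columns into a single Series, then count each token.
# value_counts() ignores the missing cells left by shorter names.
word_counts = df["Firm_Name"].str.split(expand=True).stack().value_counts()
print(word_counts)
```

The order of words that share the same count is not something to rely on, so only the leading entries (Society, then Ltd) are predictable here.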

Or, with numpy imported as np,

pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or,

pd.Series(' '.join(df.Firm_Name).split()).value_counts()

For the top N, for example 3:

In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64
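If the result should be a two-column table like the Word/Frequency frame the question builds, the top-N Series can be wrapped in a DataFrame. This is a rough sketch; firm_names, top_n and rslt are illustrative names, and the sample values are copied from the question.

```python
import pandas as pd

# Firm names copied from the question's sample table.
firm_names = pd.Series([
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
])

top_n = 3

# Join all names into one string, split on whitespace, count each token.
counts = pd.Series(" ".join(firm_names).split()).value_counts()

# Keep the top_n most frequent tokens and lay them out as two columns.
top = counts.head(top_n)
rslt = pd.DataFrame({"Word": top.index, "Frequency": top.to_numpy()})
print(rslt)
```

head(top_n) does the same job as the [:3] slice above. Which of the single-occurrence words lands in third place depends on how ties are ordered.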

Details

In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd


2017-10-17 08:58:07 Zero

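I will definitely accept your answer. I am just waiting so that the question stays open for other answers. – piRSquared

'.split(expand=True).stack()' is a very clever option for small data, but it will quickly run out of memory on data of any real size. Because it expands a matrix with a column for each unique word in 'Firm_Name', data sparsity blows up the matrix columns even without many observations. –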
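One way to avoid the wide intermediate frame this comment warns about is to count tokens row by row with collections.Counter from the standard library. This is only a sketch of an alternative, not something proposed in the thread; the sample names are copied from the question and word_counts is an illustrative name.

```python
from collections import Counter

import pandas as pd

# Firm names copied from the question's sample table.
firm_names = pd.Series([
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
])

# Update a single counter one row at a time, so no padded
# rows-by-tokens frame is ever built in memory.
word_counts = Counter()
for name in firm_names.dropna():
    word_counts.update(name.split())

print(word_counts.most_common(2))
# → [('Society', 3), ('Ltd', 2)]
```

Memory use here grows with the number of distinct words rather than with rows times the longest name, at the cost of a Python-level loop.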

You need str.cat with lower first to concatenate all the values into one string, then word_tokenize, and finally your own solution:

top_N = 4
# lowercase everything first (skip .str.lower() if case matters)
a = data['Firm_Name'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
print(word_dist)
<FreqDist with 17 samples and 20 outcomes>

rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
      Word  Frequency
0  society          3
1      ltd          2
2      the          1
3       co          1

You can also drop lower if you do not need it:

top_N = 4
a = data['Firm_Name'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
         Word  Frequency
0     Society          3
1         Ltd          2
2         MMV          1
3  Kensington          1
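The same lowercase-then-count idea can also be tried without NLTK by swapping word_tokenize for a plain whitespace split. This is a sketch under that assumption, not the answer's method: it needs Series.explode (pandas 0.25 or later), the sample names are copied from the question, and because punctuation stays attached to words the token boundaries can differ from NLTK's.

```python
import pandas as pd

# Firm names copied from the question's sample table.
firm_names = pd.Series([
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
])

top_n = 4

# Lowercase, split on whitespace into lists, flatten to one token per row.
tokens = firm_names.str.lower().str.split().explode()

# Count the tokens and reshape the top_n into Word/Frequency columns.
rslt = (
    tokens.value_counts()
    .head(top_n)
    .rename_axis("Word")
    .reset_index(name="Frequency")
)
print(rslt)
```

Removing .str.lower() gives case-sensitive counts, mirroring the second variant above.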


2017-10-17 09:05:32 jezrael

Thanks for so many great solutions –

Thanks again –

They all work, though –
