2016-12-15 57 views
2

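Creating a CSV file from multiple dictionaries?

I am computing word frequencies across a large set of text files (140 documents). The final step of my work is to create a CSV file in which I can sort the frequency of each word, both per document and across all documents.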

Let's say I have:

absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005} 
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7} 
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6} 
... 
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9} 

So what I need is a CSV file, to open in Excel, that looks like this:

WORD, ABS_FREQ, DOC_1_FREQ, DOC_2_FREQ, ..., DOC_140_FREQ
hello, 0.001, 0.8, 0.2, ..., 0.1
world, 0.002, 0.9, 0.3, ..., 0.5
baby, 0.005, 0.7, 0.6, ..., 0.9

How can I do this with Python?

+1

Take a look at 'csv.DictWriter': https://docs.python.org/3/library/csv.html#csv.DictWriter
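A minimal sketch of what that comment suggests, assuming Python 3. The `columns` mapping, the `write_freq_csv` helper, and the choice to treat a missing word as frequency 0 are illustrative assumptions, not part of the question. An in-memory buffer stands in for the real output file.

```python
import csv
import io

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}

# Hypothetical mapping from output column name to source dictionary.
columns = {'ABS_FREQ': absolut_freq, 'DOC_1_FREQ': doc_1, 'DOC_2_FREQ': doc_2}


def write_freq_csv(csvfile, columns):
    """Write one row per word and one column per dictionary."""
    fieldnames = ['WORD'] + list(columns)
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # Take the union of all words so a word missing from one dict is not lost.
    words = sorted(set().union(*columns.values()))
    for word in words:
        row = {'WORD': word}
        for name, freqs in columns.items():
            # Assumption: a word absent from a document has frequency 0.
            row[name] = freqs.get(word, 0.0)
        writer.writerow(row)


# In-memory stand-in for open('merged_dicts.csv', 'w', newline='').
buffer = io.StringIO()
write_freq_csv(buffer, columns)
csv_text = buffer.getvalue()
```

Because each row is a dictionary keyed by column name, the values cannot end up in the wrong column even if the source dictionaries list their words in different orders.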

Answers

2

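You can make this a largely data-driven process that needs only the variable names of the dictionaries: first build a table that lists all the data, then use the csv module to write a transposed version of it (rows and columns swapped) to the output file.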

import csv 

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005} 
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7} 
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6} 
doc_140 ={u'hello': 0.1, u'world': 0.5, u'baby': 0.9} 

dic_names = ('absolut_freq', 'doc_1', 'doc_2', 'doc_140') # dict variable names 

namespace = globals() 
words = list(namespace[dic_names[0]]) # assume dicts all contain the same words 
table = [['WORD'] + words] # header row (becomes first column of output) 

for dic_name in dic_names: # add values from each dictionary given its name 
    dic = namespace[dic_name] 
    # look each value up by word so it lines up with the header row, 
    # even if this dict stores its keys in a different order 
    table.append([dic_name.upper()+'_FREQ'] + [dic[word] for word in words]) 

# Use open('merged_dicts.csv', 'wb') for Python 2. 
with open('merged_dicts.csv', 'w', newline='') as csvfile: 
    csv.writer(csvfile).writerows(zip(*table)) 

print('done') 

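CSV file produced (the row order follows the dictionary's key order, so it can differ between Python versions):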

WORD,ABSOLUT_FREQ_FREQ,DOC_1_FREQ,DOC_2_FREQ,DOC_140_FREQ 
world,0.002,0.9,0.3,0.5 
baby,0.005,0.7,0.6,0.9 
hello,0.001,0.8,0.2,0.1 
2

However you want to write this data out, you first need an ordered data structure, for example a 2D list:

docs = [] 
docs.append({u'hello':0.001, u'world':0.002, u'baby':0.005}) 
docs.append({u'hello':0.8, u'world':0.9, u'baby':0.7}) 
docs.append({u'hello':0.2, u'world':0.3, u'baby':0.6}) 
docs.append({u'hello':0.1, u'world':0.5, u'baby':0.9}) 
words = docs[0].keys() 
result = [ [word] + [ doc[word] for doc in docs ] for word in words ] 

Then you can use the built-in csv module: https://docs.python.org/2/library/csv.html
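A short sketch of that last step, assuming Python 3. The header names are a hypothetical choice that treats the first dictionary as the absolute frequencies and the rest as per-document frequencies. An in-memory buffer is used here in place of a real file.

```python
import csv
import io

docs = [
    {u'hello': 0.001, u'world': 0.002, u'baby': 0.005},  # absolute frequencies
    {u'hello': 0.8, u'world': 0.9, u'baby': 0.7},
    {u'hello': 0.2, u'world': 0.3, u'baby': 0.6},
]
words = list(docs[0])
result = [[word] + [doc[word] for doc in docs] for word in words]

# Hypothetical header: first dict is ABS_FREQ, the others are numbered documents.
header = ['WORD', 'ABS_FREQ'] + ['DOC_%d_FREQ' % n for n in range(1, len(docs))]

# Replace io.StringIO() with open('merged_dicts.csv', 'w', newline='') for a real file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)    # one header line
writer.writerows(result)   # one line per word
csv_text = out.getvalue()
```

Since `result` already holds one list per word, no transposition is needed before writing.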

3

You can also convert the data to a Pandas DataFrame and either save it as a CSV file or keep analysing it in a clean format.

absolut_freq= {u'hello':0.001, u'world':0.002, u'baby':0.005} 
doc_1= {u'hello':0.8, u'world':0.9, u'baby':0.7} 
doc_2= {u'hello':0.2, u'world':0.3, u'baby':0.6} 
doc_140={u'hello':0.1, u'world':0.5, u'baby':0.9} 


all_dicts = [absolut_freq, doc_1, doc_2, doc_140] # named all_dicts to avoid shadowing the built-in all() 

# if you have a bunch of docs, you could use enumerate and then format the colname as you iterate over and create the dataframe 
colnames = ['AbsoluteFreq', 'Doc1', 'Doc2', 'Doc140'] 


import pandas as pd 

masterdf = pd.DataFrame() 

for d in all_dicts: 
    df = pd.DataFrame([d]).T # one column per dict, indexed by word 
    masterdf = pd.concat([masterdf, df], axis=1) 

# assign the column names 
masterdf.columns = colnames 

# get a glimpse of what the data frame looks like 
masterdf.head() 

# save to csv 
masterdf.to_csv('docmatrix.csv', index=True) 

# and to sort the dataframe by frequency 
# (DataFrame.sort() was removed from pandas; sort_values() replaces it) 
masterdf.sort_values(['AbsoluteFreq']) 
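The `enumerate` idea mentioned in the comment above could look roughly like this. It is only a sketch: the `docs` list and the sequential `Doc1`, `Doc2` column names are assumptions for illustration, and it builds the frame in one step from a dict of dicts instead of concatenating in a loop.

```python
import pandas as pd

absolut_freq = {u'hello': 0.001, u'world': 0.002, u'baby': 0.005}
doc_1 = {u'hello': 0.8, u'world': 0.9, u'baby': 0.7}
doc_2 = {u'hello': 0.2, u'world': 0.3, u'baby': 0.6}

# Hypothetical list holding the per-document dictionaries in order.
docs = [doc_1, doc_2]

# Outer keys become column names, inner keys (the words) become the index.
columns = {'AbsoluteFreq': absolut_freq}
for n, doc in enumerate(docs, start=1):
    columns['Doc%d' % n] = doc

masterdf = pd.DataFrame(columns)
masterdf.index.name = 'Word'

# Most frequent words first.
masterdf = masterdf.sort_values('AbsoluteFreq', ascending=False)

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = masterdf.to_csv()
```

Passing a filename, as in `masterdf.to_csv('docmatrix.csv')`, writes the same content to disk.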
+0

Thanks datawrestler!! It works well! – CosimoCD