2016-12-28 89 views
1

Python + CSV: Summarize similar values from a CSV column

Input file:

$ cat dummy.csv 
OS,A,B,C,D,E 
Ubuntu,0,1,0,1,1 
Windows,0,0,1,1,1 
Mac,1,0,1,0,0 
Ubuntu,1,1,1,1,0 
Windows,0,0,1,1,0 
Mac,1,0,1,1,1 
Ubuntu,0,1,0,1,1 
Ubuntu,0,0,1,1,1 
Ubuntu,1,0,1,0,0 
Ubuntu,1,1,1,1,0 
Mac,0,0,1,1,0 
Mac,1,0,1,1,1 
Windows,1,1,1,1,0 
Ubuntu,0,0,1,1,0 
Windows,1,0,1,1,1 
Mac,0,1,0,1,1 
Windows,0,0,1,1,1 
Mac,1,0,1,0,0 
Windows,1,1,1,1,0 
Mac,0,0,1,1,0 

Expected output:

OS,A,B,C,D,E 
Mac,4,1,6,5,3 
Ubuntu,3,4,5,6,3 
Windows,3,2,6,6,3 

I produced the output above with an Excel pivot table.

My code:

import csv 
import pprint 
from collections import defaultdict 

d = defaultdict(dict) 

with open('dummy.csv') as csvfile: 
    reader = csv.DictReader(csvfile) 
    for row in reader: 
     d[row['OS']]['A'] += row['A'] 
     d[row['OS']]['B'] += row['B'] 
     d[row['OS']]['C'] += row['C'] 
     d[row['OS']]['D'] += row['D'] 
     d[row['OS']]['E'] += row['E'] 

pprint.pprint(d) 

Error:

$ python3 dummy.py 
Traceback (most recent call last): 
    File "dummy.py", line 10, in <module> 
    d[row['OS']]['A'] += row['A'] 
KeyError: 'A' 

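My idea is to accumulate the CSV values in a dictionary and print them later. However, when I try to add the values, I get the error above.

This looks achievable with the built-in csv module. I thought this would be an easy one :( Any pointers would be a great help.

Answers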

1

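There are two problems: the nested dictionaries start out with no keys set, so d[row['OS']]['A'] raises an error; and the column values need to be converted to int before you add them.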
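
Both problems can be reproduced in isolation. The sketch below is only an illustration: the hand-written `row` is an assumption standing in for what csv.DictReader yields for one line of dummy.csv.

```python
from collections import defaultdict

d = defaultdict(dict)
row = {'OS': 'Ubuntu', 'A': '0'}  # hypothetical row; DictReader yields strings

# Problem 1: the outer lookup creates an empty dict, but the inner key is missing.
try:
    d[row['OS']]['A'] += 1
    inner_key_error = False
except KeyError:
    inner_key_error = True

# Problem 2: CSV values are strings, so + concatenates instead of adding.
as_text = row['A'] + row['A']              # '00'
as_number = int(row['A']) + int(row['A'])  # 0

print(inner_key_error, repr(as_text), as_number)
```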

You can use a defaultdict of Counter, since a Counter defaults missing keys to 0:

import csv 
from collections import Counter, defaultdict 

d = defaultdict(Counter) 

with open('dummy.csv') as csvfile: 
    reader = csv.DictReader(csvfile) 

    for row in reader: 
     nested = d[row.pop('OS')] 
     for k, v in row.items(): 
      nested[k] += int(v) 

print(*d.items(), sep='\n') 

Output:

('Ubuntu', Counter({'D': 6, 'C': 5, 'B': 4, 'E': 3, 'A': 3})) 
('Windows', Counter({'C': 6, 'D': 6, 'E': 3, 'A': 3, 'B': 2})) 
('Mac', Counter({'C': 6, 'D': 5, 'A': 4, 'E': 3, 'B': 1})) 
0

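d is a dictionary, so d[row['OS']] is a valid expression, but d[row['OS']]['A'] expects the dictionary item to be some kind of collection. Since you didn't provide a default value, it will be None instead, which is not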

1

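This doesn't exactly answer your question, since the problem can certainly be solved with csv, but it's worth mentioning that pandas is a great fit for this kind of thing: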

In [1]: import pandas as pd 

In [2]: df = pd.read_csv('dummy.csv') 

In [3]: df.groupby('OS').sum() 
Out[3]: 
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3
+0

+1. However, I prefer 'csv' for this job, because it avoids installing a new package, which isn't practical on the server I'm working with. – slayedbylucifer

1

Something like this? You can write the DataFrame to a csv file to get the desired format.

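import pandas as pd
# Load the question's input file. (The original answer read the data
# from the clipboard with pd.read_clipboard(sep=',') instead.)
df0 = pd.read_csv('dummy.csv')
df = df0.copy()
# Sum every numeric column within each OS group.
df = df.groupby(by='OS').sum()
print(df)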

Output:

         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

df.to_csv('file01') 

file01

OS,A,B,C,D,E 
Mac,4,1,6,5,3 
Ubuntu,3,4,5,6,3 
Windows,3,2,6,6,3 
+0

+1. However, I prefer 'csv' for this job, because it avoids installing a new package, which isn't practical on the server I'm working with. – slayedbylucifer

+0

@slayedbylucifer Makes sense. But if you have to do a lot of these csv tasks, 'pandas' is your best bet. – MYGz

1

You get the exception because, the first time through, row['OS'] does not exist in d, so 'A' does not exist in d[row['OS']]. Try the following to fix it:

import csv 
from collections import defaultdict 

d = defaultdict(dict) 

with open('dummy.csv') as csvfile: 
    reader = csv.DictReader(csvfile) 
    for row in reader: 
     d[row['OS']]['A'] = d[row['OS']]['A'] + int(row['A']) if (row['OS'] in d and 'A' in d[row['OS']]) else int(row['A']) 
     d[row['OS']]['B'] = d[row['OS']]['B'] + int(row['B']) if (row['OS'] in d and 'B' in d[row['OS']]) else int(row['B']) 
     d[row['OS']]['C'] = d[row['OS']]['C'] + int(row['C']) if (row['OS'] in d and 'C' in d[row['OS']]) else int(row['C']) 
     d[row['OS']]['D'] = d[row['OS']]['D'] + int(row['D']) if (row['OS'] in d and 'D' in d[row['OS']]) else int(row['D']) 
     d[row['OS']]['E'] = d[row['OS']]['E'] + int(row['E']) if (row['OS'] in d and 'E' in d[row['OS']]) else int(row['E']) 

Output:

>>> import pprint 
>>> 
>>> pprint.pprint(dict(d)) 
{'Mac': {'A': 4, 'B': 1, 'C': 6, 'D': 5, 'E': 3}, 
'Ubuntu': {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 3}, 
'Windows': {'A': 3, 'B': 2, 'C': 6, 'D': 6, 'E': 3}} 
+0

+1. I never realized the keys were missing in the first place, probably because I'm used to 'autovivification' in Perl. Now I understand what I was missing. – slayedbylucifer
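
Python dictionaries do not autovivify the way Perl hashes do, but nesting one defaultdict inside another comes close. A minimal sketch, assuming a small inline sample in place of dummy.csv; the names `SAMPLE` and `totals` are made up for this illustration:

```python
import csv
import io
from collections import defaultdict

# Inline sample standing in for dummy.csv (first four data rows of the question).
SAMPLE = """OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
"""

# Two levels of defaultdict: a missing OS yields a new inner dict,
# and a missing column inside it yields 0.
totals = defaultdict(lambda: defaultdict(int))

for row in csv.DictReader(io.StringIO(SAMPLE)):
    for column in 'ABCDE':
        totals[row['OS']][column] += int(row[column])

print({name: dict(cols) for name, cols in totals.items()})
```

With this structure the `+=` from the question works unchanged once the values are wrapped in int(). One side effect worth knowing: merely reading a missing key also creates it.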

0

This extends niemmi's solution so that the output is formatted the same as the OP's example:

import csv 
from collections import Counter, defaultdict 

d = defaultdict(Counter) 
with open('dummy.csv') as csv_file: 
    reader = csv.DictReader(csv_file) 
    field_names = reader.fieldnames 
    for row in reader: 
     counter = d[row.pop('OS')] 
     for key, value in row.items():
      counter[key] += int(value)

# Header first, then one line per OS with its columns in sorted order.
print(','.join(field_names))
for os_name, counter in sorted(d.items()):
    print("%s,%s" % (os_name, ','.join(str(v) for k, v in sorted(counter.items()))))

Output:

OS,A,B,C,D,E 
Mac,4,1,6,5,3 
Ubuntu,3,4,5,6,3 
Windows,3,2,6,6,3 

Update: fixed the output.

+0

There is a bug in the sort/join in the code above, since the output is wrong. – slayedbylucifer

+0

Thanks. I forgot to sort the counter. –
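
One way to sidestep the column-order issue raised in these comments is to index each Counter by the header's field names and let csv.writer do the joining. This is a sketch rather than the answerer's code: the function name `summarize` and the inline `SAMPLE` data are assumptions made for illustration.

```python
import csv
import io
import sys
from collections import Counter, defaultdict

# Inline sample standing in for dummy.csv (first four data rows of the question).
SAMPLE = """OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
"""

def summarize(infile, outfile, key='OS'):
    """Sum every non-key column per key value and write the result as CSV."""
    reader = csv.DictReader(infile)
    columns = [name for name in reader.fieldnames if name != key]
    totals = defaultdict(Counter)
    for row in reader:
        for name in columns:
            totals[row[key]][name] += int(row[name])
    writer = csv.writer(outfile, lineterminator='\n')
    writer.writerow([key] + columns)
    for group in sorted(totals):
        # Index by header name so the column order never depends on sorting.
        writer.writerow([group] + [totals[group][name] for name in columns])

summarize(io.StringIO(SAMPLE), sys.stdout)
```

For the real file, the call would be along the lines of `with open('dummy.csv', newline='') as f: summarize(f, sys.stdout)`.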

0

I assume your input file is called input_file.csv

You can also process the data with groupby from the itertools module and get the desired output, as in the example below:

from itertools import groupby 

data = list(k.strip("\n").split(",") for k in open("input_file.csv", 'r')) 

a, b = {}, {} 
for k, v in groupby(data[1:], lambda x : x[0]): 
    try: 
     a[k] += [i[1:] for i in list(v)] 
    except KeyError: 
     a[k] = [i[1:] for i in list(v)] 

for key in a.keys(): 
    for j in range(5): 
     c = 0 
     for i in a[key]: 
      c += int(i[j]) 
     try: 
      b[key] += ',' + str(c) 
     except KeyError: 
      b[key] = str(c) 

Output:

print(','.join(data[0])) 
for k in b.keys(): 
    print("{0},{1}".format(k, b[k])) 

>>> OS,A,B,C,D,E 
>>> Ubuntu,3,4,5,6,3 
>>> Windows,3,2,6,6,3 
>>> Mac,4,1,6,5,3
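
itertools.groupby only merges adjacent rows with the same key, which is why the code above needs the try/except accumulation. A possible variation, sketched here under the assumption of a small inline sample instead of input_file.csv, sorts the rows by OS first so that each OS forms exactly one group:

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Inline sample standing in for input_file.csv (first four data rows of the question).
SAMPLE = """OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
header, body = rows[0], rows[1:]

# groupby only merges adjacent rows, so sort by the OS column first.
body.sort(key=itemgetter(0))

result = []
for name, group in groupby(body, key=itemgetter(0)):
    # zip(*...) transposes the group's value columns; sum each column.
    sums = [sum(int(v) for v in column) for column in zip(*(r[1:] for r in group))]
    result.append([name] + sums)

print(','.join(header))
for line in result:
    print(','.join(map(str, line)))
```

Sorting also makes the output order deterministic (Mac, Ubuntu, Windows), matching the expected output in the question.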