2016-11-08 59 views
1

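Reading multiple CSV files with different headers into one DataFrame in Python

I have a few dozen CSV files with similar (but not always identical) headers. For example, one has: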

Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site 

And another has:

Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site 

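(Note that one lacks "U_Global" and "U_IR", and the other has "Diffuse2" instead of "Diffuse".)

I know how to pass multiple CSVs into my script, but how do I make each CSV pass values only to the columns it actually has? Ideally every other column in that row would get "NaN".

Ideally I would end up with something like: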

'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site' 
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER" 
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER" 
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT" 
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT" 

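One more thing to note: this DataFrame needs to be appended to. It has to persist while multiple CSV files are passed to it. I will probably have it write out its own CSV at the end (it will eventually go to NETCDF4).

Answers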

1

First, run through all the files to build the full list of headers:

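import os

csv_path = './csv_files'
csv_separator = ','

full_headers = []
for fn in os.listdir(csv_path):
    # os.listdir returns bare file names, so join them with the directory
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().strip().split(csv_separator)
        # append only the headers not seen yet, keeping first-seen order
        full_headers += [h for h in headers if h not in full_headers]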

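Then write the header line to your output file, and run through all the files a second time to fill it in.

You can use csv.DictReader(open('myfile.csv')) to match each value to its column by header name.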
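
The second pass could look roughly like the sketch below. It is only an illustration: the function name `merge_csv_streams` and the in-memory sample data are made up here, and real files would be opened with `open(..., newline='')` instead of `io.StringIO`. It relies on the `restval` argument of `csv.DictWriter`, which fills in any column a row does not provide.

```python
import csv
import io

def merge_csv_streams(streams, full_headers, out, missing="NaN"):
    """Write the rows of several CSV streams under one shared header.

    Columns that a given stream lacks are filled with `missing`.
    (Hypothetical helper, not part of the original answer.)
    """
    writer = csv.DictWriter(out, fieldnames=full_headers, restval=missing)
    writer.writeheader()
    for stream in streams:
        # DictReader keys each value by its header, so column order
        # in the individual files does not matter
        for row in csv.DictReader(stream):
            writer.writerow(row)

# In-memory stand-ins for two files with different headers (made-up data)
a = io.StringIO("Year,Direct,Diffuse\n1992,-999.00,-999.00\n")
b = io.StringIO("Year,Direct,Diffuse2,U_IR\n2004,1.04,79.40,310.95\n")

out = io.StringIO()
merge_csv_streams([a, b], ["Year", "Direct", "Diffuse", "Diffuse2", "U_IR"], out)
print(out.getvalue())
```

Because rows are written one at a time, this approach never holds more than one row in memory, which may matter with a few dozen large files.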

1

Suppose you have the following CSV files:

test1.csv:

year,month,day,Direct 
1992,1,1,11 
2013,5,30,11 
2004,9,1,11 

test2.csv:

year,month,day,Direct,Direct2 
1992,1,1,21,201 
2013,5,30,21,202 
2004,9,1,21,203 

test3.csv:

year,month,day,File3 
1992,1,1,text1 
2013,5,30,text2 
2004,9,1,text3 
2016,1,1,unmatching_date 

Solution:

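import glob
import pandas as pd

# sorted() makes the merge order deterministic; glob returns files in arbitrary order
files = sorted(glob.glob(r'd:/temp/test*.csv'))

def get_merged(files, **kwargs):
    # Read the first file, then outer-merge every remaining file into it.
    # With no `on=` argument, merge joins on all column names the frames share.
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))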

Output:

   year  month  day  Direct  Direct  Direct2            File3
0  1992      1    1    11.0    21.0    201.0            text1
1  2013      5   30    11.0    21.0    202.0            text2
2  2004      9    1    11.0    21.0    203.0            text3
3  2016      1    1     NaN     NaN      NaN  unmatching_date
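
Note that `merge` without an `on=` argument joins on every column name the two frames share, so the result depends on which names happen to overlap. A variant that names the key columns explicitly might look like the sketch below. It assumes the keys are `year`, `month` and `day`; the helper name `merge_on_keys` and the in-memory sample data are illustrative only.

```python
import io
from functools import reduce

import pandas as pd

KEYS = ["year", "month", "day"]  # assumed key columns

def merge_on_keys(frames, keys):
    """Outer-merge a list of DataFrames on an explicit list of key columns."""
    return reduce(lambda left, right: left.merge(right, on=keys, how="outer"), frames)

# In-memory stand-ins for two of the sample files (shortened)
f1 = pd.read_csv(io.StringIO(
    "year,month,day,Direct\n"
    "1992,1,1,11\n"
    "2013,5,30,11\n"
))
f3 = pd.read_csv(io.StringIO(
    "year,month,day,File3\n"
    "1992,1,1,text1\n"
    "2016,1,1,unmatching_date\n"
))

merged = merge_on_keys([f1, f3], KEYS)
print(merged)
```

If two files share a non-key column name, pandas keeps both and distinguishes them with suffixes (`_x` and `_y` by default) instead of silently joining on that column.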

UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution won't work here, because it joins by index:

In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True) 
Out[192]: 
   Direct  Direct  Direct2            File3  day  month  year
0     NaN    11.0      NaN              NaN    1      1  1992
1     NaN    11.0      NaN              NaN   30      5  2013
2     NaN    11.0      NaN              NaN    1      9  2004
3    21.0     NaN    201.0              NaN    1      1  1992
4    21.0     NaN    202.0              NaN   30      5  2013
5    21.0     NaN    203.0              NaN    1      9  2004
6     NaN     NaN      NaN            text1    1      1  1992
7     NaN     NaN      NaN            text2   30      5  2013
8     NaN     NaN      NaN            text3    1      9  2004
9     NaN     NaN      NaN  unmatching_date    1      1  2016

In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True) 
Out[193]: 
        0    1     2     3       4    5     6     7      8     9  10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

Or with index_col=None passed explicitly:

In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True) 
Out[194]: 
   Direct  Direct  Direct2            File3  day  month  year
0     NaN    11.0      NaN              NaN    1      1  1992
1     NaN    11.0      NaN              NaN   30      5  2013
2     NaN    11.0      NaN              NaN    1      9  2004
3    21.0     NaN    201.0              NaN    1      1  1992
4    21.0     NaN    202.0              NaN   30      5  2013
5    21.0     NaN    203.0              NaN    1      9  2004
6     NaN     NaN      NaN            text1    1      1  1992
7     NaN     NaN      NaN            text2   30      5  2013
8     NaN     NaN      NaN            text3    1      9  2004
9     NaN     NaN      NaN  unmatching_date    1      1  2016

In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True) 
Out[195]: 
        0    1     2     3       4    5     6     7      8     9  10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

The following more idiomatic solution works, but it changes the original order of the rows and columns:

In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')] 
    ...: 
    ...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs])) 
    ...: 
    ...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index() 
    ...: 
Out[224]: 
   month  day  year  Direct  Direct  Direct2            File3
0      1    1  1992    11.0    21.0    201.0            text1
1      1    1  2016     NaN     NaN      NaN  unmatching_date
2      5   30  2013    11.0    21.0    202.0            text2
3      9    1  2004    11.0    21.0    203.0            text3
+0

Using merge like this is non-idiomatic and non-performant here; appending to a list and then calling concat is the pattern – Jeff

+0

@Jeff, how would I merge on the common columns using 'concat'? – MaxU

+0

Try it; by definition it combines along the non-concat axis. Joining is a different operation – Jeff
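
For reference, the list-then-concat pattern mentioned in the comments can be sketched as follows. This is a minimal illustration on made-up, in-memory data shaped like the question's files, not code from either commenter. It stacks rows (`axis=0`), so each file contributes its own rows and any column a file lacks is filled with NaN.

```python
import io

import pandas as pd

# Made-up stand-ins for two files with different headers
sources = [
    "Year,Direct,Diffuse,Test_Site\n1992,-999.00,-999.00,BER\n",
    "Year,Direct,Diffuse2,U_IR,Test_Site\n2004,1.04,79.40,310.95,ALT\n",
]

# Collect every frame in a list first, then concatenate once at the end
frames = [pd.read_csv(io.StringIO(text)) for text in sources]
stacked = pd.concat(frames, axis=0, ignore_index=True, sort=False)
print(stacked)
```

Rows are not matched against each other here; if rows from different files must be combined by date, that is a join, which is what the merge-based code in the answer does.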
