Reading multiple CSV files with different headers into one dataframe in Python

I have a few dozen csv files with similar (but not always exactly the same) headers. For instance, one has:

Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site

and another has:

Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site

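(Note that one lacks 'U_Global' and 'U_IR', and the other has 'Diffuse2' instead of 'Diffuse'.)

I know how to pass multiple csv's into my script, but how do I have the csv's only pass values to the columns in which they currently have values? And perhaps pass 'NaN' to all other columns in that row.

Ideally I'd have something like: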

'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site' 
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER" 
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER" 
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT" 
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"

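One further note: this dataframe needs to be appended to. It has to persist while multiple csv files are passed into it. I think I'll probably have it write out to its own csv at the end (it will eventually go to NETCDF4).

Source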

2016-11-08 Franklin Harvey

First, run through all the files to build the combined header (the union of every column name that appears in any file):

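import os

csv_path = './csv_files'
csv_separator = ','

full_headers = []
for fn in os.listdir(csv_path):
    with open(os.path.join(csv_path, fn), 'r') as f:
        # Read the header row and drop the trailing newline
        headers = f.readline().strip().split(csv_separator)
    # Append only the headers not seen so far, keeping first-seen order
    full_headers += [h for h in headers if h not in full_headers]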

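Then write your header line to your output file, and run through all the files again to fill it.

You can use csv.DictReader(open('myfile.csv')) to easily match each value to its designated column by header name.

Source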
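As a rough sketch of that two-pass idea using only the standard library: the function names `collect_headers` and `merge_csv_files`, and the 'NaN' filler string, are my own choices for illustration and do not come from the question. It assumes every file has a header row and that no data row has more fields than its header.

```python
import csv


def collect_headers(paths):
    """First pass: return the union of all header names, in first-seen order."""
    full_headers = []
    for path in paths:
        with open(path, newline='') as f:
            # next(..., []) tolerates an empty file
            headers = next(csv.reader(f), [])
        full_headers += [h for h in headers if h not in full_headers]
    return full_headers


def merge_csv_files(paths, out_path, missing='NaN'):
    """Second pass: write every row of every file under the combined header."""
    full_headers = collect_headers(paths)
    with open(out_path, 'w', newline='') as out:
        # restval is written for any column a given row does not have
        writer = csv.DictWriter(out, fieldnames=full_headers, restval=missing)
        writer.writeheader()
        for path in paths:
            with open(path, newline='') as f:
                for row in csv.DictReader(f):
                    writer.writerow(row)
    return full_headers
```

You would call it with a list of input paths and an output path, for example merge_csv_files(sorted(glob.glob('./csv_files/*.csv')), 'combined.csv') after importing glob. Keep the output file outside the input folder so a later run does not read it back in.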

2016-11-08 19:14:10

Can't pandas take care of this automatically?

http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append

If your indexes overlap, don't forget to add 'ignore_index=True'.

Source
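A minimal sketch of what that looks like, assuming the goal is to stack rows and leave NaN wherever a file lacks a column. The two in-memory "files" below are made up for illustration. Note that DataFrame.append was removed in pandas 2.0, so this uses pd.concat, which does the same row-wise stacking.

```python
import io

import pandas as pd

# Two hypothetical in-memory files standing in for CSVs with different headers
csv_a = io.StringIO("Year,Month,Diffuse,Test_Site\n1992,1,-999.0,BER\n")
csv_b = io.StringIO("Year,Month,Diffuse2,U_IR,Test_Site\n2004,9,79.4,310.95,ALT\n")

frames = [pd.read_csv(f) for f in (csv_a, csv_b)]

# Rows are stacked; the columns become the union of all headers,
# and a row gets NaN in every column its source file did not have.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```

Calling combined.to_csv('combined.csv', index=False, na_rep='NaN') afterwards would write the result out with explicit 'NaN' markers; the file name is only an example.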

2016-11-08 19:21:43

Actually, append will merge the DFs differently (compared to what the OP wants)... – MaxU

Assume you have the following CSV files:

test1.csv:

year,month,day,Direct 
1992,1,1,11 
2013,5,30,11 
2004,9,1,11

test2.csv:

year,month,day,Direct,Direct2 
1992,1,1,21,201 
2013,5,30,21,202 
2004,9,1,21,203

test3.csv:

year,month,day,File3 
1992,1,1,text1 
2013,5,30,text2 
2004,9,1,text3 
2016,1,1,unmatching_date

Solution:

import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))

Output:

   year  month  day  Direct  Direct  Direct2            File3
0  1992      1    1    11.0    21.0    201.0            text1
1  2013      5   30    11.0    21.0    202.0            text2
2  2004      9    1    11.0    21.0    203.0            text3
3  2016      1    1     NaN     NaN      NaN  unmatching_date

UPDATE: the usual idiomatic pd.concat(list_of_dfs) solution won't work here, because it joins by index:

In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
   Direct  Direct  Direct2            File3  day  month  year
0     NaN    11.0      NaN              NaN    1      1  1992
1     NaN    11.0      NaN              NaN   30      5  2013
2     NaN    11.0      NaN              NaN    1      9  2004
3    21.0     NaN    201.0              NaN    1      1  1992
4    21.0     NaN    202.0              NaN   30      5  2013
5    21.0     NaN    203.0              NaN    1      9  2004
6     NaN     NaN      NaN            text1    1      1  1992
7     NaN     NaN      NaN            text2   30      5  2013
8     NaN     NaN      NaN            text3    1      9  2004
9     NaN     NaN      NaN  unmatching_date    1      1  2016

In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
        0    1     2     3       4    5     6     7      8     9  10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

Or using index_col=None explicitly:

In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
   Direct  Direct  Direct2            File3  day  month  year
0     NaN    11.0      NaN              NaN    1      1  1992
1     NaN    11.0      NaN              NaN   30      5  2013
2     NaN    11.0      NaN              NaN    1      9  2004
3    21.0     NaN    201.0              NaN    1      1  1992
4    21.0     NaN    202.0              NaN   30      5  2013
5    21.0     NaN    203.0              NaN    1      9  2004
6     NaN     NaN      NaN            text1    1      1  1992
7     NaN     NaN      NaN            text2   30      5  2013
8     NaN     NaN      NaN            text3    1      9  2004
9     NaN     NaN      NaN  unmatching_date    1      1  2016

In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
        0    1     2     3       4    5     6     7      8     9  10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date

The following more idiomatic solution works, but it changes the original order of the rows and of the columns/data:

In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
     ...:
     ...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
     ...:
     ...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
     ...:
Out[224]:
   month  day  year  Direct  Direct  Direct2            File3
0      1    1  1992    11.0    21.0    201.0            text1
1      1    1  2016     NaN     NaN      NaN  unmatching_date
2      5   30  2013    11.0    21.0    202.0            text2
3      9    1  2004    11.0    21.0    203.0            text3

Source

2016-11-08 20:20:59 MaxU

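Using merge like this here is non-idiomatic and non-performant; appending to a list and then concat is the pattern – Jeff

@Jeff, how would I merge on the common columns using 'concat'? – MaxU

Try it; by definition it will union on the non-concat axis; joining is a different operation – Jeff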
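To make the last comment concrete, here is a small sketch of the list-then-concat pattern. The two frames are invented for illustration and are not the test files above. With the default join='outer', the columns (the non-concat axis when stacking rows) become the union of all headers; join='inner' keeps only the columns every frame shares.

```python
import pandas as pd

# Hypothetical frames with partly different columns
df1 = pd.DataFrame({"year": [1992, 2013], "Direct": [11.0, 11.0]})
df2 = pd.DataFrame({"year": [1992, 2013], "Direct2": [201.0, 202.0]})

# Collect the frames in a list and concatenate once, instead of merging in a loop
frames = [df1, df2]

# Default join='outer': union of columns, NaN where a frame lacks a column
stacked = pd.concat(frames, ignore_index=True)

# join='inner': only the columns present in every frame
shared = pd.concat(frames, ignore_index=True, join="inner")

print(stacked)
print(shared)
```

Neither call matches rows on the values of year; that key-based alignment is what merge (a join) does, which is the distinction the comment draws.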
