2016-04-24 72 views
2

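Pandas: merge multiple CSV (key, value) files horizontally and name the `value` column in the resulting DF using the file name

I have 16 different CSV files in a directory, and I am trying to load them into a single pandas DataFrame. Each file has a datetime column and a float64 column. None of the CSV files has a column header.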

location = os.path.join(base_dir, "DirectoryName") 
symbols = os.listdir(location) 
df = pd.DataFrame(index=dates) 
for symbol in symbols: 
    location = os.path.join(base_dir, "DirectoryName", symbol) 
    df_temp = pd.read_csv(location, index_col=0, parse_dates=True, dayfirst=True, na_values=['nan']) 
    df_temp.dropna() 
    df_temp.index = df_temp.index.normalize() 
    df_temp = normalize_data(df_temp) 
    df = df.join(df_temp) 

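My problem now is that the final DataFrame df has datetime as its index, but the values of the first data row of each file show up as the column names, and the first row is filled with NaN.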

Here is a snapshot (image not included); notice the row values for 2015-04-02.

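I could drop the first row of df, but that does not help much with further operations, because some data would be lost. I cannot rename the column headers either, because they differ for each file and I only know how to change them statically.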


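If your columns are "different for each file", how are you going to merge/join all the CSV files into a single DF? Do you want to merge them horizontally? – MaxU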


It would be much easier to help you if you posted a link to one or two of the CSV files, or a small data sample in text form here... – MaxU


All the files have the first column in common, which I use as the index. Here is the link to the files @MaxU https://drive.google.com/folderview?id=0B2I8HUL0xRSWVlZNb1hHckRwRVE&usp=sharing – harindersingh

Answers

2

I have downloaded only the following files:

['hash_rate.csv', 
'difficulty.csv', 
'cost_per_tx.csv', 
'block_size.csv', 
'avg_block_size.csv'] 

That is why the resulting DF below contains only the corresponding part of your data.

Please see the comments in the code.

Code:

import os
import glob
import pandas as pd

def read_files(filelist):
    # `dfs` will hold one DataFrame per file;
    # they are concatenated at the end
    dfs = []
    for fn in filelist:
        # derive the column name from the file name (without extension)
        col = os.path.splitext(os.path.split(fn)[-1])[0]
        # read one headerless CSV into a temporary DF
        # and append that DF to the `dfs` list
        dfs.append(pd.read_csv(
            fn,
            parse_dates=[0],
            header=None,
            index_col='date',
            names=['date', col]
        ))
    # return the DFs concatenated horizontally (axis=1)
    return pd.concat(dfs, axis=1)

def main(): 
    data_files_mask = r'D:\temp\.data\36827502\*.csv' 
    df = read_files(glob.glob(data_files_mask)) 
    print(df) 

if __name__ == '__main__': 
    main() 

Output:

                     block_size     hash_rate  avg_block_size  cost_per_tx  \
date
2015-01-05 18:15:05     34469.0  3.479099e+08        0.375637     8.185000
2015-01-06 18:15:05     36219.0  3.323940e+08        0.477130     6.598278
2015-01-07 18:15:05     38212.0  3.560892e+08        0.624724     6.232809
2015-01-08 18:15:05     40943.0  4.261981e+08        0.754424     7.113695
2015-01-09 18:15:05     43021.0  4.099610e+08        0.515467     6.199964
2015-01-10 18:15:05     45487.0  4.655484e+08        0.451940     6.821970
2015-01-11 18:15:05     47963.0  4.920513e+08        0.535354     7.958116
2015-01-12 18:15:05     50594.0  6.940933e+08        0.536199     9.415383
2015-02-04 18:15:05     32832.0  3.413843e+08        0.421406     8.054181
2015-02-05 18:15:05     34523.0  3.479099e+08        0.373642     8.958115

                       difficulty
date
2015-01-05 18:15:05  4.761056e+10
2015-01-06 18:15:05  4.880749e+10
2015-01-07 18:15:05  4.940201e+10
2015-01-08 18:15:05  5.227830e+10
2015-01-09 18:15:05  5.425663e+10
2015-01-10 18:15:05  6.081322e+10
2015-01-11 18:15:05  6.225398e+10
2015-01-12 18:15:05  7.272278e+10
2015-02-04 18:15:05  4.671755e+10
2015-02-05 18:15:05  4.761056e+10
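
If the files do not all cover exactly the same timestamps, it may help to know how a horizontal pd.concat aligns rows. The sketch below uses two small made-up frames (the values and dates are hypothetical, not taken from the files above) to show the default behaviour, which keeps the union of index values and fills gaps with NaN, next to join="inner", which keeps only the timestamps present in every frame.

```python
import pandas as pd

# Hypothetical frames standing in for two parsed CSV files.
a = pd.DataFrame({"hash_rate": [1.0, 2.0]},
                 index=pd.to_datetime(["2015-01-05", "2015-01-06"]))
b = pd.DataFrame({"difficulty": [10.0, 20.0]},
                 index=pd.to_datetime(["2015-01-06", "2015-01-07"]))

# Default (outer) alignment: every date from either frame, gaps become NaN.
outer = pd.concat([a, b], axis=1)

# Inner alignment: only the dates that appear in all frames.
inner = pd.concat([a, b], axis=1, join="inner")

print(outer)
print(inner)
```

Which of the two is preferable depends on whether rows with missing values are acceptable in the later analysis.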
2

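Consider defining the columns explicitly with read_csv's names argument, using the file name itself, symbol, inside the loop (with the .csv extension stripped, of course):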

for symbol in symbols:
    ...
    df_temp = pd.read_csv(location,
                          index_col=0,
                          parse_dates=True,
                          dayfirst=True,
                          na_values=['nan'],
                          header=None,
                          names=['date', symbol.replace('.csv', '')])
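
As a minimal illustration of why header=None and names matter here, the sketch below reads the same headerless text twice. The two-row sample and the file name "hash_rate.csv" are assumptions made up for the example, and io.StringIO stands in for a file on disk. With the default header handling, read_csv consumes the first data row as column labels, which matches the symptom described in the question.

```python
import io
import pandas as pd

# Hypothetical two-row file without a header line (not the asker's real data).
raw = "02/04/2015,1.5\n03/04/2015,2.5\n"

# Default header handling: the first data row is consumed as column labels.
inferred = pd.read_csv(io.StringIO(raw), index_col=0)

# header=None plus names keeps every row as data and labels the value column.
symbol = "hash_rate.csv"  # assumed file name, for illustration only
named = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["date", symbol.replace(".csv", "")],
    index_col=0,
    parse_dates=True,
    dayfirst=True,
)

print(inferred)  # one data row; its column label is the string '1.5'
print(named)     # two data rows; the value column is labelled 'hash_rate'
```

Note that symbol.replace('.csv', '') removes the substring wherever it occurs in the name; os.path.splitext(symbol)[0], as used in the other answer, strips only the final extension.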