2017-01-03 68 views
1

Pandas: reading CSV files with a regular expression

I have a folder trip_data containing many CSV files named by date. It looks like this:

trip_data/ 
├── df_trip_20140803_1.csv 
├── df_trip_20140803_2.csv 
├── df_trip_20140803_3.csv 
├── df_trip_20140803_4.csv 
├── df_trip_20140803_5.csv 
├── df_trip_20140803_6.csv 
├── df_trip_20140804_1.csv 
├── df_trip_20140804_2.csv 
├── df_trip_20140804_3.csv 
├── df_trip_20140804_4.csv 
├── df_trip_20140804_5.csv 
├── df_trip_20140804_6.csv 
├── df_trip_20140805_1.csv 
├── df_trip_20140805_2.csv 
├── df_trip_20140805_3.csv 
├── df_trip_20140805_4.csv 
├── df_trip_20140805_5.csv 
├── df_trip_20140805_6.csv 
├── df_trip_20140806_1.csv 
├── df_trip_20140806_2.csv 
├── df_trip_20140806_3.csv 
├── df_trip_20140806_4.csv 

Now I want to load all of these files by date with Python pandas, that is, into 4 DataFrames: df_trip_20140803, df_trip_20140804, df_trip_20140805, df_trip_20140806.

My code looks like this:

days = [20140803,20140804,20140805,20140806] 

for day in days: 
    ## Locate to the path 
    path ='./trip_data/df_trip_%d*.csv' % day 
    df = pd.read_csv(path, header=None, nrows=10, 
         names=['ID','lat','lon','status','timestamp']) 

This does not give the correct result. How can I do this?

Answer

2

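I would collect all of these CSVs into a dictionary of DataFrames with the following structure:

dfs['20140803'] - a DataFrame containing the concatenated data from all df_trip_20140803_*.csv files.

Solution: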

import os 
import re 
import glob 
import pandas as pd 

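# Pattern with two placeholders: the date and the per-day file number.
fpattern = r'D:\temp\.data\41444939\df_trip_{}_{}.csv'
files = glob.glob(fpattern.format('*', '*'))

# Extract the unique 8-digit dates from the file names.
dates = sorted({re.split(r'_(\d{8})_(\d+)\.(\w+)', f)[1] for f in files})

# For each date, concatenate all of that day's files into one DataFrame.
dfs = {}
for d in dates:
    day_files = sorted(glob.glob(fpattern.format(d, '*')))
    dfs[d] = pd.concat((pd.read_csv(f) for f in day_files), ignore_index=True)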
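As a variation on the approach above, here is a minimal sketch that walks the folder once and groups the files by the date captured from each name, instead of calling glob again for every date. The function name load_trips_by_date, the NAME_RE pattern and the COLUMNS default are illustrative assumptions; the column names are copied from the question's read_csv call, and the files are assumed to have no header row, as header=None there suggests.

```python
import glob
import os
import re
from collections import defaultdict

import pandas as pd

# Assumed column names, taken from the question's read_csv call.
COLUMNS = ['ID', 'lat', 'lon', 'status', 'timestamp']

# Captures the 8-digit date and the file number, e.g. df_trip_20140803_1.csv
NAME_RE = re.compile(r'df_trip_(\d{8})_(\d+)\.csv$')


def load_trips_by_date(folder, columns=COLUMNS):
    """Return {date string: DataFrame} for all df_trip_<date>_<n>.csv in folder."""
    groups = defaultdict(list)
    for path in glob.glob(os.path.join(folder, 'df_trip_*_*.csv')):
        m = NAME_RE.search(os.path.basename(path))
        if m:  # skip names that match the glob but not the stricter regex
            groups[m.group(1)].append((int(m.group(2)), path))

    dfs = {}
    for date, parts in groups.items():
        # Sort by the numeric file number so that _10 comes after _9.
        frames = [pd.read_csv(p, header=None, names=columns)
                  for _, p in sorted(parts)]
        dfs[date] = pd.concat(frames, ignore_index=True)
    return dfs
```

With the layout from the question, load_trips_by_date('./trip_data') should return a dictionary with one entry per day, for example under the key '20140803'. If the folder contains no matching files, the result is an empty dictionary.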

Test:

In [95]: dfs.keys() 
Out[95]: dict_keys(['20140804', '20140805', '20140803', '20140806']) 

In [96]: dfs['20140803'] 
Out[96]: 
    a  b  c
0   0  0  7
1   3  7  1
2   9  7  3
3   7  4  7
4   5  2  4
5   0  0  4
6   7  2  2
7   8  4  1
8   0  8  3
9   3  9  0
10  7  3  9
11  1  9  8
12  6  7  2
13  3  8  1
14  3  4  5
15  0  9  2
16  5  8  7
17  8  5  4
18  2  0  2
19  9  6  6
20  6  6  6
21  2  6  9
22  1  0  8
23  3  1  1
24  7  4  2
25  7  4  2
26  8  3  7
27  7  3  2
28  1  7  7
29  3  6  5
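
If a single object is more convenient than a dictionary, the per-day frames can also be stacked into one DataFrame indexed by date. The sketch below uses a small hypothetical stand-in for dfs rather than the real data; the level names 'date' and 'row' are arbitrary choices.

```python
import pandas as pd

# Hypothetical stand-in for the `dfs` dictionary built in the solution.
dfs = {
    '20140803': pd.DataFrame({'a': [0, 3], 'b': [0, 7]}),
    '20140804': pd.DataFrame({'a': [9], 'b': [7]}),
}

# Passing a dict to pd.concat uses its keys as the outer index level.
combined = pd.concat(dfs, names=['date', 'row'])

# All rows for one day, selected through the outer index level.
day = combined.loc['20140803']

# Number of rows loaded per day.
counts = combined.groupby(level='date').size()
```

Selecting by the outer level gives back the rows of one day, so nothing is lost compared with the dictionary form.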

Setup:

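import numpy as np

# a.txt lists the CSV file names to create, one per line.
fn = r'D:\temp\.data\41444939\a.txt'
base_dir = r'D:\temp\.data\41444939'
with open(fn) as fh:
    files = fh.read().splitlines()
# Write a 5x3 frame of random digits to each file.
for f in files:
    pd.DataFrame(np.random.randint(0, 10, (5, 3)), columns=list('abc')) \
      .to_csv(os.path.join(base_dir, f), index=False)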
+0

Thanks MaxU, your solution helped me a lot! – jjdblast

+0

@jjdblast, glad I could help :-) – MaxU