2016-03-01

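Reading specific date lines from files with pandas (Python)

I am trying to read many files. Each file holds one day of data recorded every 10 minutes. The data in each file is "chunked" like this: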

2015-11-08 00:10:00 00:10:00 
# z speed dir  W sigW  bck error 
30 3.32 111.9 0.15 0.12 1.50E+05  0 
40 3.85 108.2 0.07 0.14 7.75E+04  0 
50 4.20 107.9 0.06 0.15 4.73E+04  0 
60 4.16 108.5 0.03 0.19 2.73E+04  0 
70 4.06 93.6 0.03 0.23 9.07E+04  0 
80 4.06 93.8 0.07 0.28 1.36E+05  0 

2015-11-08 00:20:00 00:10:00 
# z speed dir  W sigW  bck error 
30 3.79 120.9 0.15 0.11 7.79E+05  0 
40 4.36 115.6 0.04 0.13 2.42E+05  0 
50 4.71 113.6 0.07 0.14 6.84E+04  0 
60 5.00 113.3 0.13 0.17 1.16E+04  0 
70 4.29 94.2 0.22 0.20 1.38E+05  0 
80 4.54 94.1 0.11 0.25 1.76E+05  0 

2015-11-08 00:30:00 00:10:00 
# z speed dir  W sigW  bck error 
30 3.86 113.6 0.13 0.10 2.68E+05  0 
40 4.34 116.1 0.09 0.11 1.41E+05  0 
50 5.02 112.8 0.04 0.12 7.28E+04  0 
60 5.36 110.5 0.01 0.14 5.81E+04  0 
70 4.67 95.4 0.14 0.16 7.69E+04  0 
80 4.56 95.0 0.15 0.21 9.84E+04  0 

... 

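The file continues like this, every 10 minutes for the whole day. The file name is 151108.mnd. I want my code to read all the November files, i.e. 1511??.mnd, and to read every date-time line in each day's file for the whole month. So for the sample shown above, I want my code to grab 2015-11-08 00:10:00, 2015-11-08 00:20:00, 2015-11-08 00:30:00, etc. and store them in a variable, then move on to the next day's file (151109.mnd), grab all of its date-time lines, and append them to the dates already stored. And so on for the whole month. Here is my code so far: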

import pandas as pd 
import glob 
import datetime 

filename = glob.glob('1511??.mnd') 
data_nov15_hereford = pd.DataFrame() 
frames = [] 
dates = [] 
counter = 1 
for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows = 32)
    for line in f_nov15_hereford:
        if line.startswith("20"):
            print line
            date_object = datetime.datetime.strptime(line[:-6], '%Y-%m-%d %H:%M:%S %f')
            dates.append(date_object)
            counter = 0
        else:
            counter += 1
    frames.append(f_nov15_hereford)
data_nov15_hereford = pd.concat(frames,ignore_index=True) 
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True) 


print dates 

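This code has some problems. When I print dates, it prints two copies of every date, and it only prints the first date from each file: 2015-11-08 00:10:00, 2015-11-09 00:10:00, and so on. It does not go line by line through each file and then move on to the next file once all the dates in that file are stored, which is what I want. Instead it just grabs the first date in each file. Any help with this code? Is there an easier way to do what I want? Thanks!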

Answers

1

A few observations:

First: why you only get the first date in each file:

f_nov15_hereford = pd.read_csv(i, skiprows = 32) 
for line in f_nov15_hereford: 
    if line.startswith("20"): 

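The first line reads the file into a pandas DataFrame. The second line iterates over the DataFrame's columns, not its rows. As a result, the last line checks whether a column label starts with "20". Because skiprows = 32 turns the first date line into the header, that happens only once per file.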
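To illustrate the point, here is a minimal sketch on a toy DataFrame. The frame is invented for the example (its values are borrowed from the sample data above); it only shows what a plain for loop over a DataFrame yields, and how rows can be visited instead.

```python
import pandas as pd

# Toy frame for illustration only; not the asker's real data layout.
df = pd.DataFrame({"speed": [3.32, 3.85], "dir": [111.9, 108.2]})

# Iterating over a DataFrame yields its column labels, not its rows.
labels = [item for item in df]

# To visit rows, use itertuples() (iterrows() also works).
rows = [(row.speed, row.dir) for row in df.itertuples(index=False)]

print(labels)
print(rows)
```

So a check such as line.startswith("20") inside that loop is applied to column labels only.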

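Second: counter is initialized and its value is changed, but it is never used. I assume it was meant for skipping lines in the file.

Third: it may be simpler to collect all the dates into a Python list and then convert that to a pandas DataFrame if needed.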

import pandas as pd
import glob
import datetime as dt

# number of lines to skip before the first date
offset = 32

# number of lines from one date to the next
recordlength = 9

pattern = '1511??.mnd'

dates = []

for filename in glob.iglob(pattern):

    with open(filename) as datafile:

        # count reaches 0 on the first date line, then wraps once per record
        count = -offset
        for line in datafile:
            if count == 0:
                # the first 19 characters hold 'YYYY-MM-DD HH:MM:SS'
                fmt = '%Y-%m-%d %H:%M:%S'
                date_object = dt.datetime.strptime(line[:19], fmt)
                dates.append(date_object)

            count += 1

            if count == recordlength:
                count = 0

data_nov15_hereford = pd.DataFrame(dates, columns=['Dates'])

print(dates)
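If the number of lines per record is not constant, counting lines can drift out of step. As an alternative sketch (not part of the original answer), the date lines can be recognised by their shape instead. The helper name extract_dates and the regular expression are assumptions made for illustration, based only on the sample shown in the question.

```python
import datetime as dt
import re

# Matches a line that begins with 'YYYY-MM-DD HH:MM:SS' (assumed from the sample).
DATE_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def extract_dates(lines):
    """Return a datetime for every line that starts with a timestamp."""
    found = []
    for line in lines:
        match = DATE_LINE.match(line)
        if match:
            found.append(dt.datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S"))
    return found


# A few lines shaped like the question's sample, plus one made-up preamble line.
sample = [
    "some preamble line\n",
    "2015-11-08 00:10:00 00:10:00 \n",
    "# z speed dir  W sigW  bck error \n",
    "30 3.32 111.9 0.15 0.12 1.50E+05  0 \n",
    "\n",
    "2015-11-08 00:20:00 00:10:00 \n",
]
dates = extract_dates(sample)
print(dates)
```

For the real files, the same helper could be called on each open file object in turn, extending one list, for example by looping over sorted(glob.glob('1511??.mnd')).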

This seems to work well! My only complaint is that when I print dates it still gives me 2 sets. Or if I print np.shape(dates) I get two shapes: (2046L,) (2046L,) – HM14

Never mind, I think this is a problem with my notebook, not the code! Thank you very much! – HM14

1

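Consider modifying the CSV data line by line before reading it into a DataFrame. The script below opens each original file in the glob list and writes a temporary file in which the date is moved to the last column and the repeated header lines and blank lines are removed.

CSV data (assuming the text view of the CSV file looks like the following; if the actual file differs, adjust the Python code)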

2015-11-0800:10:0000:10:00,,,,,, 
z,speed,dir,W,sigW,bck,error 
30,3.32,111.9,0.15,0.12,1.50E+05,0 
40,3.85,108.2,0.07,0.14,7.75E+04,0 
50,4.2,107.9,0.06,0.15,4.73E+04,0 
60,4.16,108.5,0.03,0.19,2.73E+04,0 
70,4.06,93.6,0.03,0.23,9.07E+04,0 
80,4.06,93.8,0.07,0.28,1.36E+05,0 
,,,,,, 
2015-11-0800:10:0000:20:00,,,,,, 
z,speed,dir,W,sigW,bck,error 
30,3.79,120.9,0.15,0.11,7.79E+05,0 
40,4.36,115.6,0.04,0.13,2.42E+05,0 
50,4.71,113.6,0.07,0.14,6.84E+04,0 
60,5,113.3,0.13,0.17,1.16E+04,0 
70,4.29,94.2,0.22,0.2,1.38E+05,0 
80,4.54,94.1,0.11,0.25,1.76E+05,0 
,,,,,, 
2015-11-0800:10:0000:30:00,,,,,, 
z,speed,dir,W,sigW,bck,error 
30,3.86,113.6,0.13,0.1,2.68E+05,0 
40,4.34,116.1,0.09,0.11,1.41E+05,0 
50,5.02,112.8,0.04,0.12,7.28E+04,0 
60,5.36,110.5,0.01,0.14,5.81E+04,0 
70,4.67,95.4,0.14,0.16,7.69E+04,0 
80,4.56,95,0.15,0.21,9.84E+04,0 

Python script

import glob, os
import pandas as pd

filenames = glob.glob('1511??.mnd')
temp = 'temp.csv'

# COLUMNS OF THE FINAL DATAFRAME
columns = ['z', 'speed', 'dir', 'W', 'sigW', 'bck', 'error', 'date']

# ONE DATAFRAME PER FILE, COMBINED AFTER THE LOOP
frames = []

# ITERATE THROUGH EACH FILE IN GLOB LIST
for file in filenames:
    # DELETE PRIOR TEMP VERSION
    if os.path.exists(temp): os.remove(temp)

    header = 0
    # READ IN ORIGINAL CSV
    with open(file, 'r') as txt1:
        for rline in txt1:
            # SAVE DATE VALUE THEN SKIP ROW
            if "2015-11" in rline: date = rline.replace(',',''); continue

            # SKIP BLANK LINES (CHANGE IF NO COMMAS)
            if rline == ',,,,,,\n': continue

            # ADD NEW 'DATE' COLUMN AND SKIP OTHER HEADER LINES
            if 'z,speed,dir,W,sigW,bck,error' in rline:
                if header == 1: continue
                rline = rline.replace('\n', ',date\n')
                with open(temp, 'a') as txt2:
                    txt2.write(rline)
                continue
            header = 1

            # APPEND LINE TO TEMP CSV WITH DATE VALUE
            # (date STILL ENDS WITH '\n', SO THE ROW STAYS TERMINATED)
            with open(temp, 'a') as txt2:
                txt2.write(rline.replace('\n', ','+date))

    # READ TEMP FILE INTO A DATA FRAME
    frames.append(pd.read_csv(temp))

# COMBINE ALL FILES (EMPTY DATAFRAME IF NO FILE MATCHED)
data_nov15_hereford = pd.concat(frames) if frames else pd.DataFrame(columns=columns)

Output

 z speed dir  W sigW  bck error      date 
0 30 3.32 111.9 0.15 0.12 150000  0 2015-11-0800:10:0000:10:00 
1 40 3.85 108.2 0.07 0.14 77500  0 2015-11-0800:10:0000:10:00 
2 50 4.20 107.9 0.06 0.15 47300  0 2015-11-0800:10:0000:10:00 
3 60 4.16 108.5 0.03 0.19 27300  0 2015-11-0800:10:0000:10:00 
4 70 4.06 93.6 0.03 0.23 90700  0 2015-11-0800:10:0000:10:00 
5 80 4.06 93.8 0.07 0.28 136000  0 2015-11-0800:10:0000:10:00 
6 30 3.79 120.9 0.15 0.11 779000  0 2015-11-0800:10:0000:20:00 
7 40 4.36 115.6 0.04 0.13 242000  0 2015-11-0800:10:0000:20:00 
8 50 4.71 113.6 0.07 0.14 68400  0 2015-11-0800:10:0000:20:00 
9 60 5.00 113.3 0.13 0.17 11600  0 2015-11-0800:10:0000:20:00 
10 70 4.29 94.2 0.22 0.20 138000  0 2015-11-0800:10:0000:20:00 
11 80 4.54 94.1 0.11 0.25 176000  0 2015-11-0800:10:0000:20:00 
12 30 3.86 113.6 0.13 0.10 268000  0 2015-11-0800:10:0000:30:00 
13 40 4.34 116.1 0.09 0.11 141000  0 2015-11-0800:10:0000:30:00 
14 50 5.02 112.8 0.04 0.12 72800  0 2015-11-0800:10:0000:30:00 
15 60 5.36 110.5 0.01 0.14 58100  0 2015-11-0800:10:0000:30:00 
16 70 4.67 95.4 0.14 0.16 76900  0 2015-11-0800:10:0000:30:00 
17 80 4.56 95.0 0.15 0.21 98400  0 2015-11-0800:10:0000:30:00 
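The same idea can also be sketched without a temporary file, by tagging each data row with the most recent timestamp while reading. This is only a sketch under assumptions: it targets the whitespace-separated layout shown in the question (not the comma-separated view assumed above), and the names parse_records, COLUMNS and DATE_LINE are hypothetical.

```python
import io
import re

import pandas as pd

# Column names taken from the '# z speed dir ...' header in the question.
COLUMNS = ["z", "speed", "dir", "W", "sigW", "bck", "error"]
# Assumed shape of a date line: it starts with 'YYYY-MM-DD HH:MM:SS'.
DATE_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def parse_records(lines):
    """Build one DataFrame from timestamped blocks of whitespace-separated rows."""
    rows = []
    current = None
    for line in lines:
        text = line.strip()
        match = DATE_LINE.match(text)
        if match:
            # Remember the timestamp for the rows that follow.
            current = pd.Timestamp(match.group(0))
        elif not text or text.startswith("#") or current is None:
            # Blank line, header line, or anything before the first date.
            continue
        else:
            fields = text.split()
            if len(fields) == len(COLUMNS):
                rows.append([float(value) for value in fields] + [current])
    return pd.DataFrame(rows, columns=COLUMNS + ["date"])


# Shortened copy of the sample from the question.
sample = io.StringIO(
    "2015-11-08 00:10:00 00:10:00 \n"
    "# z speed dir  W sigW  bck error \n"
    "30 3.32 111.9 0.15 0.12 1.50E+05  0 \n"
    "40 3.85 108.2 0.07 0.14 7.75E+04  0 \n"
    "\n"
    "2015-11-08 00:20:00 00:10:00 \n"
    "# z speed dir  W sigW  bck error \n"
    "30 3.79 120.9 0.15 0.11 7.79E+05  0 \n"
)
df = parse_records(sample)
print(df)
```

Each file could be passed through such a function and the resulting frames combined with pd.concat, which avoids writing and deleting temp.csv on every iteration.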

This is very helpful! Thank you! – HM14
