2016-03-02 187 views

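Filling missing points in time-series data with pandas.date_range and pandas.reindex [python]

I want to use pandas to fill in the missing points in time-series data read from an ASCII file. Everything else seems to work, but the first row comes out full of NaN even though the original file has data for it. A sample of my data: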

"2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7 
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4 
            . 
            . 


"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1 

I use the following code:

t1 = np.genfromtxt(INPUT,dtype=None,delimiter=',',usecols=[0]) 
start = t1[0].strip('\'"') 
end = t1[-1].strip('\'"') 
data=pd.read_csv(INPUT,sep=',',index_col=[0],parse_dates=[0]) 
index = pd.date_range(start,end,freq="30S") 
df = data 
sk_f = df.reindex(index) 
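With this code I read the first and last strings of the first column and use them to build a complete index, so that any missing points show up as NaN. The problem is that the first row is also filled with NaN, as in this result: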

2011-08-26 00:00:00,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan 

2011-08-26 00:00:30,1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4 
            . 
            . 


2011-08-26 23:59:30,1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1 

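This means the first row is unexpectedly filled with NaN even though the original file contains data for it. From the second row onward everything is fine, and the filling of the genuinely missing points also looks correct. I have tried to work out why this happens but, honestly, I have not found the cause yet. Any ideas or help would be greatly appreciated. Thanks, Isaac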

Answer


I think you can skip reading the file with genfromtxt and use only read_csv, then take the min and max dates from the index and reindex with them.

Or use resample:

import pandas as pd 
import numpy as np 
import io 

temp=u""""2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7 
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4 
"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1""" 

# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=",", index_col=[0], parse_dates=[0], header=None)
print(df)
                           1       2      3      4      5      6         7  \
0
2011-08-26 00:00:00  1155179   3.232  23.70  3.281  0.386  25.27  111.5665
2011-08-26 00:00:30  1155180   3.289  20.44  2.153  0.222  25.25  111.5735
2011-08-26 23:59:30  1155297  12.620  28.06  3.162  1.356  24.30  111.4614

                         8      9     10  11     12     13     14     15
0
2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7
2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4
2011-08-26 23:59:30  28.65  29.84  19.53   0  111.4  13.06  29.50  350.1
start = df.index.min() 
end = df.index.max() 
print(start)
2011-08-26 00:00:00
print(end)
2011-08-26 23:59:30

index = pd.date_range(start,end,freq="30S") 
sk_f = df.reindex(index) 
print(sk_f.head())
                           1       2      3      4      5      6         7  \
2011-08-26 00:00:00  1155179   3.232  23.70  3.281  0.386  25.27  111.5665
2011-08-26 00:00:30  1155180   3.289  20.44  2.153  0.222  25.25  111.5735
2011-08-26 00:01:00      NaN     NaN    NaN    NaN    NaN    NaN       NaN
2011-08-26 00:01:30      NaN     NaN    NaN    NaN    NaN    NaN       NaN
2011-08-26 00:02:00      NaN     NaN    NaN    NaN    NaN    NaN       NaN

                         8      9     10   11     12     13     14     15
2011-08-26 00:00:00  28.92  29.83  19.13    0  111.5  13.02  29.77  345.7
2011-08-26 00:00:30  28.94  29.82  19.53    0  111.5  13.02  29.79  342.4
2011-08-26 00:01:00    NaN    NaN    NaN  NaN    NaN    NaN    NaN    NaN
2011-08-26 00:01:30    NaN    NaN    NaN  NaN    NaN    NaN    NaN    NaN
2011-08-26 00:02:00    NaN    NaN    NaN  NaN    NaN    NaN    NaN    NaN
# resample no longer accepts fill_method in current pandas; chain .ffill() instead
print(df.resample('30S').ffill().head())
                           1      2      3      4      5      6         7  \
0
2011-08-26 00:00:00  1155179  3.232  23.70  3.281  0.386  25.27  111.5665
2011-08-26 00:00:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735
2011-08-26 00:01:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735
2011-08-26 00:01:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735
2011-08-26 00:02:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735

                         8      9     10  11     12     13     14     15
0
2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7
2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4
2011-08-26 00:01:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4
2011-08-26 00:01:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4
2011-08-26 00:02:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4
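
A likely explanation for the NaN first row, assuming the input file has no header line (as the sample in the question suggests): by default read_csv takes the first line as column names, so the 00:00:00 record never becomes a data row and reindex has nothing to put at that timestamp. Passing header=None keeps it as data. The sketch below only illustrates this; the variable names are made up for the example, and the rows are the question's sample rows cut down to the first three value columns.

```python
import io

import pandas as pd

# Sample rows from the question, shortened to three value columns.
# Assumption: the real file, like this text, has no header line.
raw = (
    '"2011-08-26 00:00:00",1155179,3.232,23.7\n'
    '"2011-08-26 00:00:30",1155180,3.289,20.44\n'
    '"2011-08-26 23:59:30",1155297,12.62,28.06\n'
)

# Default header handling: the first line is used as column names,
# so only two data rows end up in the frame.
with_default = pd.read_csv(io.StringIO(raw), index_col=0, parse_dates=[0])

# header=None: every line is treated as data.
with_header_none = pd.read_csv(
    io.StringIO(raw), header=None, index_col=0, parse_dates=[0]
)

# Complete 30-second grid covering the whole day in the sample.
full_index = pd.date_range(
    pd.Timestamp("2011-08-26 00:00:00"),
    pd.Timestamp("2011-08-26 23:59:30"),
    freq="30s",
)

first_row_default = with_default.reindex(full_index).iloc[0]
first_row_fixed = with_header_none.reindex(full_index).iloc[0]

print(len(with_default), len(with_header_none))
print(first_row_default.isna().all(), first_row_fixed.isna().any())
```

If the first check prints a smaller row count for the default call, the header inference is what removed the first record; the data after that point is unaffected, which matches what the question describes.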

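I think the question is unclear about where the data in the first row goes missing: in `read_csv`, or in `reindex`? Please check my solution; if it still does not work, I will try to find the cause. Thanks. – jezrael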


Thank you very much. It works well. Isaac – Isaac
