2014-10-08 19 views
2

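There appears to be a bug in resample() for 1Min bar data when the sampling frequency is any multiple of 8. The code below illustrates the problem when resampling at [3, 5, 6, 8, 16] Min. For frequencies 3 and 5, the first entry of the resampled DataFrame index starts at the base timestamp (9:30 in this example), whereas for frequencies 8 and 16 the resampled index starts at 9:26 and 9:18 respectively. Is this a pandas bug in intraday 8-minute resampling?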

import pandas as pd 
import datetime as dt 
import numpy as np 

datetime_start = dt.datetime(2014, 9, 1, 9, 30) 
datetime_end = dt.datetime(2014, 9, 1, 16, 0) 

tt = pd.date_range(datetime_start, datetime_end, freq='1Min') 
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A']) 

for freq in [3, 5, 6, 8, 16]: 
    print(freq) 
    # base=30 is intended to align the bins with the 09:30 start 
    print(df.resample(str(freq) + 'Min', how='first', base=30).head(2)) 

This produces the following output:

3 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:33:00 3 
5 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:35:00 5 
6 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:36:00 6 
8 
        A 
2014-09-01 09:26:00 0 
2014-09-01 09:34:00 4 
16 
        A 
2014-09-01 09:18:00 0 
2014-09-01 09:34:00 4 
+1

This is a minor bug, fixed in 0.15.0: https://github.com/pydata/pandas/issues/8371. The 0.15.0 release candidate can be found here: http://pandas.pydata.org/ – Jeff 2014-10-08 20:21:20

+0

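I don't think this is the same bug. For non-prime frequencies (6Min and above), the code works fine. For the prime frequencies 3 and 5 it also works fine. For 8 and 16 it does not. – 2014-10-08 20:29:31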

+0

For reference: https://github.com/pydata/pandas/issues/8521 – 2014-10-09 16:59:35

Answers

0

I think resampling is anchored at 00:00:00, so shift the index so that it starts at 00:00 and then resample.
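The arithmetic behind that guess can be sketched in plain Python. This is only an illustration: `first_bin_start` is a hypothetical helper, and the rule it encodes (bins anchored at midnight plus `base % freq` minutes) is an assumption that happens to match the output in the question, not something taken from the pandas source.

```python
import datetime as dt

def first_bin_start(first_ts, freq_min, base_min):
    """Hypothetical helper: left edge of the bin that contains first_ts,
    assuming bins are anchored at midnight + (base_min % freq_min) minutes."""
    midnight = first_ts.replace(hour=0, minute=0, second=0, microsecond=0)
    minutes = int((first_ts - midnight).total_seconds() // 60)
    shift = base_min % freq_min
    # Floor to the nearest bin edge at or before first_ts
    start = (minutes - shift) // freq_min * freq_min + shift
    return midnight + dt.timedelta(minutes=start)

first_ts = dt.datetime(2014, 9, 1, 9, 30)
for freq in [3, 5, 6, 8, 16]:
    # Compare base=30 (as in the question) with base=9*60+30 (minutes since midnight)
    print(freq,
          first_bin_start(first_ts, freq, 30).time(),
          first_bin_start(first_ts, freq, 9 * 60 + 30).time())
```

Under this assumption, base=30 only lands on 09:30 when the frequency divides 540 minutes (09:00), which holds for 3, 5 and 6 but not for 8 or 16. A base of 570 minutes lands on 09:30 for every frequency.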

Method 1: shift the index to midnight, resample, then shift back.

import pandas as pd 
import datetime as dt 
import numpy as np 

datetime_start = dt.datetime(2014, 9, 1, 9, 30) 
datetime_end = dt.datetime(2014, 9, 1, 16, 30) 

tt = pd.date_range(datetime_start, datetime_end, freq='1Min') 
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A']) 

offsets = pd.offsets.Hour(9) + pd.offsets.Minute(30) 
for freq in [1, 3, 5, 6, 8, 16]: 
    print(freq) 
    # Work on a copy so every frequency resamples the original 1-minute data 
    shifted = df.copy() 
    shifted.index = shifted.index - offsets      # move 09:30 to 00:00 
    resampled = shifted.resample(str(freq) + 'min').agg({'A': 'first'}) 
    resampled.index = resampled.index + offsets  # move back to 09:30 
    print(resampled.head(2)) 

Method 2: use a base equal to the index offset.

import pandas as pd 
import datetime as dt 
import numpy as np 

datetime_start = dt.datetime(2014, 9, 1, 9, 30) 
datetime_end = dt.datetime(2014, 9, 1, 16, 30) 

tt = pd.date_range(datetime_start, datetime_end, freq='1Min') 
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=['A']) 

for freq in [1, 3, 5, 6, 8, 16]: 
    print(freq) 
    # base is in minutes since midnight: 9*60+30 = 570 aligns the bins with 09:30. 
    # Note: the base argument was removed in pandas 2.0. 
    # Assign to a new name so df is not overwritten between iterations. 
    resampled = df.resample(str(freq) + 'min', base=9*60+30).agg({'A': 'first'}) 
    print(resampled.head(2)) 

Both methods then output:

1 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:31:00 1 
3 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:33:00 3 
5 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:35:00 5 
6 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:36:00 6 
8 
        A 
2014-09-01 09:30:00 0 
2014-09-01 09:38:00 8 
16 
         A 
2014-09-01 09:30:00 0 
2014-09-01 09:46:00 16
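
On newer pandas versions (1.1 and later), the same alignment can probably be had without `base`, which was deprecated there in favour of the `origin` and `offset` arguments of resample(). The sketch below is not part of the original answer. It assumes a pandas version that supports `origin="start"`, and the `results` dictionary is just an illustrative name.

```python
import numpy as np
import pandas as pd

tt = pd.date_range("2014-09-01 09:30", "2014-09-01 16:30", freq="1min")
df = pd.DataFrame(np.arange(len(tt)), index=tt, columns=["A"])

results = {}
for freq in [1, 3, 5, 6, 8, 16]:
    # origin="start" anchors the bins at the first timestamp of the index (09:30)
    results[freq] = df.resample(f"{freq}min", origin="start").first()
    print(freq)
    print(results[freq].head(2))

# Alternative: keep the default midnight origin and shift the bins by 9h30min
alt = df.resample("8min", offset="9h30min").first()
print(alt.head(2))
```

Each frequency is resampled from the original 1-minute frame, so the second row of every result holds the value found `freq` minutes after 09:30.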