2015-09-25 74 views
1

Python: extract rows from a pandas DataFrame at fixed time intervals

I have the following DataFrame:

df= 
    Record_ID  Time 
     94704 2014-03-10 07:19:19.647342 
     94705 2014-03-10 07:21:44.479363 
     94706 2014-03-10 07:21:45.479581 
     94707 2014-03-10 07:21:54.481588 
     94708 2014-03-10 07:21:55.481804 
     94709 2014-03-10 07:21:56.482029 
     94710 2014-03-10 07:21:57.482254 
     94711 2014-03-10 07:21:58.482473 
     94712 2014-03-10 07:21:59.482706 
     94713 2014-03-10 07:22:00.482917 
     94714 2014-03-10 07:22:01.483279 
     94715 2014-03-10 07:22:02.483545 
     94716 2014-03-10 07:22:03.383563 
     94717 2014-03-10 07:22:04.383786 
     94718 2014-03-10 07:22:09.485624 
     94719 2014-03-10 07:22:10.385118 
     94720 2014-03-10 07:22:11.485454 
     94721 2014-03-10 07:22:12.485592 
     94722 2014-03-10 07:22:15.486335 
     94723 2014-03-10 07:22:16.486475 
     94724 2014-03-10 07:22:17.487023 
     94725 2014-03-10 07:22:18.387020 
     94726 2014-03-10 07:22:19.387120 
     94727 2014-03-10 07:22:20.387379 
     94728 2014-03-10 07:22:22.387786 
     94729 2014-03-10 07:22:23.488032 
     94730 2014-03-10 07:22:24.388232 
     94731 2014-03-10 07:22:30.489594 

I would like to know how to create a new DataFrame that takes the data every 60 seconds, in order to reduce the size of the table.

+0

What would this new df actually look like? –

+0

It would look like df, but with fewer rows. – emax

+0

You want to resample at minute ('T') frequency, but you need to specify how the resampling is done: 'first', 'last', 'mean', 'sum'... – TomAugspurger

Answers

3

You first need to set the index to the Time column of your DataFrame. Then resample as follows:

# Time must be a datetime column; keep the first record in each 1-minute bin
resampled = df.set_index('Time').resample('1min').first()
>>> resampled 
        Record_ID 
Time       
2014-03-10 07:19:00  94704 
2014-03-10 07:20:00  NaN 
2014-03-10 07:21:00  94705 
2014-03-10 07:22:00  94713 

Note that you get a NaN for 07:20 because there are no records in that period. You can of course drop the NaN rows if needed.

>>> resampled.dropna() 
        Record_ID 
Time       
2014-03-10 07:19:00  94704 
2014-03-10 07:21:00  94705 
2014-03-10 07:22:00  94713 
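
A self-contained sketch of the same resampling idea, assuming a recent pandas in which the aggregation is called as a method on the resampler. The DataFrame is rebuilt from a handful of rows of the question's sample, and the names `resampled` and `result` are only illustrative.

```python
import pandas as pd

# A few rows copied from the question's sample data.
df = pd.DataFrame({
    "Record_ID": [94704, 94705, 94706, 94713, 94714, 94731],
    "Time": pd.to_datetime([
        "2014-03-10 07:19:19.647342",
        "2014-03-10 07:21:44.479363",
        "2014-03-10 07:21:45.479581",
        "2014-03-10 07:22:00.482917",
        "2014-03-10 07:22:01.483279",
        "2014-03-10 07:22:30.489594",
    ]),
})

# Bin the rows into 1-minute windows and keep the first row of each window.
resampled = df.set_index("Time").resample("1min").first()

# Empty windows (07:20 here) come back as NaN, which turns the column into floats;
# drop those rows and restore the integer dtype.
result = resampled.dropna().astype({"Record_ID": "int64"})
print(result)
```

Swapping `.first()` for `.last()`, `.mean()` or `.sum()` changes which value represents each window, so the choice depends on what the reduced table is meant to preserve.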
+0

This is good to learn - thanks. – Dickster

+0

Thanks, it works, but a lot of points at the end are missing. Still, it is a good solution. – emax

+0

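Without understanding the data, it is hard to give further advice. Perhaps you could use 'mean' instead of 'first', or is there no data available during that period? – Alexander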

0

I picked up a function called roundTime from here: How to round the minute of a datetime object python

I put your sample data in a file called data.csv

import datetime

import pandas as pd


def roundTime(dt=None, roundTo=60):
    """Round a datetime object to any time lapse in seconds.
    dt : datetime.datetime object, default now.
    roundTo : Closest number of seconds to round to, default 1 minute.
    Author: Thierry Husson 2012 - Use it as you want but don't blame me.
    """
    if dt is None:
        dt = datetime.datetime.now()
    # seconds elapsed since midnight (also correct for pandas Timestamps)
    seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
    # // is floor division: add half the interval, then floor to a multiple of roundTo
    rounding = (seconds + roundTo // 2) // roundTo * roundTo
    return dt + datetime.timedelta(0, rounding - seconds, -dt.microsecond)

df = pd.read_csv('data.csv')
df['Time'] = pd.to_datetime(df['Time'])
df['Time'] = df['Time'].map(roundTime)

# now group by the rounded Time and keep one record per group (here the smallest Record_ID)
print(df.groupby('Time').min())
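
As a possible shortcut, the per-row rounding can be done in one vectorized call with the `Series.dt.round` accessor. This is a minimal sketch on a few rows of the question's sample, not the answer's own method; the name `reduced` is only illustrative.

```python
import pandas as pd

# A few rows copied from the question's sample data.
df = pd.DataFrame({
    "Record_ID": [94704, 94705, 94712, 94713, 94731],
    "Time": pd.to_datetime([
        "2014-03-10 07:19:19.647342",
        "2014-03-10 07:21:44.479363",
        "2014-03-10 07:21:59.482706",
        "2014-03-10 07:22:00.482917",
        "2014-03-10 07:22:30.489594",
    ]),
})

# Round every timestamp to the nearest whole minute in one vectorized call.
df["Time"] = df["Time"].dt.round("1min")

# Keep the first record that lands on each rounded minute.
reduced = df.groupby("Time", as_index=False).first()
print(reduced)
```

Rounding to the nearest minute moves 07:21:44 up to 07:22:00, whereas resampling assigns it to the 07:21 window, so the two approaches can pick different representative rows.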

Or here is an alternative if you don't want to do a group by:

df['Time'] = pd.to_datetime(df['Time'])
df['Time'] = df['Time'].map(roundTime)
# keep only the rows where the rounded Time differs from the previous row
slice_criteria = df['Time'].diff() != pd.Timedelta(0)
print(df[slice_criteria])
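
For rows already sorted by time, the diff-based filter should select the same rows as `drop_duplicates` on the rounded column. The sketch below checks that on a few rows of the sample; it rounds with `Series.dt.round` for brevity, and the names `by_diff` and `by_dedup` are only illustrative.

```python
import pandas as pd

# A few rows copied from the question's sample data, already in time order.
df = pd.DataFrame({
    "Record_ID": [94704, 94705, 94706, 94713, 94731],
    "Time": pd.to_datetime([
        "2014-03-10 07:19:19.647342",
        "2014-03-10 07:21:44.479363",
        "2014-03-10 07:21:45.479581",
        "2014-03-10 07:22:00.482917",
        "2014-03-10 07:22:30.489594",
    ]),
})
df["Time"] = df["Time"].dt.round("1min")

# Rows whose rounded Time differs from the previous row; the first row's diff
# is NaT, which compares unequal to a zero Timedelta, so it is kept.
by_diff = df[df["Time"].diff() != pd.Timedelta(0)]

# Same selection expressed as "first occurrence of each rounded Time".
by_dedup = df.drop_duplicates(subset="Time", keep="first")

print(by_diff)
```

If the data were not sorted, the two would diverge: the diff filter only compares neighbouring rows, while `drop_duplicates` looks at the whole column.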