Using pandas Series.rolling with DateOffset

Python, pandas, and data analysis question here.

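What I want to do is determine the busiest 60-minute interval from a large set of Apache server logs. I have already extracted the timestamps from the logs into a list.

time_received is a list with entries like this: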

[ 
1995-07-01T00:01:18-04:00, 
1995-07-01T00:01:19-04:00, 
1995-07-01T00:01:19-04:00, 
1995-07-01T00:01:19-04:00, 
1995-07-01T00:01:19-04:00, 
1995-07-01T00:01:19-04:00, 
1995-07-01T00:01:19-04:00, 
1995-07-01T00:11:45-04:00, 
1995-07-01T00:11:45-04:00, 
1995-07-01T00:11:45-04:00, 
1995-07-01T00:13:43-04:00, 
1995-07-01T00:13:43-04:00, 
1995-07-01T00:13:43-04:00, 
1995-07-01T00:13:43-04:00, 
1995-07-01T00:13:43-04:00, 
1995-07-01T00:13:46-04:00, 
1995-07-01T00:13:47-04:00, 
1995-07-01T00:13:48-04:00, 
1995-07-01T00:13:48-04:00, 
1995-07-01T00:13:48-04:00, 
1995-07-01T00:13:48-04:00, 
1995-07-01T00:13:48-04:00, 
1995-07-01T00:13:48-04:00, 
1995-07-01T00:13:50-04:00, 
1995-07-01T00:13:53-04:00, 
1995-07-01T00:13:53-04:00, 
1995-07-01T00:13:53-04:00, 
1995-07-01T00:13:53-04:00, 
1995-07-01T00:13:53-04:00, 
1995-07-01T00:13:53-04:00, 
1995-07-01T00:14:11-04:00, 
1995-07-01T00:14:17-04:00, 
1995-07-01T00:14:17-04:00, 
1995-07-01T00:14:17-04:00, 
1995-07-01T00:14:17-04:00, 
1995-07-01T00:14:17-04:00, 
1995-07-01T00:14:17-04:00, 
1995-07-01T00:14:18-04:00, 
1995-07-01T00:14:20-04:00, 
1995-07-01T00:14:20-04:00, 
1995-07-01T00:14:20-04:00, 
1995-07-01T00:14:20-04:00, 
1995-07-01T00:14:20-04:00, 
1995-07-01T00:14:20-04:00, 
1995-07-01T00:14:21-04:00, 
1995-07-01T00:14:21-04:00, 
1995-07-01T00:14:21-04:00, 
1995-07-01T00:14:21-04:00, 
1995-07-01T00:14:21-04:00, 
1995-07-01T00:14:21-04:00, 
1995-07-01T00:14:22-04:00, 
1995-07-01T00:14:22-04:00, 
1995-07-01T00:14:23-04:00, 
1995-07-01T00:14:24-04:00, 
1995-07-01T00:14:24-04:00, 
1995-07-01T00:14:24-04:00, 
1995-07-01T00:14:24-04:00, 
1995-07-01T00:14:24-04:00, 
1995-07-01T00:14:26-04:00, 
1995-07-01T00:14:27-04:00, 
1995-07-01T00:14:30-04:00, 
1995-07-01T00:14:30-04:00, 
1995-07-01T00:14:30-04:00, 
1995-07-01T00:14:30-04:00, 
1995-07-01T00:14:30-04:00, 
1995-07-01T00:14:30-04:00, 
1995-07-01T00:14:31-04:00, 
1995-07-01T00:14:32-04:00, 
1995-07-01T00:14:32-04:00, 
1995-07-01T00:14:32-04:00, 
1995-07-01T00:14:32-04:00, 
1995-07-01T00:14:32-04:00, 
1995-07-01T00:14:36-04:00, 
]

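My goal is to walk along this list of timestamps and get, for any one of those points, the count of entries in the 60-minute interval starting at that point. Once I have the rolling window, I think I can handle the rest.

In the pandas documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html

I found the following about the window parameter: "window : int or offset. Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size. If it's an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0."

I am using pandas 0.19.2. The option of a variable-sized window based on the observations in a time period sounds like exactly what I want, so I tried to implement it: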

import pandas as pd 
from pandas.tseries.offsets import DateOffset 
def busiest_timeframe(data,timeframe = 60):  
    time_window = DateOffset(minutes = 60) 
    print (type(time_window)) 
    series = pd.Series(data) 
    series.rolling(time_window).count() 
    return series 

busiest_tf = busiest_timeframe(time_received)

I get the following error, raised by raise ValueError("window must be an integer"):

ValueError: window must be an integer

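Is there some other offset object I should be using? Does this pandas feature not work? Am I misunderstanding the documentation?

Thank you very much for your help and suggestions!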

Asked 2017-04-03 by Joe Sadaka

So, the first parameter must be an integer. – DyZ

You may be looking for a resampler rather than a window: series.resample('60min').count(). However, a resampler does not roll; it just splits your series into 60-minute groups. – DyZ
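To illustrate the resampling suggestion, here is a minimal sketch. The sample timestamps are a shortened, partly hypothetical list (the two entries after 01:00 are not in the question), and it assumes each request can be represented by a 1 in a Series indexed by its timestamp, since resample needs a datetime-like index.

```python
import pandas as pd

# Hypothetical sample data, loosely based on the timestamps in the question.
stamps = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
    "1995-07-01T01:13:43-04:00",
    "1995-07-01T01:14:36-04:00",
]

# Parse to a DatetimeIndex (normalised to UTC) and store one "hit" per request.
index = pd.to_datetime(stamps, utc=True)
hits = pd.Series(1, index=index)

# "60min" means 60 minutes (a bare "M" alias would mean month-end).
# The bins are fixed, non-overlapping, and aligned to the clock.
per_hour = hits.resample("60min").sum()
print(per_hour)
```

Because the bins are aligned to the clock, a busy hour that straddles a bin boundary is split across two groups, which is why this does not answer the "busiest 60 minutes" question on its own.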

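@DyZ The pandas documentation says: "If it's an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period"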
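The documentation quoted here also says that offset windows are only valid for datetime-like indexes. A minimal sketch of what that looks like, assuming the timestamps are moved into the index and each request is stored as a 1 (the sample list is shortened and partly hypothetical):

```python
import pandas as pd

# Hypothetical sample data, loosely based on the timestamps in the question.
stamps = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
    "1995-07-01T01:13:43-04:00",
    "1995-07-01T01:14:36-04:00",
]

# The timestamps must be the index, and the index must be sorted,
# before an offset string can be used as the window.
hits = pd.Series(1, index=pd.to_datetime(stamps, utc=True)).sort_index()

# For each timestamp t, sum the hits in the trailing window (t - 60min, t].
trailing = hits.rolling("60min").sum()

# Timestamp at which the busiest trailing window ends.
busiest_end = trailing.idxmax()
print(trailing)
print(busiest_end)
```

Note that this window looks backwards from each timestamp, so the result is the end of the busiest interval rather than its start.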

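Unfortunately I don't know how to use series.rolling. It looks like you did not set the timestamps as the index, which is why it did not work. But even after doing that I still got errors, so here is an alternative (maybe a really ugly one). If someone has a better way, it would be good to hear it.

So yes, it uses boolean indexing. Play around with the code if you need to (add lots of print statements), and maybe change >= and <= to > and <.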

liste=[ 
"1995-07-01T00:01:18-04:00", 
"1995-07-01T00:01:19-04:00", 
"1995-07-01T00:01:19-04:00", 
"1995-07-01T00:01:19-04:00", 
"1995-07-01T00:01:19-04:00", 
"1995-07-01T00:01:19-04:00", 
"1995-07-01T00:01:19-04:00", 
"1995-07-01T00:11:45-04:00", 
"1995-07-01T00:11:45-04:00", 
"1995-07-01T00:11:45-04:00", 
"1995-07-01T00:13:43-04:00", 
"1995-07-01T00:13:43-04:00", 
"1995-07-01T00:13:43-04:00", 
"1995-07-01T00:13:43-04:00", 
"1995-07-01T00:13:43-04:00", 
"1995-07-01T00:13:46-04:00", 
"1995-07-01T00:13:47-04:00", 
"1995-07-01T00:13:48-04:00", 
"1995-07-01T00:13:48-04:00", 
"1995-07-01T00:13:48-04:00", 
"1995-07-01T00:13:48-04:00", 
"1995-07-01T00:13:48-04:00", 
"1995-07-01T00:13:48-04:00", 
"1995-07-01T00:13:50-04:00", 
"1995-07-01T00:13:53-04:00", 
"1995-07-01T00:13:53-04:00", 
"1995-07-01T00:13:53-04:00", 
"1995-07-01T00:13:53-04:00", 
"1995-07-01T00:13:53-04:00", 
"1995-07-01T00:13:53-04:00", 
"1995-07-01T00:14:11-04:00", 
"1995-07-01T00:14:17-04:00", 
"1995-07-01T00:14:17-04:00", 
"1995-07-01T00:14:17-04:00", 
"1995-07-01T00:14:17-04:00", 
"1995-07-01T00:14:17-04:00", 
"1995-07-01T00:14:17-04:00", 
"1995-07-01T00:14:18-04:00", 
"1995-07-01T00:14:20-04:00", 
"1995-07-01T00:14:20-04:00", 
"1995-07-01T00:14:20-04:00", 
"1995-07-01T00:14:20-04:00", 
"1995-07-01T00:14:20-04:00", 
"1995-07-01T00:14:20-04:00", 
"1995-07-01T00:14:21-04:00", 
"1995-07-01T00:14:21-04:00", 
"1995-07-01T00:14:21-04:00", 
"1995-07-01T00:14:21-04:00", 
"1995-07-01T00:14:21-04:00", 
"1995-07-01T00:14:21-04:00", 
"1995-07-01T00:14:22-04:00", 
"1995-07-01T00:14:22-04:00", 
"1995-07-01T00:14:23-04:00", 
"1995-07-01T00:14:24-04:00", 
"1995-07-01T00:14:24-04:00", 
"1995-07-01T00:14:24-04:00", 
"1995-07-01T00:14:24-04:00", 
"1995-07-01T00:14:24-04:00", 
"1995-07-01T00:14:26-04:00", 
"1995-07-01T00:14:27-04:00", 
"1995-07-01T00:14:30-04:00", 
"1995-07-01T00:14:30-04:00", 
"1995-07-01T00:14:30-04:00", 
"1995-07-01T00:14:30-04:00", 
"1995-07-01T00:14:30-04:00", 
"1995-07-01T00:14:30-04:00", 
"1995-07-01T00:14:31-04:00", 
"1995-07-01T00:14:32-04:00", 
"1995-07-01T00:14:32-04:00", 
"1995-07-01T00:14:32-04:00", 
"1995-07-01T00:14:32-04:00", 
"1995-07-01T00:14:32-04:00", 
"1995-07-01T00:14:36-04:00" 
] 
import pandas as pd

def busiest_timeframe(data, timeframe=1):
    # Parse the ISO 8601 strings; utc=True converts the -04:00 offset to UTC.
    times = pd.to_datetime(pd.Series(data), utc=True)
    df = times.to_frame(name="time")
    window = pd.Timedelta(seconds=timeframe)  # change seconds to minutes or whatever you want
    # For each timestamp x, count the rows that fall within [x, x + window].
    df["count"] = [((times >= x) & (times <= x + window)).sum() for x in times]
    highest_index = df["count"].idxmax()
    start = df.loc[highest_index, "time"]  # .loc replaces the removed .ix indexer
    # Return the rows that belong to the busiest window.
    return df[(df["time"] >= start) & (df["time"] <= start + window)]

print(busiest_timeframe(liste))
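
The boolean-indexing approach above compares every timestamp against the whole list, which can get slow on a large log. As a possible alternative, here is a sketch that sorts the timestamps once and uses numpy.searchsorted to count the entries in the window starting at each point. The function name forward_counts, its minutes parameter, and the sample list are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def forward_counts(stamps, minutes=60):
    """For each timestamp t, count the entries in [t, t + minutes]."""
    times = pd.to_datetime(pd.Series(stamps), utc=True)
    times = times.sort_values().reset_index(drop=True)
    # Drop the timezone (values stay in UTC) to get a plain datetime64 array.
    values = times.dt.tz_convert(None).to_numpy()
    window = np.timedelta64(minutes, "m")
    # Position of the first entry >= t and of the first entry > t + window.
    first = np.searchsorted(values, values, side="left")
    last = np.searchsorted(values, values + window, side="right")
    return pd.DataFrame({"start": times, "count": last - first})

# Hypothetical sample data, loosely based on the timestamps in the question.
stamps = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
    "1995-07-01T01:13:43-04:00",
    "1995-07-01T01:14:36-04:00",
]
result = forward_counts(stamps)
busiest = result.loc[result["count"].idxmax()]
print(busiest)
```

The closed interval matches the >= and <= comparisons used above; switching side="right" to side="left" in the second searchsorted call would exclude the window's end point.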

Answered 2017-04-04 09:08:29
