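Python, pandas, data analysis: using pandas Series.rolling with a DateOffset
What I want to do is determine the busiest 60-minute interval from a large set of Apache server logs. I have already extracted the timestamps from the logs into a list.
time_received is a list that looks like this: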
[
1995-07-01T00:01:18-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:46-04:00,
1995-07-01T00:13:47-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:50-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:14:11-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:18-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:23-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:26-04:00,
1995-07-01T00:14:27-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:31-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:36-04:00,
]
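My goal is to walk along this list of timestamps and, for each point in it, get the number of entries in the 60-minute interval starting at that point. Once I have the rolling window, I think I can handle the rest.
In the pandas documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html I found the following about the window parameter:
"window : int, or offset. Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size. If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0"
I am using pandas 0.19.2. The option of a variable-sized window based on the observations within a time period sounds like exactly what I want, so I tried to implement it: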
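To make the target concrete, here is a minimal sketch of the counts I am after, written without `rolling` at all. It assumes the timestamps are ISO 8601 strings like the ones above. The helper name `counts_from_each_start` and the `sample` data are made up for illustration, and the window is taken as half-open, `[t, t + 60 min)`, which is my own choice.

```python
import numpy as np
import pandas as pd


def counts_from_each_start(timestamps, minutes=60):
    """For every timestamp t, count the events falling in [t, t + minutes)."""
    # Parse onto one UTC timeline, then drop the tz so .values is plain datetime64.
    idx = pd.to_datetime(list(timestamps), utc=True).tz_localize(None).sort_values()
    t = idx.values
    # Position of the first event at or after each window start and window end.
    first = np.searchsorted(t, t, side="left")
    last = np.searchsorted(t, t + np.timedelta64(minutes, "m"), side="left")
    return pd.Series(last - first, index=idx)


# Hypothetical sample data, not taken from the real logs.
sample = [
    "1995-07-01T00:00:00-04:00",
    "1995-07-01T00:30:00-04:00",
    "1995-07-01T00:59:59-04:00",
    "1995-07-01T01:00:00-04:00",
    "1995-07-01T02:00:00-04:00",
]
print(counts_from_each_start(sample))
```

The busiest interval would then be the entry with the largest count, for example via `idxmax()` on the returned Series.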
import pandas as pd
from pandas.tseries.offsets import DateOffset

def busiest_timeframe(data, timeframe=60):
    time_window = DateOffset(minutes=60)
    print(type(time_window))
    series = pd.Series(data)
    series.rolling(time_window).count()
    return series

busiest_tf = busiest_timeframe(time_received)
I get the following error:
    raise ValueError("window must be an integer")
ValueError: window must be an integer
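Is there some other offset object I should be using? Does this pandas feature not work? Am I misreading the documentation?
Thank you very much for your help and advice!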
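For reference, the documentation quoted above says an offset window is only valid for datetimelike indexes, while the code in the question puts the timestamps in the values of the Series and leaves the default integer index. The sketch below shows a time-based rolling count with the timestamps used as the index. It is a sketch under assumptions, not a confirmed fix: the function name `busiest_window` and the `sample` data are hypothetical, and the window is passed as the offset string `"60min"` rather than a `DateOffset` object.

```python
import pandas as pd


def busiest_window(timestamps, window="60min"):
    """Return (start, end, count) for the busiest trailing time window."""
    # Parse onto one UTC timeline and sort; rolling needs a monotonic index.
    idx = pd.to_datetime(list(timestamps), utc=True).sort_values()
    # One row per distinct timestamp, holding the number of hits at that instant.
    hits = pd.Series(1, index=idx).groupby(level=0).sum()
    # Time-based window: each row sums the hits in (t - window, t].
    counts = hits.rolling(window).sum()
    end = counts.idxmax()
    return end - pd.Timedelta(window), end, int(counts.max())


# Hypothetical sample data, not taken from the real logs.
sample = [
    "1995-07-01T00:00:00-04:00",
    "1995-07-01T00:30:00-04:00",
    "1995-07-01T00:30:00-04:00",
    "1995-07-01T00:59:59-04:00",
    "1995-07-01T01:00:00-04:00",
    "1995-07-01T02:30:00-04:00",
]
start, end, count = busiest_window(sample)
print(start, end, count)
```

Note that this window looks backward from each timestamp (it ends at that point) rather than forward from it, so the reported interval is the 60 minutes leading up to `end`.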
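So, the first parameter must be an integer. – DyZ
You are probably looking for a resampler, not a window: `series.resample('60M').count()`. However, the resampler is not rolling; it just splits your series into 60-minute groups. – DyZ
@DyZ The pandas docs say: "If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period" –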
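To illustrate the resampling suggestion in the comments, here is a minimal sketch of fixed 60-minute bins. One caveat: in pandas frequency strings `M` denotes month-end, so the sketch uses `60min` for minutes. The function name `fixed_bin_counts` and the `sample` data are hypothetical.

```python
import pandas as pd


def fixed_bin_counts(timestamps, freq="60min"):
    """Count events in fixed, non-overlapping bins of the given frequency."""
    idx = pd.to_datetime(list(timestamps), utc=True).sort_values()
    hits = pd.Series(1, index=idx)
    # resample groups rows into clock-aligned bins; it does not slide.
    return hits.resample(freq).sum()


# Hypothetical sample: four hits within 20 minutes, straddling an hour boundary.
sample = [
    "1995-07-01T00:50:00-04:00",
    "1995-07-01T00:55:00-04:00",
    "1995-07-01T01:05:00-04:00",
    "1995-07-01T01:10:00-04:00",
]
bins = fixed_bin_counts(sample)
print(bins)
```

Here the four hits are split across two clock-aligned bins, so no single bin shows all four even though they fall within one 60-minute span. That is the difference from a rolling window that the comment points out.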