Loading an access log file into a data frame

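I need to process access log files. Is it possible to load a log file, such as an access log, into a data frame and process it there? I want to work with the timestamp, the response time, and the request URL.

Example log line: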

128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 "POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" "dk.product.xml.client.transports.ServletBridge" "-"

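Update: I have extracted the response time and the request URL using regular expressions. Now I would like to create a dataset by adding them to a DataFrame.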

# Scalar values are wrapped in lists so that pandas can build a one-row frame
df2 = pd.DataFrame({'time': [pd.Timestamp(timestamp)],
                    'responsetime': [responsetime],
                    'requesturl': [requesturl]})


2013-03-18 jantzen05

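Please provide the desired output for the input sample. – root 2013-03-18 21:25:20

It should be possible. You will most likely get better responses on SO if you at least come up with an approach and try parsing the file yourself and putting the required fields into a data frame. At that stage, post the relevant code and describe the problem you are facing (if you cannot get it to work). In other words, be prepared to answer ["What have you tried?"](http://mattgemmell.com/2008/12/08/what-have-you-tried/ "No attempt, no answer") – crayzeewulf 2013-03-18 21:35:29

I have tried looking through the documentation and did not find a way to do this. – jantzen05 2013-03-18 22:02:26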

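I would recommend using regular expressions and loading the data into some kind of in-memory structure (I assume that is what you mean by data frame).

I like to use Kodos to develop regular expressions: http://kodos.sourceforge.net/

For the log snippet you provided above, the following regular expression will isolate some of the important parts: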

^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"

Kodos also generates some useful code snippets:

import re

# Triple single quotes are used because the pattern and the sample line both end
# with a double quote, which would otherwise collide with a closing """.
rawstr = r'''^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"'''
embedded_rawstr = rawstr  # no embedded flags are needed here, so the pattern is identical
matchstr = '''128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 "POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" "dk.product.xml.client.transports.ServletBridge" "-"'''

# method 1: using a compile object 
compile_obj = re.compile(rawstr) 
match_obj = compile_obj.search(matchstr) 

# method 2: using search function (w/ external flags) 
match_obj = re.search(rawstr, matchstr) 

# method 3: using search function (w/ embedded flags) 
match_obj = re.search(embedded_rawstr, matchstr) 

# Retrieve group(s) from match_obj 
all_groups = match_obj.groups() 

# Retrieve group(s) by index 
group_1 = match_obj.group(1) 
group_2 = match_obj.group(2) 
group_3 = match_obj.group(3) 
group_4 = match_obj.group(4) 
group_5 = match_obj.group(5) 
group_6 = match_obj.group(6) 
group_7 = match_obj.group(7) 
group_8 = match_obj.group(8) 
group_9 = match_obj.group(9) 
group_10 = match_obj.group(10) 

# Retrieve group(s) by name 
host = match_obj.group('host') 
day = match_obj.group('day') 
month = match_obj.group('month') 
timestamp = match_obj.group('timestamp')

You can easily build on this to load the log into memory and start processing it.
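As a rough illustration of how the fields mentioned in the question (timestamp, response time, request URL) might be pulled out and handed to pandas, here is a minimal sketch. It is not the pattern generated by Kodos above: `LOG_PATTERN`, `parse_line`, and `to_dataframe` are hypothetical names, and the assumed line layout is inferred only from the single sample line in the question, so other log formats would likely need a different pattern. The column names follow the ones used in the question.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the one sample line in the question:
# ... [timestamp] response_time "METHOD url PROTOCOL" status bytes ...
LOG_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\]\s+'                # e.g. 04/Mar/2013:07:59:29 +0100
    r'(?P<responsetime>\d+)\s+'               # e.g. 15625 (unit is not stated in the question)
    r'"\S+\s+(?P<requesturl>\S+)[^"]*"'       # e.g. "POST /servlet/XMLHandler HTTP/1.1"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        # %b expects English month abbreviations under the default C locale
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "responsetime": int(match.group("responsetime")),
        "requesturl": match.group("requesturl"),
    }


def to_dataframe(lines):
    """Build a pandas DataFrame from an iterable of log lines, skipping lines that do not match."""
    import pandas as pd  # imported here so that parse_line needs only the standard library

    rows = [row for row in map(parse_line, lines) if row is not None]
    return pd.DataFrame(rows, columns=["time", "responsetime", "requesturl"])
```

Since an open file is an iterable of lines, passing a file object to `to_dataframe` should yield one row per matching log line. Lines that do not fit the assumed layout are silently skipped, which may or may not be what you want for a real log.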


2013-03-18 22:01:10

The question is tagged [tag:pandas]. The OP means a [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/dev/dsintro.html#dataframe). – crayzeewulf 2013-03-18 22:19:13
