Parsing a wget log file in Python

I have a wget log file and would like to parse it so that I can extract the relevant information for each log entry, e.g. IP address, timestamp, URL, etc.

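A sample log file is printed below. The number of lines and the level of detail differ from entry to entry, but the notation used on each line is consistent.

I am able to extract individual lines, but I would like to end up with a multidimensional array (or similar):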

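import re

# Read the whole log into one string; the with-block closes the file.
with open('c:/r1/log.txt', 'r') as log_file:
    f = log_file.read()

# Each entry starts with a header line such as "--2014-11-22 10:51:31-- <url>".
split_log = re.findall(r'--[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*', f)

print(split_log)

print(len(split_log))

for element in split_log:
    print(element)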


####### Start log file example 

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302] 

--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html' 

    0K .......... .......          109K=0.2s 

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429] 

--2014-11-22 10:51:32-- h ttp://www.itb.ie/Vacancies/index.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html' 

    0K .......... .......... ..        118K=0.2s 

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010] 

--2014-11-22 10:51:32-- h ttp://www.itb.ie/Location/howtogetthere.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html' 

    0K .......... .......          111K=0.2s

Source

2014-11-22 Markus

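What is your expected output? – 2014-11-22 11:43:03

Eventually I will write the entries to a database, e.g. IP address, URL, data, etc. From the example above I would therefore need something like date(1), url(1), http_request(1) for the first log entry, then date(2), url(2), http_request(2) for the second, and so on. – Markus 2014-11-22 11:52:06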

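Here is how you can extract the data you need and store it in a list of tuples.

The regexes I've used here aren't perfect, but they work with your sample data. I modified your original regex to use the more readable \d instead of [0-9], which is equivalent for this data. I've also used raw strings, which generally make regexes easier to work with.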
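As a small aside on those two choices, here is a sketch (not part of the original answer) that compares the two spellings on one header line from the sample log. It also shows a caveat that applies to Python 3 str patterns: \d matches other Unicode decimal digits unless re.ASCII is given, whereas [0-9] never does. The variable names are made up for illustration.

```python
import re

# Raw strings keep backslashes literal, so the regex engine sees \d as written.
stamp_pat = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')
stamp_old = re.compile('[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')

line = '--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html'
same_match = stamp_pat.search(line).group() == stamp_old.search(line).group()

# In Python 3, \d on str also matches non-ASCII decimal digits
# unless re.ASCII is given; [0-9] never does.
arabic_three = '\u0663'
unicode_digit = bool(re.fullmatch(r'\d', arabic_three))
ascii_only = bool(re.fullmatch(r'\d', arabic_three, re.ASCII))
bracket_class = bool(re.fullmatch(r'[0-9]', arabic_three))

print(same_match, unicode_digit, ascii_only, bracket_class)
```

For a plain ASCII wget log the two spellings find the same timestamps, so the difference only matters if the input may contain non-ASCII digits.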

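I've embedded your log data in my code as a triple-quoted string so that I don't have to bother with file handling. I notice that some of the URLs in your log file contain spaces, e.g.

h ttp://www.itb.ie/Vacancies/index.html

but I assume those spaces are an artifact of copying & pasting and don't actually exist in the real log data. If that's not the case, your program will need to do extra work to cope with those extraneous spaces.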
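If the stray spaces do turn out to be in the real log, one possible way to handle them is sketched below. It assumes that a genuine URL in the log never contains a literal space, so every whitespace character in the captured text can simply be dropped. The helper name clean_url is hypothetical.

```python
import re

# Same header pattern as in the answer: the URL group takes the rest of the line.
header_pat = re.compile(r'--(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})--\s+(.*)')

def clean_url(raw):
    """Remove every whitespace character from a captured URL (assumption:
    real URLs in the log contain no literal spaces)."""
    return re.sub(r'\s+', '', raw)

sample = '--2014-11-22 10:51:32-- h ttp://www.itb.ie/Vacancies/index.html \n'
timestamp, raw_url = header_pat.search(sample).groups()
print(timestamp, clean_url(raw_url))
```

This also removes any trailing whitespace that the (.*) group picks up at the end of the line.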

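I've also modified the IP addresses in the log data so that they aren't all identical, just to make sure that each IP found by findall is associated with the correct timestamp & URL.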
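Pairing two separate findall results by position only works if every entry has exactly one "Connecting to" line. Since the question says that entries vary, a more defensive variant is sketched here: cut the log into entries at each "--timestamp--" header and look for the IP inside each entry. This is an alternative sketch, not the approach used in the code that follows. The function name parse_entries is hypothetical, and the shortened sample deliberately omits one "Connecting to" line to show the effect.

```python
import re

header_pat = re.compile(
    r'^--(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})--\s+(\S+)', re.MULTILINE)
ip_pat = re.compile(r'Connecting to .*?\|(.*?)\|')

def parse_entries(log_text):
    """Return one (timestamp, url, ip) tuple per header; ip is None if absent."""
    headers = list(header_pat.finditer(log_text))
    entries = []
    for idx, match in enumerate(headers):
        # An entry runs from its header to the next header (or end of text).
        end = headers[idx + 1].start() if idx + 1 < len(headers) else len(log_text)
        block = log_text[match.end():end]
        ip_match = ip_pat.search(block)
        entries.append((match.group(1), match.group(2),
                        ip_match.group(1) if ip_match else None))
    return entries

# Shortened sample; the second entry deliberately lacks a "Connecting to" line.
sample_log = '''--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK

--2014-11-22 10:51:32-- http://www.itb.ie/Vacancies/index.html
HTTP request sent, awaiting response... 200 OK

--2014-11-22 10:51:32-- http://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.26|:80... connected.
'''

for entry in parse_entries(sample_log):
    print(entry)
```

With this layout a missing field shows up as None in its own entry instead of shifting every later IP onto the wrong URL.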

#! /usr/bin/env python 

import re 

log_lines = ''' 

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302] 

--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html' 

    0K .......... .......          109K=0.2s 

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429] 

--2014-11-22 10:51:32-- http://www.itb.ie/Vacancies/index.html 
Connecting to www.itb.ie|193.1.36.25|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html' 

    0K .......... .......... ..        118K=0.2s 

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010] 

--2014-11-22 10:51:32-- http://www.itb.ie/Location/howtogetthere.html 
Connecting to www.itb.ie|193.1.36.26|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html' 

    0K .......... .......          111K=0.2s 
''' 

# Header line of each entry: "--<timestamp>-- <url>"
time_and_url_pat = re.compile(r'--(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})--\s+(.*)')
# IP address between the two | characters on the "Connecting to" line
ip_pat = re.compile(r'Connecting to.*\|(.*?)\|')

time_and_url_list = time_and_url_pat.findall(log_lines)
print('\ntime and url')
print(time_and_url_list)

ip_list = ip_pat.findall(log_lines)
print('\nip')
print(ip_list)

# Pair each (time, url) tuple with the IP at the same position
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]
print('\nall')
print(all_data, '\n')

for t in all_data:
    print(t)

Output

time and url 
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html')] 

ip 
['193.1.36.24', '193.1.36.25', '193.1.36.26'] 

all 
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')] 

('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24') 
('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25') 
('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')

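The last part of this code uses a list comprehension to reorganize the data in time_and_url_list and ip_list into a single list of tuples, using the zip built-in function to process the two lists in parallel. If this part is a bit hard to follow, please let me know & I'll try to explain it further.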
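To make the zip step more concrete, here is a stripped-down sketch with placeholder values (url_a and url_b are made up). It shows what zip produces on its own, how the comprehension flattens each pair, and the equivalent explicit loop.

```python
times_and_urls = [('10:51:31', 'url_a'), ('10:51:32', 'url_b')]
ips = ['193.1.36.24', '193.1.36.25']

# zip pairs the i-th (time, url) tuple with the i-th IP,
# giving nested items of the form ((time, url), ip).
pairs = list(zip(times_and_urls, ips))

# The comprehension unpacks each nested pair into one flat 3-tuple.
flat = [(t, u, i) for (t, u), i in pairs]

# Equivalent explicit loop.
flat_loop = []
for (t, u), i in zip(times_and_urls, ips):
    flat_loop.append((t, u, i))

print(pairs)
print(flat)
```

Note that zip stops at the end of the shorter list, so if the two lists ever differ in length the extra items are silently dropped.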

Source

2014-11-22 13:02:59

Nice. Thanks for that! Exactly what I needed... – Markus 2014-11-22 17:09:09
