2014-11-22 130 views

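Parsing a wget log file in Python

I have a wget log file and want to parse it so that I can extract the relevant information for each log entry, e.g. IP address, timestamp, URL, etc.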

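A sample log file is shown below. The number of lines and the level of detail differ from entry to entry, but the notation of each line is consistent.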

I can extract individual lines, but I would like to end up with a multidimensional array (or similar):

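import re

# Read the whole log into a single string
with open('c:/r1/log.txt', 'r') as f:
    log_text = f.read()

# Find every request header line, e.g. "--2014-11-22 10:51:31-- http://..."
split_log = re.findall('--[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*', log_text)

print(split_log)
print(len(split_log))

for element in split_log:
    print(element)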


####### Start log file example 

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302] 

--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html' 

    0K .......... .......          109K=0.2s 

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429] 

--2014-11-22 10:51:32-- h ttp://www.itb.ie/Vacancies/index.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html' 

    0K .......... .......... ..        118K=0.2s 

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010] 

--2014-11-22 10:51:32-- h ttp://www.itb.ie/Location/howtogetthere.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html' 

    0K .......... .......          111K=0.2s 

What is your expected output? – 2014-11-22 11:43:03


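Ultimately I will be writing the entries to a database, e.g. IP address, URL, data, etc. From the example above, I would therefore need something like date(1), url(1), http_request(1) for the first log entry, then date(2), url(2), http_request(2) for the second, and so on. – Markus 2014-11-22 11:52:06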

Answers


Here is how you can extract the data you need and store it in a list of tuples.

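The regexes I use here are not perfect, but they work with your sample data. I modified your original regex to use the more readable \d instead of the equivalent [0-9]. I also use raw strings, which generally makes working with regexes easier.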

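I have embedded your log data in my code as a triple-quoted string so that I don't have to worry about file handling. I noticed that some of the URLs in your log file contain spaces, e.g.

h ttp://www.itb.ie/Vacancies/index.html

but I assume those spaces are an artifact of copying and pasting and do not actually exist in the real log data. If that is not the case, your program will need to do some extra work to handle those extraneous spaces.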
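If the spaces do turn out to be real, one possible way to deal with them is to remove all whitespace from each captured URL, since a valid URL cannot contain a literal space. This is only a minimal sketch under that assumption; clean_url is a hypothetical helper name, not something from the question or from the re module:

```python
import re


def clean_url(raw_url):
    """Return raw_url with every whitespace character removed (hypothetical helper)."""
    return re.sub(r'\s+', '', raw_url)


# Example taken from the question's log, where a stray space splits the scheme
print(clean_url('h ttp://www.itb.ie/Vacancies/index.html'))
```

It could be applied to the URL part of each match after findall has run, leaving the regexes themselves unchanged.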

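I have also modified the IP addresses in the log data so that they are not all identical, just to make sure that each IP found by findall is associated with the correct timestamp & URL.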

#! /usr/bin/env python 

import re 

log_lines = ''' 

2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302] 

--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html 
Connecting to www.itb.ie|193.1.36.24|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html' 

    0K .......... .......          109K=0.2s 

2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429] 

--2014-11-22 10:51:32-- http://www.itb.ie/Vacancies/index.html 
Connecting to www.itb.ie|193.1.36.25|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html' 

    0K .......... .......... ..        118K=0.2s 

2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010] 

--2014-11-22 10:51:32-- http://www.itb.ie/Location/howtogetthere.html 
Connecting to www.itb.ie|193.1.36.26|:80... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: ignored [text/html] 
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html' 

    0K .......... .......          111K=0.2s 
''' 

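# Capture the timestamp and the URL from each request header line
time_and_url_pat = re.compile(r'--(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})--\s+(.*)')
# Capture the IP address between the pipes on each "Connecting to" line
ip_pat = re.compile(r'Connecting to.*\|(.*?)\|')

time_and_url_list = time_and_url_pat.findall(log_lines)
print('\ntime and url')
print(time_and_url_list)

ip_list = ip_pat.findall(log_lines)
print('\nip')
print(ip_list)

# Merge the two parallel lists into one list of (time, url, ip) tuples
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]
print('\nall')
print(all_data)
print()

for t in all_data:
    print(t)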

Output

time and url 
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html')] 

ip 
['193.1.36.24', '193.1.36.25', '193.1.36.26'] 

all 
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')] 

('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24') 
('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25') 
('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26') 

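The last part of this code uses a list comprehension to reorganize the data in time_and_url_list and ip_list into a single list of tuples, using the zip built-in function to process the two lists in parallel. If that part is a bit hard to follow, let me know and I'll try to explain it.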
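To make the zip step more concrete, here is a small stand-alone sketch. It uses two short hand-written lists shaped like the findall results (the values are copied from the sample output above) and shows the comprehension next to an equivalent explicit loop. The names all_data_loop and short are only illustrative:

```python
# Two parallel lists, shaped like the findall results
time_and_url_list = [
    ('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'),
    ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'),
]
ip_list = ['193.1.36.24', '193.1.36.25']

# zip pairs the items by position: each pair is ((time, url), ip).
# The comprehension unpacks that pair and flattens it into a 3-tuple.
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]

# The same thing written as an explicit loop
all_data_loop = []
for (t, u), i in zip(time_and_url_list, ip_list):
    all_data_loop.append((t, u, i))

# zip stops at the shorter input, so a missing IP silently shortens the result
short = list(zip(time_and_url_list, ip_list[:1]))

print(all_data)
print(len(short))
```

Because zip matches items purely by position, this approach assumes every log entry contributes exactly one timestamp/URL line and exactly one "Connecting to" line. If real logs can omit one of them, parsing each entry as its own block of lines would be a safer basis than two independent findall calls.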


Nice. Thanks for that! Exactly what I needed... – Markus 2014-11-22 17:09:09