Parsing a log file to find related events in Python

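I have a log file that I need to parse to find out whether a certain event is followed by other, related events. Basically, is the first event on its own, or does it have an associated paired event? For example, the data has the following format: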

Timestamp   Event  Property1  Property2  Property3 
1445210282416  E1    A    1    Type1 * 
1445210282434  F1    D    3    Type10  
1445210282490  E1    C    5    Type2 
1445210282539  E2    A    1    Type1 * 
1445210282943  F1    D    1    Type15 
1445210285452  E2    C    4    Type3

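This is a simplified example, but it is essentially the same as the data file. We are trying to find whether an event E1 has a corresponding event E2 for which Property1, Property2 and Property3 are all the same, as in the two events marked with *. The second E1 event (the third data row) has no corresponding E2 event. I also need to keep a count of the E1 events that have no pair, keyed by Property3, for later use.

The files can be quite large (around 1 GB), and holding the whole file in memory at once should be avoided. So I figured I could use a generator.

My initial attempt is: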

import csv
from collections import defaultdict

with open(filename, newline='') as f:
    finding_pair = False             # True while looking for the E2 of the stored E1
    e1 = {}                          # the E1 row whose pair we want to find
    without_pair = defaultdict(int)  # count of E1 events with no pair, keyed by Property3

    # skipinitialspace lets the runs of spaces between columns act as one delimiter
    reader = csv.DictReader(f, delimiter=' ', skipinitialspace=True)

    for row in reader:
        if row['Event'] == 'E1' and not finding_pair:  # find the pair for this row
            # Go through the file after this line to find the E2 event.
            e1 = row
            finding_pair = True
        elif row['Event'] in ('E1', 'F1') and finding_pair:
            continue  # skip this row and keep looking for the pair
        elif row['Event'] == 'E2' and finding_pair:  # see if this is the pair
            if all(row[k] == e1[k] for k in ('Property1', 'Property2', 'Property3')):
                # pair found
                finding_pair = False
                # Go to next E1 line ??
            else:
                # pair not found
                without_pair[e1['Property3']] += 1
                # Go to next E1 line ??

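So, my questions are:

How do I move the iterator back to the E1 on row 3 after it has already advanced to the E2 on row 4 while finding the first pair?
E1 and E2 should be very close together in time (within 1 minute). How can I limit the search for a pair to a 1-minute window after the E1?
Is there a better way to approach this problem?

Source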

2015-10-28 sfactor

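Why not just add each E1 line to a without_pair list, and whenever you reach an E2 line, check whether it matches any of the E1 lines in without_pair? If it does, remove that E1 from without_pair. At the end you are left with only the E1 lines that have no match. – Jeremy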
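
The suggestion in this comment can be sketched in Python roughly as follows. This is only a minimal sketch under assumptions that are not in the question: the function name `count_unpaired` is made up, columns are split on whitespace, and a pair means identical Property1, Property2 and Property3. It consumes the input one line at a time, so passing an open file object keeps the whole file out of memory.

```python
from collections import Counter, defaultdict, deque

def count_unpaired(lines):
    """Return a Counter of unpaired E1 events keyed by Property3.

    `lines` is any iterable of text lines (for example an open file)
    whose first line is the header.
    """
    pending = defaultdict(deque)   # (p1, p2, p3) -> timestamps of E1s still waiting
    rows = iter(lines)
    next(rows, None)               # skip the header line
    for line in rows:
        fields = line.split()
        if len(fields) < 5:
            continue               # ignore blank or malformed lines
        ts, event, p1, p2, p3 = fields[:5]
        key = (p1, p2, p3)
        if event == 'E1':
            pending[key].append(ts)
        elif event == 'E2':
            waiting = pending.get(key)
            if waiting:
                waiting.popleft()  # the oldest waiting E1 is now paired

    unpaired = Counter()
    for (_, _, p3), waiting in pending.items():
        if waiting:
            unpaired[p3] += len(waiting)
    return unpaired

# The sample data from the question (assumed layout: five whitespace-separated columns).
sample = """\
Timestamp Event Property1 Property2 Property3
1445210282416 E1 A 1 Type1
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
"""

print(dict(count_unpaired(sample.splitlines())))  # {'Type2': 1}
```

Because each E2 is checked against a dictionary keyed by the three properties, there is no need to rewind the file iterator: every line is visited exactly once.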

In TXR

Script solution: made by copying data to pair.txr and editing in the extraction and output directives.

$ cat pair.txr 
Timestamp   Event  Property1  Property2  Property3 
@ts1 E1 @p1 @p2 @p3 
@(skip) 
@(line ln) 
@ts2 @e2 @p1 @p2 @p3 
@(output) 
Duplicate of E1 found at line @ln: event @e2 timestamp @ts2. 
@(end)

Run:

$ txr pair.txr data 
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.

Run on some non-matching data:

$ txr pair.txr /etc/motd # failed termination status 
$ echo $? 
1

The data is:

$ cat data 
Timestamp   Event  Property1  Property2  Property3 
1445210282416  E1    A    1    Type1 
1445210282434  F1    D    3    Type10 
1445210282490  E1    C    5    Type2 
1445210282539  E2    A    1    Type1 
1445210282943  F1    D    1    Type15 
1445210285452  E2    C    4    Type3

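If it is a constraint that the second event must specifically be named E2, then we can simply replace the e2 variable with the literal text E2.

If you know that the duplicate must occur within 100 lines, you can use @(skip 100). This avoids wasting time scanning large files that contain no duplicate. Of course, the 100 need not be a constant; it can be computed. If there are multiple duplicates, @(skip :greedy) will find the last one.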
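
The same bounding idea carries over to a streaming Python approach, using time instead of a line count: an E1 that has waited longer than the window can be declared unpaired and dropped, which also keeps memory use small. The following is a minimal sketch, assuming that the timestamps are milliseconds (as the sample values suggest) and that the log is in time order; the names `unpaired_within_window` and `window_ms` are made up for illustration.

```python
from collections import Counter, deque

def unpaired_within_window(lines, window_ms=60_000):
    """Count E1 events with no matching E2 within `window_ms`, keyed by Property3."""
    waiting = deque()      # (timestamp, (p1, p2, p3)) of open E1 events, oldest first
    unpaired = Counter()
    rows = iter(lines)
    next(rows, None)       # skip the header line
    for line in rows:
        fields = line.split()
        if len(fields) < 5:
            continue       # ignore blank or malformed lines
        ts = int(fields[0])
        event = fields[1]
        key = tuple(fields[2:5])

        # Expire E1 events that are now too old to be paired.
        while waiting and ts - waiting[0][0] > window_ms:
            _, old_key = waiting.popleft()
            unpaired[old_key[2]] += 1

        if event == 'E1':
            waiting.append((ts, key))
        elif event == 'E2':
            # Pair with the oldest open E1 that has the same properties.
            for i, (_, open_key) in enumerate(waiting):
                if open_key == key:
                    del waiting[i]
                    break

    # Whatever is still open at end of input never found a pair.
    for _, open_key in waiting:
        unpaired[open_key[2]] += 1
    return unpaired

sample = """\
Timestamp Event Property1 Property2 Property3
1445210282416 E1 A 1 Type1
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
"""

print(dict(unpaired_within_window(sample.splitlines())))
```

The linear scan over `waiting` is acceptable only because the window keeps that queue short; with many open events per window, a dictionary keyed by the properties would be the better choice.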

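Note that although @(line ln) sits on a line of its own, its semantics are that it does not consume a line. It binds the variable ln to the current line number in the input but does not advance to the next line, so the following line of the pattern language is applied to that same input line. Hence ln denotes the line at which that pattern matched.

Now let's do something fun: let's use a variable for E1 as well as for the second event. Also, let's not assume that the event to be matched is the first one: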

Timestamp   Event  Property1  Property2  Property3 
@(skip) 
@ts1 @e1 @p1 @p2 @p3 
@(skip) 
@(line ln) 
@ts2 @e2 @p1 @p2 @p3 
@(output) 
Duplicate of @e1 found at line @ln: event @e2 timestamp @ts2. 
@(end)

As it stands, this code now simply finds the first pair in the data:

$ txr pair.txr data 
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.

What we can now do is constrain the variables from the command line, like this:

# Is there an E1 followed by a duplicate? 
$ txr -De1=E1 pair.txr data 
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539. 

# Is there an E2 followed by a duplicate? 
$ txr -De1=E2 pair.txr data 
$ echo $? 
1 

# Is there some event which is followed by a dupe called E2? 
$ txr -De2=E2 pair.txr data 
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539. 

# Is there a pair of duplicates whose Property3 is Type1? 
$ txr -Dp3=Type1 pair.txr data 
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.
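
For readers following along in Python, the effect of these command-line bindings can be imitated loosely with optional keyword arguments. This is a rough sketch, not a port of TXR's matching: the function name `first_duplicate` is made up, the arguments `e1`, `e2` and `p3` merely play the role of the -D bindings, and the whole input is read into a list, so it only suits small files.

```python
def first_duplicate(lines, e1=None, e2=None, p3=None):
    """Return (first_event, line_number, second_event, timestamp) or None.

    A constraint left as None matches anything, like an unbound variable.
    """
    rows = [line.split()[:5] for line in lines]
    # Line numbers are 1-based and line 1 is the header, so data starts at line 2.
    data = [(n, r) for n, r in enumerate(rows[1:], start=2) if len(r) == 5]
    for i, (_, first) in enumerate(data):
        if e1 is not None and first[1] != e1:
            continue
        if p3 is not None and first[4] != p3:
            continue
        for ln, second in data[i + 1:]:
            if e2 is not None and second[1] != e2:
                continue
            if second[2:] == first[2:]:   # same Property1, Property2, Property3
                return first[1], ln, second[1], second[0]
    return None

sample = """\
Timestamp Event Property1 Property2 Property3
1445210282416 E1 A 1 Type1
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
""".splitlines()

print(first_duplicate(sample, e1='E1'))     # ('E1', 5, 'E2', '1445210282539')
print(first_duplicate(sample, e1='E2'))     # None
print(first_duplicate(sample, p3='Type1'))  # ('E1', 5, 'E2', '1445210282539')
```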

And so on.

Source

2015-12-14 22:08:12 Kaz
