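Parsing Snort logs with PyParsing

I'm having a problem parsing Snort logs with the pyparsing module.

The problem is splitting the Snort log (which has multi-line entries separated by blank lines) and getting pyparsing to parse each entry as a whole chunk, rather than reading line by line and expecting the grammar to work against each line (obviously, it does not).

I have tried converting each chunk to a temporary string and stripping the newlines inside each chunk, but it refuses to process it correctly. I may be wholly on the wrong track, but I don't think so (a similar form works perfectly for syslog-type logs, but those are one-line entries and so lend themselves to a basic file iterator with per-line processing).

Here is a sample of the log and the code I have so far: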

[**] [1:486:4] ICMP Destination Unreachable Communication with Destination Host is Administratively Prohibited [**] 
[Classification: Misc activity] [Priority: 3] 
08/03-07:30:02.233350 172.143.241.86 -> 63.44.2.33 
ICMP TTL:61 TOS:0xC0 ID:49461 IpLen:20 DgmLen:88 
Type:3 Code:10 DESTINATION UNREACHABLE: ADMINISTRATIVELY PROHIBITED HOST FILTERED 
** ORIGINAL DATAGRAM DUMP: 
63.44.2.33:41235 -> 172.143.241.86:4949 
TCP TTL:61 TOS:0x0 ID:36212 IpLen:20 DgmLen:60 DF 
Seq: 0xF74E606 
(32 more bytes of original packet) 
** END OF DUMP 

[**] ...more like this [**]

And the updated code:

def snort_parse(logfile): 
    header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + Suppress("]") + Regex(".*") + Suppress("[**]") 
    cls = Optional(Suppress("[Classification:") + Regex(".*") + Suppress("]")) 
    pri = Suppress("[Priority:") + integer + Suppress("]") 
    date = integer + "/" + integer + "-" + integer + ":" + integer + "." + Suppress(integer) 
    src_ip = ip_addr + Suppress("->") 
    dest_ip = ip_addr 
    extra = Regex(".*") 

    bnf = header + cls + pri + date + src_ip + dest_ip + extra 

    def logreader(logfile): 
     chunk = [] 
     with open(logfile) as snort_logfile: 
      for line in snort_logfile: 
       if line !='\n': 
        line = line[:-1] 
        chunk.append(line) 
        continue 
       else: 
        print chunk 
        yield " ".join(chunk) 
        chunk = [] 

    string_to_parse = "".join(logreader(logfile).next()) 
    fields = bnf.parseString(string_to_parse) 
    print fields

Any help, pointers, RTFMs, "you're doing it wrong"s, etc., are greatly appreciated.

2010-08-04 Sam Halicke

import pyparsing as pyp
import itertools

integer = pyp.Word(pyp.nums)
ip_addr = pyp.Combine(integer+'.'+integer+'.'+integer+'.'+integer)

def snort_parse(logfile):
    # "[**] [gid:sid:rev]", then skip everything up to and including the closing "[**]"
    header = (pyp.Suppress("[**] [")
              + pyp.Combine(integer + ":" + integer + ":" + integer)
              + pyp.Suppress(pyp.SkipTo("[**]", include = True)))
    cls = (
        pyp.Suppress(pyp.Optional(pyp.Literal("[Classification:")))
        + pyp.Regex("[^]]*") + pyp.Suppress(']'))

    pri = pyp.Suppress("[Priority:") + integer + pyp.Suppress("]")
    date = pyp.Combine(
        integer+"/"+integer+'-'+integer+':'+integer+':'+integer+'.'+integer)
    src_ip = ip_addr + pyp.Suppress("->")
    dest_ip = ip_addr

    bnf = header+cls+pri+date+src_ip+dest_ip

    with open(logfile) as snort_logfile:
        # groupby alternates between runs of blank and non-blank lines
        for has_content, grp in itertools.groupby(
                snort_logfile, key = lambda x: bool(x.strip())):
            if has_content:
                tmpStr = ''.join(grp)
                fields = bnf.searchString(tmpStr)
                print(fields)

snort_parse('snort_file')

yields

[['1:486:4', 'Misc activity', '3', '08/03-07:30:02.233350', '172.143.241.86', '63.44.2.33']]
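
The grouping step can be tried on its own, without pyparsing or a log file. The following is only a minimal sketch: the function name `split_entries` and the shortened two-entry sample are made up for illustration, and it assumes entries are separated by one or more blank lines.

```python
import itertools


def split_entries(lines):
    """Yield each run of consecutive non-blank lines as a single string."""
    # The key is True for lines with content and False for blank lines, so
    # groupby produces alternating groups of "entry" and "separator" lines.
    for has_content, group in itertools.groupby(
            lines, key=lambda line: bool(line.strip())):
        if has_content:
            yield "".join(group)


# Hypothetical, shortened sample: two entries separated by two blank lines.
sample = [
    "[**] [1:486:4] first alert [**]\n",
    "[Priority: 3]\n",
    "\n",
    "\n",
    "[**] [1:487:1] second alert [**]\n",
    "[Priority: 2]\n",
]

entries = list(split_entries(sample))
for entry in entries:
    print(repr(entry))
```

Because consecutive blank lines fall into the same "separator" group, extra blank lines between entries do not produce empty chunks.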

2010-08-04 15:45:56 unutbu

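You are a god. This solution is beyond my expertise, but I will implement it as soon as I make myself understand all the working parts. Thank you! – 2010-08-04 15:48:57

+1 - Nice answer, ~unutbu, you beat me to the punch! (Your groupby code looks crazy; I'll have to sort that out when I get a few minutes.) – PaulMcG 2010-08-04 15:55:03

+many just for the cute and elegant use of 'groupby'. – katrielalex 2010-08-04 20:20:03

Well, I don't know Snort or pyparsing, so apologies in advance if I say something stupid. It is not clear to me whether the problem is that pyparsing cannot handle the entries, or that you cannot send them to pyparsing in the right format. If the latter, why not do something like this?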

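def logreader(path_to_file):
    chunk = []
    with open(path_to_file) as theFile:
        for line in theFile:
            if line.strip():      # non-blank line: still inside the current entry
                chunk.append(line)
            elif chunk:           # blank line: the current entry is complete
                yield "".join(chunk)
                chunk = []
    if chunk:                     # last entry, if the file has no trailing blank line
        yield "".join(chunk)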

Of course, if you need to modify each chunk before sending it to pyparsing, you can do so before yielding it.
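
For instance, a variant along these lines would flatten each entry to a single line before handing it on. This is only a sketch: `read_entries` and its `transform` parameter are hypothetical names, and it reads from any iterable of lines (here an in-memory `io.StringIO`) rather than a real Snort log.

```python
import io


def read_entries(handle, transform=" ".join):
    """Yield one transformed chunk per blank-line-separated entry."""
    chunk = []
    for line in handle:
        if line.strip():
            # Drop the newline and surrounding whitespace of each line.
            chunk.append(line.strip())
        elif chunk:
            # Blank line: modify the finished chunk, then yield it.
            yield transform(chunk)
            chunk = []
    if chunk:
        # Final entry when the input does not end with a blank line.
        yield transform(chunk)


log = io.StringIO("a b\nc\n\nd\n")
entries = list(read_entries(log))
print(entries)
```

Passing a different `transform` (any callable that takes the list of lines) changes how the chunk is prepared without touching the reading loop.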

2010-08-04 14:36:13 katrielalex

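Thanks, this is much cleaner than the original, but it is still expecting [**] as the second line rather than as the first line of the next chunk. – 2010-08-04 15:21:07

I'm still not sure I understand. Do you mean that 'pyparsing' doesn't understand the chunks? I thought it treats newlines as whitespace and ignores them. – katrielalex 2010-08-04 15:29:25

You have some regex un-learning to do, but hopefully this won't be too painful. The biggest culprit in your thinking is the use of this construct: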

some_stuff + Regex(".*") + 
       Suppress(string_representing_where_you_want_the_regex_to_stop)

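Each subparser within a pyparsing parser is pretty much standalone, and works sequentially through the incoming text. So the Regex term has no way to look ahead to the next expression to see where the '*' repetition should stop. In other words, the expression Regex(".*") is going to read right up to the end of the line, since that is where ".*" stops when multiline is not specified.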

In pyparsing, this concept is implemented using SkipTo. Here is how your header line is written:

header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) +
          Suppress("]") + Regex(".*") + Suppress("[**]"))

Your ".*" problem gets resolved by changing it to:

header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) +
          Suppress("]") + SkipTo("[**]") + Suppress("[**]"))

Same thing for cls.
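
The difference is easy to check on a single header line. The sketch below is illustrative only: the sample line is shortened, and the names `sid`, `greedy_header` and `skipto_header` are made up here. It uses the camelCase pyparsing names that appear elsewhere in this thread.

```python
from pyparsing import (Combine, ParseException, Regex, SkipTo, Suppress,
                       Word, nums)

integer = Word(nums)
# Common prefix: "[**] [gid:sid:rev]"
sid = (Suppress("[**]") + Suppress("[")
       + Combine(integer + ":" + integer + ":" + integer) + Suppress("]"))

# Regex(".*") consumes the rest of the line, including the closing "[**]",
# so the Suppress("[**]") that follows has nothing left to match.
greedy_header = sid + Regex(".*") + Suppress("[**]")

# SkipTo advances only as far as the next "[**]" and stops there.
skipto_header = sid + SkipTo("[**]") + Suppress("[**]")

line = "[**] [1:486:4] ICMP Destination Unreachable [**]"

try:
    greedy_header.parseString(line)
    greedy_ok = True
except ParseException:
    greedy_ok = False

result = skipto_header.parseString(line)
message = result[1].strip()  # strip in case whitespace before "[**]" is kept

print(greedy_ok)
print(result[0], "|", message)
```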

One last bug: your definition of date is short by one ':' + integer.

date = (integer + "/" + integer + "-" + integer + ":" + integer + "." +
        Suppress(integer))

should be:

date = (integer + "/" + integer + "-" + integer + ":" + integer + ":" +
        integer + "." + Suppress(integer))

I think these changes will be sufficient to start parsing your log data.
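
The date fix can be verified against the timestamp from the sample entry. This is a small sketch with made-up variable names (`short_date`, `fixed_date`, `combined_date`); the `Combine` variant is optional and just shows how to get the timestamp back as one string, as in the other answer.

```python
from pyparsing import Combine, ParseException, Suppress, Word, nums

integer = Word(nums)
stamp = "08/03-07:30:02.233350"

# Original definition: only hours and minutes before the "."
short_date = (integer + "/" + integer + "-" + integer + ":" + integer
              + "." + Suppress(integer))

# Corrected definition: hours, minutes and seconds before the "."
fixed_date = (integer + "/" + integer + "-" + integer + ":" + integer + ":"
              + integer + "." + Suppress(integer))

# Same fields wrapped in Combine, keeping the fractional part too.
combined_date = Combine(integer + "/" + integer + "-" + integer + ":"
                        + integer + ":" + integer + "." + integer)

try:
    short_date.parseString(stamp)
    short_ok = True
except ParseException:
    short_ok = False

pieces = fixed_date.parseString(stamp).asList()
whole = combined_date.parseString(stamp)[0]

print(short_ok)
print(pieces)
print(whole)
```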

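Here are some other style suggestions:

You have a lot of repeated Suppress("]") expressions. I've started defining all my suppressable punctuation in a very compact and easy-to-maintain statement like this: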

LBRACK,RBRACK,LBRACE,RBRACE = map(Suppress,"[]{}")

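(expand it to add whatever other punctuation characters you like). Now I can use these characters by their symbolic names, and I find the resulting code a little easier to read.

You start off header with header = Suppress("[**] [") + .... I never like seeing spaces embedded in literals this way, as it bypasses some of the parsing robustness pyparsing gives you with its automatic whitespace skipping. If for some reason the space between "[**]" and "[" were changed to 2 or 3 spaces, or a tab, then your suppressed literal would fail. Combine this with the previous suggestion, and header would begin with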

header = Suppress("[**]") + LBRACK + ...

I know this is generated text, so variation in this format is unlikely, but it plays better to pyparsing's strengths.
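
To make the whitespace point concrete, here is a small comparison. It is a sketch only: `MARK`, `rigid_header`, `header` and the two sample strings are made up for illustration.

```python
from pyparsing import (Combine, ParseException, SkipTo, Suppress,
                       Word, nums)

LBRACK, RBRACK, LBRACE, RBRACE = map(Suppress, "[]{}")
integer = Word(nums)
MARK = Suppress("[**]")
sid = Combine(integer + ":" + integer + ":" + integer)

# The literal with an embedded space only matches exactly one space.
rigid_header = Suppress("[**] [") + sid + RBRACK + SkipTo("[**]") + MARK

# Separate tokens let pyparsing skip any whitespace between them.
header = MARK + LBRACK + sid + RBRACK + SkipTo("[**]") + MARK

normal = "[**] [1:486:4] msg [**]"
spaced = "[**]    [1:486:4] msg [**]"   # hypothetical variation: 4 spaces

parsed = [header.parseString(text)[0] for text in (normal, spaced)]

try:
    rigid_header.parseString(spaced)
    rigid_ok = True
except ParseException:
    rigid_ok = False

print(parsed)
print(rigid_ok)
```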

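Once you have your fields parsed out, start assigning results names to the different elements within your parser. This will make it a lot easier to get the data out afterward. For instance, changing cls to: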

cls = Optional(Suppress("[Classification:") + 
      SkipTo(RBRACK)("classification") + RBRACK)

will allow you to access the classification data using fields.classification.
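
A short sketch of named results on the second line of the sample entry follows. As far as I know, calling an expression with a string, as in SkipTo(RBRACK)("classification"), is shorthand for setResultsName; the example uses both spellings side by side so they can be compared. The "priority" name is an addition made here for illustration.

```python
from pyparsing import Optional, SkipTo, Suppress, Word, nums

LBRACK, RBRACK = map(Suppress, "[]")
integer = Word(nums)

# Call syntax: expr("name")
cls = Optional(Suppress("[Classification:")
               + SkipTo(RBRACK)("classification") + RBRACK)

# Explicit method: expr.setResultsName("name")
pri = Suppress("[Priority:") + integer.setResultsName("priority") + RBRACK

fields = (cls + pri).parseString("[Classification: Misc activity] [Priority: 3]")

classification = fields.classification.strip()
priority = fields.priority

print(classification)
print(priority)
```

Named fields stay accessible by attribute even if the grammar later gains or loses other tokens, which is harder to guarantee with positional indexing.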

2010-08-04 15:53:30 PaulMcG

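Yup. I admit I definitely reached for the regex hammer on this one (you should see it, it's really kludgy), but I started on this late last night and that is what I came up with. It is definitely a paradigm shift, but as the data volume grows and the data and fields change, pyparsing is the only way to go. Thanks for the insight! – 2010-08-04 15:55:38

And from the author of pyparsing, no less! Thanks again, Paul! – 2010-08-04 16:02:11

Comment with a question: is it no longer necessary to use the setResultsName() method to name fields? It looks like the shortcut above implies that, but I couldn't find it in the docs. Thanks! – 2010-08-04 19:27:23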
