Parsing large files in Python

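How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files do not help, because their contents cannot be converted into some kind of lazy string. The re module only accepts strings as the content argument.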

#include <boost/format.hpp> 
#include <boost/iostreams/device/mapped_file.hpp> 
#include <boost/regex.hpp> 
#include <iostream> 

int main(int argc, char* argv[]) 
{ 
    boost::iostreams::mapped_file fl("BigFile.log"); 
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl); 
    boost::regex expr("something usefull"); 
    boost::match_flag_type flags = boost::match_default; 
    boost::iostreams::mapped_file::iterator start, end; 
    start = fl.begin(); 
    end = fl.end(); 
    boost::match_results<boost::iostreams::mapped_file::iterator> what; 
    while(boost::regex_search(start, end, what, expr)) 
    { 
        std::cout<<what[0].str()<<std::endl; 
        start = what[0].second; 
    } 
    return 0; 
}

To illustrate what I am asking for, I wrote a short example in C++ (with Boost) that does the same thing I want to do in Python.


2012-07-26 Alex

Unless you need multi-line regular expressions, parse the file line by line. – Lenna 2012-07-26 17:06:04

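Perhaps if you rephrased the question to say what you have and what you want to achieve, it would give us a better chance to make suggestions, unless you insist on one particular approach. – 2012-07-26 17:08:28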

It depends on what kind of parsing you are doing.

If your parsing is line-oriented, you can iterate over the lines of the file:

with open("/some/path") as f:
    for line in f:
        parse(line)

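Otherwise, you will need something like chunking: read the data one chunk at a time and parse each chunk. Obviously this takes more care, because the text you are trying to match may straddle a chunk boundary.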
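One way to handle the chunk-boundary problem is sketched below. This is only an illustration, not part of the original answer: the function name `iter_matches` and the parameters `chunk_size` and `overlap` are hypothetical, and the sketch assumes that no match is longer than `overlap` bytes and that the pattern does not rely on lookbehind or anchors across the point where the buffer is cut.

```python
import re

def iter_matches(path, pattern, chunk_size=1 << 20, overlap=4096):
    """Yield regex matches from a file read chunk by chunk.

    Assumption: no single match is longer than `overlap` bytes.
    `pattern` must be a bytes pattern because the file is read in binary mode.
    """
    regex = re.compile(pattern)
    buf = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf += chunk
            # A match starting in the last `overlap` bytes might be cut off,
            # so defer it to the next round. At EOF nothing is deferred.
            limit = len(buf) + 1 if at_eof else max(len(buf) - overlap, 0)
            pos = 0
            for m in regex.finditer(buf):
                if m.start() >= limit:
                    break
                yield m.group(0)
                pos = m.end()
            # Keep only the unconsumed tail for the next round.
            buf = buf[max(pos, limit):]
            if at_eof:
                break
```

For example, `for hit in iter_matches("BigFile.log", b"\\w+>Time Elapsed .*?\n"): print(hit)` would scan the file while holding roughly `chunk_size + overlap` bytes in memory at a time.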


2012-07-26 17:06:45 Julian

Thanks, but I am searching for a pattern in a stream, without regard to line boundaries. – Alex 2012-07-27 08:52:18

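To elaborate on Julian's solution: if you want multi-line regular expressions, you can implement chunking by storing and joining consecutive lines, like this: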

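from collections import deque

# Sliding window over the last N lines (N, f and parse are defined elsewhere).
list_prev_lines = deque(maxlen=N)
for line in f:
    list_prev_lines.append(line)  # the oldest line drops out automatically
    if len(list_prev_lines) == N:
        # Each line keeps its own "\n", so join with an empty separator.
        parse("".join(list_prev_lines))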

This keeps a running list of the last N lines, including the current one, and parses that multi-line group as a single string.


2012-07-26 17:15:48 CosmicComputer

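Yes, but I do not know how many lines are needed (in the general case), and in practice this approach ends up reading the whole file into memory. What I am looking for is a general solution that uses memory-mapped files (because they are easy to use and efficient). – Alex 2012-07-27 08:55:11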

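Now everything works (Python 3.2.3 differs somewhat from Python 2.7 in its interface). To get a working solution in Python 3.2.3, the search pattern just needs to be prefixed with b so that it is a bytes pattern.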

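import re
import mmap
import pprint

def ParseFile(fileName):
    # Open in binary mode: the mapping exposes bytes, not text.
    with open(fileName, "rb") as f:
        print("File opened successfully")
        # Length 0 maps the whole file; the mapping is read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            print("File mapped successfully")
            # The pattern must be bytes (note the b prefix) to search an mmap.
            for item in re.finditer(b"\\w+>Time Elapsed .*?\n", m):
                pprint.pprint(item.group(0))

if __name__ == "__main__":
    ParseFile("testre")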
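The b prefix matters because, in Python 3, an mmap object is bytes-like, and re refuses to apply a str pattern to it (it raises TypeError). The variation below is a small sketch of the same idea that also reports where each match starts and decodes it to text. The helper name `find_in_mapped_file` and the `encoding` parameter are illustrative assumptions, not part of the original answer.

```python
import mmap
import re

def find_in_mapped_file(path, pattern, encoding="utf-8"):
    """Yield (byte_offset, decoded_text) for each match of a bytes pattern.

    Assumption: the matched bytes are valid text in `encoding`.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for match in re.finditer(pattern, m):
                # group(0) is a bytes object copied out of the mapping.
                yield match.start(), match.group(0).decode(encoding)
```

Calling it with a str pattern such as `r"\w+"` instead of a bytes pattern fails with TypeError as soon as the generator is consumed.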


2012-07-27 16:44:15 Alex

This is neat, because it lets you use multi-line regular expressions. – Rotareti 2017-07-26 11:16:05
