
Reading and copying specific blocks of text in Python

I have seen several similar questions (copying trigger lines, or blocks of a given size), but they don't quite match what I'm trying to do. I have a very large text file (output from Valgrind) that I want to cut down to only the parts I need.

The file is structured as follows: it consists of blocks of lines, each beginning with a header line that contains the string 'in loss record'. I want to trigger only on header lines that also contain the string 'definitely lost', then copy all the lines below them, until another header line is reached (at which point the decision process repeats).
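For reference, a typical block in Valgrind output looks roughly like this (illustrative only, not taken from the asker's actual file):

==12345== 24 bytes in 1 blocks are definitely lost in loss record 5 of 10 
==12345==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299) 
==12345==    by 0x400544: main (test.c:6) 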

How can I implement such a select-and-copy script in Python?

Here is what I have tried so far. It works, but I don't think it is the most efficient (or Pythonic) way, so I would like to see faster approaches, since the files I work with are usually large. (This method takes 1.8 seconds for a 290 MB file.)

with open("in_file.txt","r") as fin: 
with open("out_file.txt","w") as fout:                                  
    lines = fin.read().split("\n") 
    i=0 
    while i<len(lines): 
     if "blocks are definitely lost in loss record" in lines[i]: 
      fout.write(lines[i].rstrip()+"\n") 
      i+=1 
      while i<len(lines) and "loss record" not in lines[i]: 
       fout.write(lines[i].rstrip()+"\n") 
       i+=1 
     i+=1 

What have you tried so far? –


By writing some code. If you have found similar questions, then adapt their answers to your specific situation. – jonrsharpe


@jonrsharpe My point is that those similar questions resort to things like 'for line in f' and 'if x in f', which don't work here. – Demosthene

Answers


You could try using mmap with a regular expression, something like this:

import re, mmap 

# create a regex that will match each block of text you want here 
# (a bytes pattern, since mmap exposes the file contents as bytes): 
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M) 
with open(fn, 'r+b') as f: 
    mm = mmap.mmap(f.fileno(), 0) 
    for i, m in enumerate(pat.finditer(mm)): 
        # m is a match for one block that you want 
        print(m.group(1)) 

Since you gave no example input, the regex as written certainly won't match; but you get the idea.

With mmap the whole file is treated as a single string, yet it is not necessarily all in memory, so you can search a huge file and select blocks out of it this way.

If your file fits comfortably in memory, you can read it in and apply the regex directly (pseudo-Python):

with open(fn) as fo: 
    pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M) 
    for i, m in enumerate(pat.finditer(fo.read())): 
        # deal with each block (m.group(1)) here 

If you want a line-by-line, non-regex approach, read the file line by line (assuming it is a \n-delimited text file):

with open(fn) as fo: 
    for line in fo: 
     # deal with each line here 

     # DON'T do something like string=fo.read() and 
     # then iterate over the lines of the string please... 
     # unless you need random access to the lines out of order 
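To make that concrete, here is a minimal sketch of a line-by-line version of the asker's filter (a hypothetical adaptation, reusing the marker strings and file names from the question):

with open("in_file.txt") as fin, open("out_file.txt", "w") as fout: 
    copying = False 
    for line in fin: 
        # every header line ("loss record") resets the copy/skip decision 
        if "loss record" in line: 
            copying = "definitely lost" in line 
        if copying: 
            fout.write(line) 

A single copy/skip flag replaces the index arithmetic, and because the file object is iterated directly, only one line is held in memory at a time.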

Another way to do this is to use groupby to identify the header lines, and to set a function that will either write out or ignore the lines that follow. You can then iterate over the file line by line, keeping the memory footprint small.

import itertools 

def megs(val): 
    return val * (2**20) 

def ignorelines(lines): 
    for line in lines: 
     pass 

# Assuming ASCII or UTF-8, opening in binary saves a little processing by 
# avoiding decode/encode, and the larger buffers mean fewer trips to disk. 
with open('test.log', 'rb', buffering=megs(4)) as infile,\ 
     open('out.log', 'wb', buffering=megs(4)) as outfile: 
    dump_fctn = ignorelines # ignore lines til we see a good header 
    # group by header or contained lines 
    for is_hdr, block in itertools.groupby(infile, lambda x: b'in loss record' in x): 
     if is_hdr: 
      for hdr in block: 
       if b'definitely lost' in hdr: 
        outfile.write(hdr) 
        dump_fctn = outfile.writelines 
       else: 
        dump_fctn = ignorelines 
     else: 
      # either writelines or ignorelines, depending on last header seen 
      dump_fctn(block) 

print(open('out.log').read())
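In case the groupby behavior isn't obvious: it batches consecutive lines that share the same key, so each run of header lines arrives as one group, followed by the group of body lines beneath it. A toy illustration (hypothetical input lines):

import itertools 

lines = [b'x in loss record 1\n', b'body1\n', b'body2\n', b'y in loss record 2\n', b'body3\n'] 
for is_hdr, block in itertools.groupby(lines, lambda x: b'in loss record' in x): 
    print(is_hdr, list(block)) 
# True [b'x in loss record 1\n'] 
# False [b'body1\n', b'body2\n'] 
# True [b'y in loss record 2\n'] 
# False [b'body3\n'] 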