
Reading and copying specific blocks of text in Python

I have seen several similar questions (copying trigger lines, or blocks of a given size), but they don't quite match what I'm trying to do. I have a very large text file (output from Valgrind) that I want to cut down to only the parts I need.

The file is structured as follows: it consists of blocks of lines, each beginning with a header line that contains the string 'in loss record'. I want to trigger only on header lines that also contain the string 'definitely lost', then copy all the lines below them, until another header line is reached (at which point the decision process repeats).
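For reference, a typical block in Valgrind output looks roughly like this (illustrative only, not taken from the asker's actual file):

==12345== 24 bytes in 1 blocks are definitely lost in loss record 5 of 10 
==12345==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299) 
==12345==    by 0x400544: main (test.c:6) 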

How can I implement such a select-and-copy script in Python?

Here is what I have tried so far. It works, but I don't think it is the most efficient (or Pythonic) way, so I would like to see faster approaches, since the files I work with are usually large. (This method takes 1.8 seconds for a 290 MB file.)

with open("in_file.txt","r") as fin: 
with open("out_file.txt","w") as fout:                                  
    lines = fin.read().split("\n") 
    i=0 
    while i<len(lines): 
     if "blocks are definitely lost in loss record" in lines[i]: 
      fout.write(lines[i].rstrip()+"\n") 
      i+=1 
      while i<len(lines) and "loss record" not in lines[i]: 
       fout.write(lines[i].rstrip()+"\n") 
       i+=1 
     i+=1 

What have you tried so far? –


By writing some code. If you have found similar questions, then adapt their answers to your specific situation. – jonrsharpe


@jonrsharpe My point is that those similar questions resort to things like 'for line in f' and 'if x in f', which don't work here. – Demosthene

Answers


You could try using mmap with a regular expression, something like this:

import re, mmap 

# create a regex that will match each block of text you want here 
# (a bytes pattern, since mmap exposes the file contents as bytes): 
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M) 
with open(fn, 'r+b') as f: 
    mm = mmap.mmap(f.fileno(), 0) 
    for i, m in enumerate(pat.finditer(mm)): 
        # m is a match for one block that you want 
        print(m.group(1)) 

Since you gave no example input, the regex as written certainly won't match; but you get the idea.

With mmap the whole file is treated as a single string, yet it is not necessarily all in memory, so you can search a huge file and select blocks out of it this way.

If your file fits comfortably in memory, you can read it in and apply the regex directly (pseudo-Python):

with open(fn) as fo: 
    pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M) 
    for i, m in enumerate(pat.finditer(fo.read())): 
        # deal with each block (m.group(1)) here 

If you want a line-by-line, non-regex approach, read the file line by line (assuming it is a \n-delimited text file):

with open(fn) as fo: 
    for line in fo: 
     # deal with each line here 

     # DON'T do something like string=fo.read() and 
     # then iterate over the lines of the string please... 
     # unless you need random access to the lines out of order 
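To make that concrete, here is a minimal sketch of a line-by-line version of the asker's filter (a hypothetical adaptation, reusing the marker strings and file names from the question):

with open("in_file.txt") as fin, open("out_file.txt", "w") as fout: 
    copying = False 
    for line in fin: 
        # every header line ("loss record") resets the copy/skip decision 
        if "loss record" in line: 
            copying = "definitely lost" in line 
        if copying: 
            fout.write(line) 

A single copy/skip flag replaces the index arithmetic, and because the file object is iterated directly, only one line is held in memory at a time.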

Another way to do this is to use groupby to identify the header lines, and to set a function that will either write out or ignore the lines that follow. You can then iterate over the file line by line, keeping the memory footprint small.

import itertools 

def megs(val): 
    return val * (2**20) 

def ignorelines(lines): 
    for line in lines: 
     pass 

# Assuming ASCII or UTF-8, opening in binary saves a little processing by 
# avoiding decode/encode, and the larger buffers mean fewer trips to disk. 
with open('test.log', 'rb', buffering=megs(4)) as infile,\ 
     open('out.log', 'wb', buffering=megs(4)) as outfile: 
    dump_fctn = ignorelines # ignore lines til we see a good header 
    # group by header or contained lines 
    for is_hdr, block in itertools.groupby(infile, lambda x: b'in loss record' in x): 
     if is_hdr: 
      for hdr in block: 
       if b'definitely lost' in hdr: 
        outfile.write(hdr) 
        dump_fctn = outfile.writelines 
       else: 
        dump_fctn = ignorelines 
     else: 
      # either writelines or ignorelines, depending on last header seen 
      dump_fctn(block) 

print(open('out.log').read())
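In case the groupby behavior isn't obvious: it batches consecutive lines that share the same key, so each run of header lines arrives as one group, followed by the group of body lines beneath it. A toy illustration (hypothetical input lines):

import itertools 

lines = [b'x in loss record 1\n', b'body1\n', b'body2\n', b'y in loss record 2\n', b'body3\n'] 
for is_hdr, block in itertools.groupby(lines, lambda x: b'in loss record' in x): 
    print(is_hdr, list(block)) 
# True [b'x in loss record 1\n'] 
# False [b'body1\n', b'body2\n'] 
# True [b'y in loss record 2\n'] 
# False [b'body3\n'] 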