
I have already read several posts, including this one, but none of them helped: I get a memory error when splitting a big file into smaller files in Python.

Here is the Python code I currently have, which splits the file.

My input file is 15G in size, and I am splitting it into 128MB pieces. My computer has 8G of RAM.

import sys

def read_line(f_object, terminal_byte):
    # Read one byte at a time until the terminal byte is hit,
    # then re-append it so it survives in the output.
    line = ''.join(iter(lambda: f_object.read(1), terminal_byte))
    line += "\x01"
    return line

def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)

def make_chunks(f_object, terminal_byte, max_size):
    # Accumulate lines until the chunk exceeds max_size bytes.
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield "".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield ''.join(current_chunk)

inputfile = sys.argv[1]

with open(inputfile, "rb") as f_in:
    for i, chunk in enumerate(make_chunks(f_in, bytes(chr(1)), 1024*1000*128)):
        with open("out%d.txt" % i, "wb") as f_out:
            f_out.write(chunk)

When I execute the script, I get the following error:

Traceback (most recent call last): 
    File "splitter.py", line 30, in <module> 
    for i,chunk in enumerate(make_chunks(f_in, bytes(chr(1)),1024*1000*128)): 
    File "splitter.py", line 17, in make_chunks 
    for line in read_lines(f_object,terminal_byte): 
    File "splitter.py", line 12, in read_lines 
    tmp = read_line(f_object,terminal_byte) 
    File "splitter.py", line 4, in read_line 
    line = ''.join(iter(lambda:f_object.read(1),terminal_byte)) 
MemoryError 

What is the terminal byte? Does it actually find one before it has used 8 gigabytes of memory? In other words, where are you expecting a '\x01'?
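This comment points at the likely failure mode: iter(callable, sentinel) stops only when the callable returns a value equal to the sentinel. Once the file hits EOF, read(1) returns an empty string forever, which never equals the sentinel, so the join in read_line keeps accumulating entries until memory runs out. A minimal, bounded sketch of that behaviour, using an in-memory buffer instead of the 15G file:

import io

# iter(callable, sentinel) stops only when the callable returns the sentinel.
# After EOF, read(1) returns b'' forever, and b'' never equals b'\x01',
# so the iterator never terminates on data lacking a trailing \x01.
buf = io.BytesIO(b"no terminator in this data")
reads = iter(lambda: buf.read(1), b"\x01")

# Pull a bounded number of values by hand; ''.join(reads) would loop
# forever here, growing its internal list until MemoryError.
for _ in range(40):
    value = next(reads)
print(repr(value))   # b'' -- the sentinel b'\x01' is never seen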


Also, your 'max_size' is 131072000. But that is a count of *lines*, so just the *list itself, not counting its contents*, would take 1024 * 1000 * 128 * 1e-9 * 8 gigabytes, which is about 1.05 gigabytes... and again, that excludes the actual objects held in the current_chunk list. A string the size of "the quick brown fox jumps over the lazy dog" is about 81 bytes, so that many average-sized strings would need 1024 * 1000 * 128 * 1e-9 * 81 gigabytes, which is about 10.6 gigs! Your code was doomed to fail from the very start...
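Spelling out that comment's arithmetic (rough CPython ballpark figures, not exact measurements):

max_size = 1024 * 1000 * 128      # 131,072,000 items

# Pointer array of a CPython list with that many entries (64-bit: 8 bytes each)
print(max_size * 8 / 1e9)         # ~1.05 GB before counting the contents

# If each entry were an ~81-byte string object
print(max_size * 81 / 1e9)        # ~10.6 GB for the strings themselves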


Basically, if you are just trying to read/write in 128MB chunks, then all of this seems unnecessary... you could simply do 'f_out.write(f_in.read(128000))' in a loop, as sketched below... What is the rest of this rigmarole supposed to accomplish, anyway?
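If record boundaries did not matter, that suggestion would reduce to a plain fixed-size copy loop; a minimal sketch (the out%d.txt naming is taken from the question, and 128 MiB is assumed as the target piece size):

import sys

CHUNK = 128 * 1024 * 1024         # 128 MiB per output file

with open(sys.argv[1], "rb") as f_in:
    i = 0
    while True:
        data = f_in.read(CHUNK)
        if not data:
            break
        with open("out%d.txt" % i, "wb") as f_out:
            f_out.write(data)
        i += 1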

Answer


Question: splitting big file into smaller files

Instead of searching for every \x01, do this only on the last chunk: either reset the file pointer to offset+1 of the last \x01 found and continue, or write everything up to offset into the current chunk file and the remaining part of chunk into the next chunk file.

Note: Your chunk_size should be io.DEFAULT_BUFFER_SIZE or a multiple of it.
You gain no speedup from raising the chunk_size too high.
Read this relevant SO Q&A: Default buffer size for a file
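For reference, the interpreter's default buffer size can be checked directly:

import io
print(io.DEFAULT_BUFFER_SIZE)     # typically 8192 on CPython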

My example shows how to reset the file pointer, e.g.:

import io

large_data = b"""Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01"""

def split(chunk_size, split_size):
    with io.BytesIO(large_data) as fh_in:
        _size = 0
        # Used to verify chunked writes
        result_data = io.BytesIO()

        while True:
            chunk = fh_in.read(chunk_size)
            print('read({})'.format(bytearray(chunk)))
            if not chunk: break

            _size += chunk_size
            if _size >= split_size:
                _size = 0
                # Split on last 0x01
                l = len(chunk)
                print('\tsplit_on_last_\\x01({})\t{}'.format(l, bytearray(chunk)))

                # Reverse iterate
                for p in range(l-1, -1, -1):
                    c = chunk[p:p+1]
                    if ord(c) == ord('\x01'):
                        offset = l-(p+1)

                        # Condition if \x01 is the last byte in chunk
                        if offset == 0:
                            print('\toffset={} write({})\t\t{}'.format(offset, l - offset, bytearray(chunk)))
                            result_data.write(chunk)
                        else:
                            # Reset file pointer
                            fh_in.seek(fh_in.tell()-offset)
                            print('\toffset={} write({})\t\t{}'.format(offset, l-offset, bytearray(chunk[:-offset])))
                            result_data.write(chunk[:-offset])
                        break
            else:
                print('\twrite({}) {}'.format(chunk_size, bytearray(chunk)))
                result_data.write(chunk)

        print('INPUT :{}\nOUTPUT:{}'.format(large_data, result_data.getvalue()))

if __name__ == '__main__':
    split(chunk_size=30, split_size=60)

Output:

read(bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci')) 
    write(30) bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci') 
read(bytearray(b'ng elitr, sed\x01labore et\x01dolore')) 
    split_on_last_\x01(30) bytearray(b'ng elitr, sed\x01labore et\x01dolore') 
    offset=6 write(24)  bytearray(b'ng elitr, sed\x01labore et\x01') 
read(bytearray(b'dolores et ea rebum.\x01magna ali')) 
    write(30) bytearray(b'dolores et ea rebum.\x01magna ali') 
read(bytearray(b'quyam erat,\x01')) 
    split_on_last_\x01(12) bytearray(b'quyam erat,\x01') 
    offset=0 write(12)  bytearray(b'quyam erat,\x01') 
read(bytearray(b'')) 
INPUT :b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01' 
OUTPUT:b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01' 

Tested with Python 3.4.2
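Mapping this back onto the question, here is one way the same reset-the-file-pointer idea could drive the numbered output files of the original script. This is a sketch, not part of the answer above: split_file and its default parameters are illustrative, and it has not been run against a real 15G input.

import sys

def split_file(path, chunk_size=1024 * 1024, split_size=128 * 1024 * 1024):
    # Split `path` into out0.txt, out1.txt, ... of roughly split_size
    # bytes each, cutting only right after a \x01 byte.
    part, written = 0, 0
    with open(path, "rb") as f_in:
        f_out = open("out%d.txt" % part, "wb")
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                break
            if written + len(chunk) >= split_size:
                cut = chunk.rfind(b"\x01") + 1
                if cut:                             # found a boundary:
                    f_out.write(chunk[:cut])        # write up to it,
                    f_in.seek(cut - len(chunk), 1)  # rewind past it,
                    f_out.close()                   # and rotate files
                    part, written = part + 1, 0
                    f_out = open("out%d.txt" % part, "wb")
                    continue
            f_out.write(chunk)
            written += len(chunk)
        f_out.close()

if __name__ == '__main__':
    split_file(sys.argv[1])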
