2013-04-29 110 views
11

Splitting a large text file into smaller text files by line count using Python

I have a text file, say really_big_file.txt, that contains:

line 1 
line 2 
line 3 
line 4 
... 
line 99999 
line 100000 

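I want to write a Python script that divides really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would hold lines 1-300, small_file_600 would hold lines 301-600, and so on, until there are enough small files to contain all the lines from the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.

Answers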

17

Use the itertools grouper recipe:

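try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # renamed in Python 3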

def grouper(n, iterable, fillvalue=None): 
    "Collect data into fixed-length chunks or blocks" 
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx 
    args = [iter(iterable)] * n 
    return izip_longest(fillvalue=fillvalue, *args) 

n = 300 

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

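The advantage of this method, as opposed to storing every line in a list, is that it works with iterables, line by line, so the whole big file never has to be held in memory at once (only one group of n lines at a time).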
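
To see what the recipe does before pointing it at a file, here is a minimal sketch on a short string. It assumes Python 3, where the function is named zip_longest, and the sample input is purely illustrative.

```python
from itertools import zip_longest  # Python 3 name of izip_longest


def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    # The same iterator is repeated n times, so each output tuple
    # pulls n consecutive items from it.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


groups = list(grouper(3, 'ABCDEFG', 'x'))
print(groups)
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
```

A file object is an iterable of lines, so passing it in place of the string yields tuples of n lines, with the last tuple padded by the fill value.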

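Note that the last file in this case will be named small_file_100200 but will only go up to line 100000. This happens because fillvalue='', meaning nothing is written to the file once there are no more lines left, because the line count does not divide evenly into groups. You can fix this by writing to a temporary file and renaming it afterwards, instead of naming it first as I did. Here is how that can be done.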

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        # dir='.' keeps the temp file on the same filesystem as its final
        # name, so os.rename cannot fail with a cross-device error
        with tempfile.NamedTemporaryFile('w', delete=False, dir='.') as fout:
            for j, line in enumerate(g, 1):  # count number of lines in group
                if line is None:
                    j -= 1  # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

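This time fillvalue=None, and I check each line for None. When it occurs, I know the input is exhausted, so I subtract 1 from j so the filler is not counted, and then the file gets its final name.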
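
If padding and temporary files feel like too much machinery, a different technique (not the one used in this answer) is to pull up to n lines at a time with itertools.islice, so the last chunk is simply shorter and no filler is needed. A minimal sketch, assuming Python 3; the function name split_by_lines and its parameters are made up for illustration.

```python
from itertools import islice


def split_by_lines(src_path, n=300, pattern='small_file_{0}.txt'):
    """Write src_path out in chunks of at most n lines; return the new names."""
    names = []
    last_line = 0  # number of the last line written so far
    with open(src_path) as f:
        while True:
            chunk = list(islice(f, n))  # at most n lines; [] at end of file
            if not chunk:
                break
            last_line += len(chunk)
            # Name each file after the last line it really contains.
            name = pattern.format(last_line)
            with open(name, 'w') as fout:
                fout.writelines(chunk)
            names.append(name)
    return names
```

Calling split_by_lines('really_big_file.txt') on the 100000-line file from the question would end with small_file_100000.txt rather than small_file_100200.txt. Each chunk of up to n lines is held in memory while it is written.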

+1

If you are using the first script in Python 3.x, replace `izip_longest` with the new `zip_longest`: https://docs.python.org/3/library/itertools.html#itertools.zip_longest – 2017-03-22 08:45:27

0
lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file; each line already ends in '\n'
                small_file.write(''.join(lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.write(''.join(lines))
        created_files += 1

print('%s small files (with up to %s lines each) were created.' % (created_files,
                                                                   lines_per_file))
+0

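The only problem is that with this method you have to store each `small_file` in memory before writing it out, which may or may not be an issue. Of course, you can fix that by modifying it to write to the file line by line. – jamylak 2013-04-30 00:32:32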

2

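I do this in a more understandable way, using fewer shortcuts, to give you a better understanding of how and why it works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the code is doing.

Because you posted no code, I decided to do it this way: you may be unfamiliar with anything beyond basic Python syntax, since the way you phrased the question suggests you had not tried anything and had no idea how to approach the problem.

Here are the steps to do this in basic Python:

First, read your file into a list for safekeeping: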

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need a way of creating the new files by name. I would suggest a loop along with a couple of counters:

outer_count = 1 
line_count = 0 
sorting = True 
while sorting: 
    count = 0 
    increment = (outer_count-1) * 300 
    left = len(hold_lines) - increment 
    file_name = "small_file_" + str(outer_count * 300) + ".txt" 

Third, inside that loop you need some nested loops that save the correct rows into a list:

    hold_new_lines = []
    if left <= 300:  # 300 or fewer lines remain, so this is the last file
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

Last, still inside your first loop, you need to write the new file and increment your outer counter so that the loop runs again and writes the next file:

    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

Note: if the number of lines is not divisible by 300, the last file will have a name that does not correspond to the number of its last line.
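
The note above comes down to simple arithmetic. Here is a small sketch using the question's 100000-line file as the example; the helper name last_line_in_file is hypothetical.

```python
def last_line_in_file(file_number, total_lines, per_file=300):
    """Number of the last line that actually lands in the given file (1-based)."""
    return min(file_number * per_file, total_lines)


total_lines = 100000
per_file = 300
num_files = -(-total_lines // per_file)  # ceiling division

print(num_files)                                  # 334
print(num_files * per_file)                       # 100200, the number in the file name
print(last_line_in_file(num_files, total_lines))  # 100000, the real last line
```

Building the file name from min(outer_count * 300, len(hold_lines)) instead of outer_count * 300 would make the last name match its contents.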

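It is important to understand why these loops work. On each pass through the loop, the name of the file you write changes, because the name depends on a variable that changes. This is a very useful scripting technique for accessing, opening, writing and organizing files.

In case you could not follow what goes in which loop, here is the script in its entirety: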

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left <= 300:  # 300 or fewer lines remain, so this is the last file
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
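
Once the loops make sense, the same read-everything-then-write idea can be expressed more compactly with list slicing. This is a sketch of an alternative, not the code above; the function name split_with_slices and its parameters are illustrative, and like the original it loads the whole file into memory.

```python
def split_with_slices(my_file, per_file=300, prefix='small_file_'):
    """Split my_file into files of per_file lines; return the names written."""
    with open(my_file, 'r') as text_file:
        hold_lines = text_file.readlines()

    written = []
    # start takes the values 0, per_file, 2 * per_file, ...
    for start in range(0, len(hold_lines), per_file):
        # Slicing past the end is safe: the last slice is just shorter.
        hold_new_lines = hold_lines[start:start + per_file]
        file_name = prefix + str(start + per_file) + '.txt'
        with open(file_name, 'w') as next_file:
            next_file.writelines(hold_new_lines)
        written.append(file_name)
    return written
```

The file names follow the same scheme as the loop version, so the last name is still a multiple of per_file even when the last file is shorter.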
+0

Excellent, @Ryan Saxe! – Lucas 2016-06-09 23:08:42

11
lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()
+0

Nice, short code; works like a charm – MoizNgp 2018-01-30 07:35:56

3
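import csv
import os
import re

MAX_CHUNKS = 300  # rows per output file


def writeRow(idr, row):
    # Python 3: open CSV files in text mode with newline=''
    # (on Python 2, use mode 'ab' and drop the newline argument)
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)


def cleanup():
    # Remove output files left over from a previous run; the pattern
    # matches only names of the form writeRow creates
    for f in os.listdir("."):
        if re.match(r"file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))


def main():
    cleanup()
    with open("large_file.csv", 'r', newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        idr = 1
        for i, x in enumerate(r):
            # Move on to the next file after every MAX_CHUNKS rows
            if i and i % MAX_CHUNKS == 0:
                idr += 1
            writeRow(idr, x)


if __name__ == "__main__":
    main()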
+0

Hey, quick question: would you mind explaining why you used quotechar='\"'? Thanks – Jiraheta 2016-02-09 20:39:33

+0

I used it because I had a different quote character (|) in my case. Here it is set to the default quote character (the double quote, "). – Varun 2016-02-22 08:15:41
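
To make the quotechar discussion concrete, here is a small sketch of how csv.reader treats a line under different quote characters. The sample rows are invented for illustration, and it assumes Python 3.

```python
import csv
import io

pipe_line = '|a,b|,c\n'    # first field quoted with |
double_line = '"a,b",c\n'  # first field quoted with "

# With quotechar='|', the comma inside |...| is part of the field.
pipe_row = next(csv.reader(io.StringIO(pipe_line), delimiter=',', quotechar='|'))
print(pipe_row)     # ['a,b', 'c']

# quotechar='"' is the csv module's default, so passing it is optional.
double_row = next(csv.reader(io.StringIO(double_line), delimiter=',', quotechar='"'))
print(double_row)   # ['a,b', 'c']

# Reading the |-quoted line with the default quote character splits
# on every comma and keeps the | characters as data.
default_row = next(csv.reader(io.StringIO(pipe_line)))
print(default_row)  # ['|a', 'b|', 'c']
```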

+0

For users concerned about speed: this code split a CSV file of 98,500 records (about 13 MB) in roughly 2.31 seconds. I would say that is pretty good. – 2017-05-08 13:45:41