2016-12-06 73 views

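python - Split a file into multiple txt files with as many lines as possible without exceeding a maximum file size based on character count

I'm new to Python, but I've struggled to find a clear answer to this problem. I need to split a large text file into chunks smaller than 1 MB (500000 characters is safe for 1-2 byte characters), but I need to break at the nearest line break without going over the limit. Since I couldn't find a clear way to determine the file size, I took the following approach to find the line reached before the character limit is hit (not perfect, but it rests on the assumption that most characters are 1 byte, which is safe):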

chars = words = lines = 0

with open('rawfile.txt', 'r') as in_file:
    for line in in_file:
        while chars < 500000:
            lines += 1
            words += len(line.split())
            chars += len(line)
        # print(lines, words, chars)
        linebreak = lines - 1
        print(linebreak)
        chars = words = lines = 0

This returns the number of the last line reached before the 500000-character limit is exceeded.

I'm struggling to do the following:

Set start_line to 0 and end_line to linebreak
Save lines start_line through end_line to a new file
Run the function again, starting from line linebreak
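
The three steps above can be sketched roughly as follows. This is only an illustration, not code from the thread: `find_breaks` and `write_ranges` are hypothetical helper names, the `part_` prefix is made up, and the sketch assumes the lines fit in memory (for example from `in_file.readlines()`). Unlike the `while` loop in the question, each line is counted exactly once.

```python
def find_breaks(lines, limit=500000):
    """Return (start_line, end_line) pairs, with end_line exclusive.

    Each range holds as many whole lines as fit within `limit` characters.
    A single line longer than `limit` gets a range of its own.
    """
    ranges = []
    start = 0
    chars = 0
    for i, line in enumerate(lines):
        # Close the current range just before the line that would go over.
        if chars > 0 and chars + len(line) > limit:
            ranges.append((start, i))
            start = i
            chars = 0
        chars += len(line)
    if chars > 0:
        ranges.append((start, len(lines)))
    return ranges


def write_ranges(lines, ranges, prefix='part_'):
    """Write each (start_line, end_line) slice to its own numbered file."""
    for n, (start_line, end_line) in enumerate(ranges):
        with open(prefix + str(n) + '.txt', 'w') as out_file:
            out_file.writelines(lines[start_line:end_line])
```

Computing all break points first requires a second pass to write the files, which is the extra work the single-pass approach in the answers avoids.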

Any suggestions? I'm open to better approaches.

Answers


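Don't do it that way; instead, write the lines out as you read them the first time. When you hit a line that would put you over the limit, close the current file and start a new one.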

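chars = words = lines = fnum = 0
limit = 500000

# the output file needs its own name and must be opened for writing
out_file = open('newfile_' + str(fnum) + '.txt', 'w')
with open('rawfile.txt', 'r') as in_file:

    for line in in_file:
        # chars > 0 avoids leaving an empty file behind when a single
        # line is longer than the limit
        if chars > 0 and chars + len(line) > limit:
            # close out_file and open the next one
            out_file.close()
            fnum += 1
            # reset the per-file counters, but keep fnum
            chars = words = lines = 0
            out_file = open('newfile_' + str(fnum) + '.txt', 'w')

        lines += 1
        words += len(line.split())
        out_file.write(line)
        chars = chars + len(line)

out_file.close()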

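Thanks for the response! However, I couldn't get this to work out of the box. One problem I see is that the new file is opened read-only, so I changed the mode to w/w+ to give it write access. Also, both files use in_file, so my guess is to implement it as 'with open(newfile, 'w') as outfile, open(oldfile, 'r', encoding='utf-8') as infile:'. I'll play with it some more, but thanks for putting me on the right track! – mcraniseq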


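Got it working! I added my script above, but the code only generated one file. After rereading the code, I realized that fnum was being reset to 0. Once that was removed, it worked like a charm, thanks! 'with open('newfile_' + str(fnum) + '.txt', 'w+') as outfile, open('rawfile.txt', 'r') as in_file: # and remove the reset of fnum to 0: chars = words = lines = 0' – mcraniseq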


So sorry! Yes, the fnum reset was my mistake; I copy-pasted it at the end of my "volunteer" day. Glad this worked. – Prune
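
Since the real goal in the question is staying under 1 MB rather than under 500000 characters, the same single-pass idea can count encoded bytes instead of characters. The following is a minimal sketch under stated assumptions, not code from the thread: `split_by_bytes` is a hypothetical name, Python 3 and UTF-8 text are assumed, and `newline=''` is used so that line endings are copied unchanged and the byte count matches what lands on disk.

```python
def split_by_bytes(in_path, out_prefix, max_bytes=1000000, encoding='utf-8'):
    """Copy in_path into numbered files of at most max_bytes each.

    Files break only at line ends. A single line larger than max_bytes
    is written to a file of its own, which then exceeds the limit.
    Returns the list of output paths.
    """
    out_paths = []
    out_file = None
    size = 0
    try:
        with open(in_path, 'r', encoding=encoding, newline='') as in_file:
            for line in in_file:
                # Size of this line once encoded (assumes a BOM-less
                # encoding such as UTF-8).
                line_bytes = len(line.encode(encoding))
                need_new = out_file is None or (
                    size > 0 and size + line_bytes > max_bytes)
                if need_new:
                    if out_file is not None:
                        out_file.close()
                    path = '%s%d.txt' % (out_prefix, len(out_paths))
                    out_file = open(path, 'w', encoding=encoding, newline='')
                    out_paths.append(path)
                    size = 0
                out_file.write(line)
                size += line_bytes
    finally:
        if out_file is not None:
            out_file.close()
    return out_paths
```

A call such as split_by_bytes('rawfile.txt', 'newfile_') would then produce newfile_0.txt, newfile_1.txt, and so on, without relying on the guess that most characters are 1 byte.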


Something like this?

# open file for reading 
anin = open('temp.txt') 

# set the char limit 
charlimit = 100 

# index of line being processed 
anindex = 0 

# output text buffer 
anout = '' 

# index of file to output 
acount = 1 

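def wrapFile():
    # both names are reassigned below, so both must be declared global
    global anout, acount

    if anout == '': return

    # write the buffered text to the next numbered chunk file
    achunk = 'chunk.' + str(acount) + '.txt'
    achunk = open(achunk, 'w')
    achunk.write(anout)
    achunk.close()
    acount += 1
    anout = ''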

while True:
    anindex += 1
    aline = anin.readline()

    # EOF case
    if aline == '':
        wrapFile()
        anin.close()
        break

    # next line within limit case
    if len(anout + aline) <= charlimit:
        anout += aline
        continue

    # next line out of limit cases
    if len(anout) > 0:
        wrapFile()

    anout = aline

    # new line is within the char limit itself
    if len(anout) <= charlimit:
        continue

    # new line exceeds char limit: warn, then write it to its own chunk
    print('Line %d alone exceeds the given char limit!' % anindex)
    wrapFile()
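
The same buffering logic can also be written without module-level globals. The sketch below is one possible restructuring, not the answerer's code: `chunk_lines` and `write_chunks` are hypothetical names, and, unlike the loop above, an oversized line is emitted as its own chunk without printing a warning.

```python
def chunk_lines(lines, charlimit):
    """Yield strings made of whole lines, each at most charlimit characters.

    A single line longer than charlimit is yielded on its own.
    """
    buffer = []
    size = 0
    for line in lines:
        # Flush the buffer before the line that would push it over the limit.
        if buffer and size + len(line) > charlimit:
            yield ''.join(buffer)
            buffer = []
            size = 0
        buffer.append(line)
        size += len(line)
    if buffer:
        yield ''.join(buffer)


def write_chunks(in_path, charlimit):
    """Write each chunk to chunk.1.txt, chunk.2.txt, and so on."""
    with open(in_path) as anin:
        for count, chunk in enumerate(chunk_lines(anin, charlimit), 1):
            with open('chunk.' + str(count) + '.txt', 'w') as achunk:
                achunk.write(chunk)
```

Calling write_chunks('temp.txt', 100) would mirror the file names and limit used above. Keeping the grouping in a generator means it can be checked on a plain list of strings without touching the disk.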