2013-03-14
2

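How can I use text.split() and keep the blank (empty) lines?

I'm new to Python and need help with my program. I have code that takes an unformatted text document, applies some formatting (setting the page width and margins), and outputs a new text document. All of my code works fine except for this function, which produces the final output.

Here is the problematic code segment: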

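def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0

    for segment in margins:
        # Copy the lines before this segment through unchanged.
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''

        foundmargin = -1
        # Gather the segment's lines into one string; segment[2] is used as the margin.
        for i in range(segment[0], segment[1] + 1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')

        # split() with no arguments discards all whitespace, including blank lines.
        words = text.split()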

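Note: in case you are wondering about the ranges, segment[0] is the start of the document and segment[1] is the end of the document. My problem is that when I copy the text into words (words = text.split()), my blank lines are not kept. The output I should get is: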

 This is my substitute for pistol and ball. With a 
     philosophical flourish Cato throws himself upon his sword; I 
     quietly take to the ship. There is nothing surprising in 
     this. If they but knew it, almost all men in their degree, 
     some time or other, cherish very nearly the same feelings 
     towards the ocean with me. 

     There now is your insular city of the Manhattoes, belted 
     round by wharves as Indian isles by coral reefs--commerce 
     surrounds it with her surf. 

What my current output looks like:

 This is my substitute for pistol and ball. With a 
     philosophical flourish Cato throws himself upon his sword; I 
     quietly take to the ship. There is nothing surprising in 
     this. If they but knew it, almost all men in their degree, 
     some time or other, cherish very nearly the same feelings 
     towards the ocean with me. There now is your insular city of 
     the Manhattoes, belted round by wharves as Indian isles by 
     coral reefs--commerce surrounds it with her surf. 

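I know the problem happens when I copy the text into words, because that step does not keep the blank lines. How can I make sure it copies the blank lines as well as the words? Please let me know if I should add more code or more detail!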
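To make the cause concrete, here is a minimal sketch with a made-up sample string (not the asker's document). Calling split() with no arguments treats any run of whitespace, including a blank line, as a single separator, so the paragraph boundary leaves no trace in the result:

```python
# Hypothetical sample text: two paragraphs separated by one blank line.
text = "first paragraph\nstill first\n\nsecond paragraph"

# With no arguments, split() splits on any run of whitespace,
# so "\n\n" is swallowed just like a single space.
words = text.split()

# Splitting on the blank line first keeps the paragraph boundary.
paragraphs = text.split('\n\n')

print(words)
print(paragraphs)
```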

+0

You could try splitting into paragraphs first and then processing each paragraph: first 'text.split('\n\n')', then 'split()' on each paragraph. – dmg 2013-03-14 20:27:11

Answers

4

Split on runs of at least 2 newlines first, then split each piece into words:

import re 

paragraphs = re.split('\n\n+', text) 
words = [paragraph.split() for paragraph in paragraphs] 

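You now have a list of lists, one per paragraph. Process these paragraph by paragraph, after which you can rejoin the whole thing into new text with the double newlines inserted back in.

I used re.split() to support paragraphs that are delimited by more than 2 newlines. If there are only ever exactly 2 newlines between paragraphs, you can use a simple text.split('\n\n').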
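The split, process, and rejoin steps described in this answer could be sketched roughly as follows. The function name `reflow`, its parameters, and the use of the standard library's `textwrap.fill` for the wrapping step are assumptions for illustration; they stand in for whatever wrapping logic the asker's `process` function actually applies.

```python
import re
import textwrap


def reflow(text, pagewidth, margin):
    """Re-wrap each paragraph separately and keep one blank line between them."""
    # Two or more consecutive newlines mark a paragraph break.
    paragraphs = re.split(r'\n\n+', text.strip('\n'))
    indent = ' ' * margin
    wrapped = []
    for paragraph in paragraphs:
        words = paragraph.split()
        if not words:
            continue  # skip pieces that contain only whitespace
        # textwrap.fill counts the indent as part of the line width.
        wrapped.append(textwrap.fill(' '.join(words), width=pagewidth,
                                     initial_indent=indent,
                                     subsequent_indent=indent))
    # Put the paragraph breaks back as a double newline.
    return '\n\n'.join(wrapped)


sample = "a b\nc\n\n\nd e\n"
result = reflow(sample, pagewidth=20, margin=2)
print(result)
```

Because each paragraph is wrapped on its own, words from one paragraph can never flow into the next, which is the behaviour the question asks for.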

+0

'\n{2,}' is a nice notation for "2 or more newlines" that can easily be adjusted to 2, 3 or more, etc. – kindall 2013-03-14 20:58:52

+1

@kindall: I'm aware of the notation; in this case I chose the '\n\n+' version for symmetry with the 'text.split('\n\n')' alternative. – 2013-03-14 20:59:51

1

Use a regular expression to find the words and the blank lines, instead of splitting:

import re

m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)
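One possible way to consume the resulting token list is to treat every '\n\n' token as a paragraph marker. The sketch below is an illustration only: the helper name `paragraphs_from_tokens` is hypothetical, and it assumes paragraphs are separated by an even number of newlines, since the pattern matches newlines two at a time and a leftover single newline is simply skipped.

```python
import re

# Same idea as the pattern above: a token is either a word or a blank-line marker.
TOKEN = re.compile(r'\S+|\n\n')


def paragraphs_from_tokens(text):
    """Group words into paragraphs, starting a new one at each '\\n\\n' token."""
    paragraphs = [[]]
    for token in TOKEN.findall(text):
        if token == '\n\n':
            # Start a new paragraph, but never stack up empty ones.
            if paragraphs[-1]:
                paragraphs.append([])
        else:
            paragraphs[-1].append(token)
    # Drop a trailing empty paragraph left by a final blank line.
    if not paragraphs[-1]:
        paragraphs.pop()
    return paragraphs


groups = paragraphs_from_tokens("a b\n\nc d")
print(groups)
```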