2016-12-27 68 views
-1

Python regex lookahead

I have a text that looks like this:

TTL1 | TTL2 | TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
TTL1 | TTL2 | 
TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
some text in a line4 
some text in a line5 
TTL1 | TTL2 | TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
some text in a line4 
... 

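Explanation: I have a header row that is sometimes split across multiple lines, followed by many other lines. I want to capture all the header fields (even when they are on different lines), and also capture all the lines after the header in a single group.

I am having trouble with the multi-line headers and the multi-line content, and I don't know how to extract them with a regex in Python.

Any ideas, please?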

+0

Try pandas. http://pandas.pydata.org/ –

+0

@harperkoo I know pandas, but how would I use it here? I want to use 'findall' to put the data in a list and then use pandas. The problem is getting the data. – TheDaJon

+0

I think you want to filter out all the header rows and keep the rest of the data, right? Try '''df[df.TTL1 != "TTL1"]''' –

Answers

1

You can try this:

\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\n([^\|]*)(?:\n|$) 

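As per the OP's comment, the other lines can also contain |, which makes it hard to distinguish between headers and ordinary lines, so the following solution can be tried instead: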

^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*?)(?=^\s*\w+\s*\n*\|\s*\n*\w+\s*\n*\|\s*\n*\w+\s*\n*)|^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*)$ 
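The second pattern only behaves as intended if `^` and `$` match at line boundaries and `.` matches newlines, so it presumably needs the `re.MULTILINE` and `re.DOTALL` flags. The answer does not state them, so that is an assumption. The sketch below applies the pattern under that assumption. The first alternative fills groups 1-4 and the second fills groups 5-8, so the code picks whichever set matched. The helper name `parse_sections` and the sample text are made up for illustration.

```python
import re

# The pattern from the answer, split over several string literals for readability.
# Assumption: it is meant to be used with re.MULTILINE and re.DOTALL.
PATTERN = re.compile(
    r"^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*?)"
    r"(?=^\s*\w+\s*\n*\|\s*\n*\w+\s*\n*\|\s*\n*\w+\s*\n*)"
    r"|^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*)$",
    re.MULTILINE | re.DOTALL,
)


def parse_sections(text):
    """Return a list of (header_fields, body_lines) tuples (hypothetical helper)."""
    sections = []
    for match in PATTERN.finditer(text):
        groups = match.groups()
        # Groups 1-4 belong to the first alternative (a later header follows),
        # groups 5-8 to the second alternative (the last section in the text).
        fields = groups[:4] if groups[0] is not None else groups[4:]
        sections.append((list(fields[:3]), fields[3].splitlines()))
    return sections


# Illustrative input: body lines may contain '|', one header is split over two lines.
sample = (
    "TTL1 | TTL2 | TTL3\n"
    "some text | in a line1\n"
    "some text in a line2\n"
    "TTL1 | TTL2 | \n"
    "TTL3\n"
    "some text in a line1\n"
    "TTL1 | TTL2 | TTL3\n"
    "last | line"
)

for header, lines in parse_sections(sample):
    print(header, lines)
```

A body line that itself has the shape `word | word | word` would still be mistaken for a header, so this is a heuristic rather than a general solution.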

Sample code:

import re 

regex = r"\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\n([^\|]*)(?:\n|$)" 

test_str = ("TTL1 | TTL2 | TTL3\n" 
    "some text in a line1\n" 
    "some text in a line2\n" 
    "some text in a line3\n" 
    "TTL1 | TTL2 | \n" 
    "TTL3\n" 
    "some text in a line1\n" 
    "some text in a line2\n" 
    "some text in a line3\n" 
    "some text in a line4\n" 
    "some text in a line5\n" 
    "TTL1 | TTL2 | TTL3\n" 
    "some text in a line1\n" 
    "some text in a line2\n" 
    "some text in a line3\n" 
    "some text in a line4") 

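# The pattern contains no ".", so the re.DOTALL flag is not needed.
matches = re.finditer(regex, test_str)

# Groups 1-3 hold the three header fields; group 4 holds the lines under that header.
for match in matches:
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
    print(match.group(4))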

Sample output:

TTL1 
TTL2 
TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
TTL1 
TTL2 
TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
some text in a line4 
some text in a line5 
TTL1 
TTL2 
TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
some text in a line4 
+0

This is almost what I want, but my lines can contain the '|' sign, and this regex excludes it. How can I allow the '|' sign? – TheDaJon

+0

You mean group 4 can contain the | sign as well? –

+0

Yes, it may contain it too. It can contain any character. – TheDaJon

0

Use the following approach with the re.findall() function:

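import re

# lines.txt is a file containing the initial text from your question
with open('lines.txt', 'r') as fh:
    t = fh.read()

# Group 1: a run of header characters (uppercase letters, digits, whitespace, '|').
# Group 2: everything up to the next uppercase letter, i.e. the lines under that header.
items = re.findall(r'([A-Z\d\s|]+)([^A-Z]+)', t)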

# 'h' contains header, 'lines' contains the lines related to current header 
for h, lines in items: 
    print(h.replace('\n', ' '), lines, sep='\n') 

Output:

TTL1 | TTL2 | TTL3 
some text in a line1 
some text in a line2 
some text in a line3 

TTL1 | TTL2 | TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
some text in a line4 
some text in a line5 

TTL1 | TTL2 | TTL3 
some text in a line1 
some text in a line2 
some text in a line3 
some text in a line4
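
If the goal is to get the data into a list before handing it to another tool, the same `re.findall()` pattern can feed a small helper that splits each header into its fields and each body into its lines. This is only a sketch: the function name `split_sections` and the sample text are hypothetical, and it inherits the pattern's assumption that header fields consist of uppercase letters and digits while body lines contain no uppercase letters at all.

```python
import re


def split_sections(text):
    """Return a list of (header_fields, body_lines) tuples (hypothetical helper)."""
    sections = []
    # Same pattern as in the answer above: header characters, then everything
    # up to the next uppercase letter.
    for header, body in re.findall(r'([A-Z\d\s|]+)([^A-Z]+)', text):
        # A header split over several lines is handled by strip(), which also
        # removes the newline left inside the last field.
        fields = [field.strip() for field in header.split('|')]
        lines = [line.strip() for line in body.splitlines() if line.strip()]
        sections.append((fields, lines))
    return sections


# Illustrative input with one header split over two lines.
sample = (
    "TTL1 | TTL2 | TTL3\n"
    "some text in a line1\n"
    "TTL1 | TTL2 | \n"
    "TTL3\n"
    "some text in a line2\n"
    "some text in a line3"
)

for fields, lines in split_sections(sample):
    print(fields, lines)
```

Because the body group is `[^A-Z]+`, any uppercase letter inside a body line would cut the section short, so this only fits data that matches the sample in the question.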