2016-04-28 89 views
1

Parsing text with Python regex re.findall

I have a very long string that I need to parse into groups, but I need more control over how it is parsed.

import re 

RAW_Data = "Name Multiple Words Testing With 1234 Numbers and this stuff* ((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2* ((Bla Bla Bla (Bla Bla) A42 & A43)) Name Multiple Words Testing With 78910 Numbers and this stuff3* ((Bla Bla Bla (Bla Bla) A44 & A45)) Name Multiple Words Testing With 1234 Numbers and this stuff4* ((Bla Bla Bla (Bla Bla) A46 & A47)) Name Multiple Words Testing With 1234 Numbers and this stuff5* ((Bla Bla Bla (Bla Bla) A48 & A49)) Name Multiple Words Testing With 1234 Numbers and this stuff6* ((Bla Bla Bla (Bla Bla) A50 & A51)) Name Multiple Words Testing With 1234 Numbers and this stuff7* ((Bla Bla Bla (Bla Bla) A52 & A53)) Name Multiple Words Testing With 1234 Numbers and this stuff8* ((Bla Bla Bla (Bla Bla) A54 & A55)) Name Multiple Words Testing With 1234 Numbers and this stuff9* ((Bla Bla Bla (Bla Bla) A56 & A57)) Name Multiple Words Testing With 1234 Numbers and this stuff10* ((Bla Bla Bla (Bla Bla) A58 & A59)) Name Multiple Words Testing With 1234 Numbers and this stuff11* ((Bla Bla Bla (Bla Bla) A60 & A61)) Name Multiple Words Testing With 1234 Numbers and this stuff12* ((Bla Bla Bla (Bla Bla) A62 & A63)) Name Multiple Words Testing With 1234 Numbers and this stuff13* ((Bla Bla Bla (Bla Bla) A64 & A65)) Name Multiple Words Testing With 1234 Numbers and this stuff14* ((Bla Bla Bla (Bla Bla) A66 & A67)) Name Multiple Words Testing With 1234 Numbers and this stuff15* ((Bla Bla Bla (Bla Bla) A68 & A69)) Name Multiple Words Testing With 1234 Numbers and this stuff16*" 

fromnode = re.findall(r'(.*?)(?=\*\s)', RAW_Data) 

print(fromnode) 

del fromnode 
del RAW_Data 

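The result is: 'Name Multiple Words Testing With 1234 Numbers and this stuff', '', '((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2' ... and so on.

I can't seem to capture only strings like "Name Multiple Words Testing With 3456 Numbers and this stuff2" while omitting strings like "((Bla Bla Bla (Bla Bla) A40 & A41))". Any help would be greatly appreciated.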


Is the 'Bla ...' stuff always inside parentheses, and are the 'Name Mul...' words always the same? – schwobaseggl


Do you only want the stuff outside the parentheses? – Laurel


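Yes, the Bla Bla Bla stuff is always enclosed in double parentheses, and there is also a set of single parentheses inside. I use another re.findall('\(\((.*?)\)', RAW_Data) to capture those parts, and now I want to ignore them. Likewise, although I just threw some text in here, the real names contain multiple words, spaces, and numbers, so this needs to be something of a catch-all. – user1457123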

Answers

4

You can split with the

r'\*\s*\({2}.*?\){2}\s*' 

pattern (see demo), which matches:

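  • \* - a literal asterisk
  • \s* - zero or more whitespace characters
  • \({2} - exactly 2 opening parentheses
  • .*? - zero or more characters other than a newline, as few as possible, up to the first occurrence of the next subpattern (note: add the re.S flag if you need to match across several lines)
  • \){2} - exactly 2 closing parentheses
  • \s* - zero or more whitespace characters.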
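As a sketch of the breakdown above, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. The `sample` string here is a shortened, made-up stand-in for the question's data, not the real input.

```python
import re

# The split pattern from above, spelled out piece by piece.
pattern = re.compile(r"""
    \*        # a literal asterisk
    \s*       # zero or more whitespace characters
    \({2}     # exactly two opening parentheses
    .*?       # as few characters as possible (no newlines unless re.S is set)
    \){2}     # exactly two closing parentheses
    \s*       # zero or more whitespace characters
""", re.VERBOSE)

# Hypothetical, shortened sample in the same shape as the question's string.
sample = ("Name one stuff* ((Bla (Bla) A40 & A41)) "
          "Name two stuff2* ((Bla (Bla) A42 & A43)) "
          "Name three stuff3*")

first = pattern.search(sample)
print(repr(first.group(0)))   # the first delimiter that split() would remove
print(pattern.split(sample))  # the pieces left between delimiters
```

The single `)` after the inner `(Bla` is skipped because the lazy `.*?` only stops where two closing parentheses occur in a row.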

Alternatively, here is the same regex, but unrolled (and thus a bit more efficient):

\*\s*\({2}[^)]*(?:\)(?!\))[^)]*)*\){2}\s* 

IDEONE demo

import re 
p = re.compile(r'\*\s*\({2}.*?\){2}\s*') 
test_str = "Name Multiple Words Testing With 1234 Numbers and this stuff* ((Bla Bla Bla (Bla Bla) A40 & A41)) Name Multiple Words Testing With 3456 Numbers and this stuff2* ((Bla Bla Bla (Bla Bla) A42 & A43)) Name Multiple Words Testing With 78910 Numbers and this stuff3* ((Bla Bla Bla (Bla Bla) A44 & A45)) Name Multiple Words Testing With 1234 Numbers and this stuff4* ((Bla Bla Bla (Bla Bla) A46 & A47)) Name Multiple Words Testing With 1234 Numbers and this stuff5* ((Bla Bla Bla (Bla Bla) A48 & A49)) Name Multiple Words Testing With 1234 Numbers and this stuff6* ((Bla Bla Bla (Bla Bla) A50 & A51)) Name Multiple Words Testing With 1234 Numbers and this stuff7* ((Bla Bla Bla (Bla Bla) A52 & A53)) Name Multiple Words Testing With 1234 Numbers and this stuff8* ((Bla Bla Bla (Bla Bla) A54 & A55)) Name Multiple Words Testing With 1234 Numbers and this stuff9* ((Bla Bla Bla (Bla Bla) A56 & A57)) Name Multiple Words Testing With 1234 Numbers and this stuff10* ((Bla Bla Bla (Bla Bla) A58 & A59)) Name Multiple Words Testing With 1234 Numbers and this stuff11* ((Bla Bla Bla (Bla Bla) A60 & A61)) Name Multiple Words Testing With 1234 Numbers and this stuff12* ((Bla Bla Bla (Bla Bla) A62 & A63)) Name Multiple Words Testing With 1234 Numbers and this stuff13* ((Bla Bla Bla (Bla Bla) A64 & A65)) Name Multiple Words Testing With 1234 Numbers and this stuff14* ((Bla Bla Bla (Bla Bla) A66 & A67)) Name Multiple Words Testing With 1234 Numbers and this stuff15* ((Bla Bla Bla (Bla Bla) A68 & A69)) Name Multiple Words Testing With 1234 Numbers and this stuff16*" 
print(re.split(p, test_str)) 
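A quick way to check that the lazy and the unrolled patterns behave the same is to split a small sample with both. This is only a sketch: `sample` is a shortened, hypothetical stand-in for the question's string, and the final clean-up step is my own assumption about what the asker wants, since the last record still ends with `*` when no `((...))` block follows it.

```python
import re

lazy = re.compile(r'\*\s*\({2}.*?\){2}\s*')
unrolled = re.compile(r'\*\s*\({2}[^)]*(?:\)(?!\))[^)]*)*\){2}\s*')

# Hypothetical, shortened sample in the same shape as the question's string.
sample = ("Name one stuff* ((Bla (Bla) A40 & A41)) "
          "Name two stuff2* ((Bla (Bla) A42 & A43)) "
          "Name three stuff3*")

parts_lazy = lazy.split(sample)
parts_unrolled = unrolled.split(sample)
print(parts_lazy == parts_unrolled)  # both patterns should give the same pieces

# Assumed clean-up: drop the dangling '*' on the last record and skip empty pieces.
names = [part.rstrip('*').strip() for part in parts_lazy if part.strip()]
print(names)
```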

UPDATE

Here is a regex you can use with re.findall:

(?:\*\s*\(\([^)]*(?:\)(?!\))[^)]*)*\)\))?\s*([^*]*(?:\*(?!\s*\(\()[^*]*)*)\s* 

regex demo

Scared of how it looks? It is just an unrolled version of the much simpler (?:\*\s*\(\(.*?\)\))?\s*(.*?(?=\*\s*(?:\(\(|$)))

See the IDEONE demo
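For reference, a minimal sketch of calling re.findall with the unrolled pattern. The `sample` string is a shortened, hypothetical stand-in for the real input. Because every part of the pattern is optional, it can also match an empty string, so the sketch filters out empty captures and trims the `*` left on the final record; that filtering is an assumption on my part, not something the pattern does by itself.

```python
import re

pattern = re.compile(
    r'(?:\*\s*\(\([^)]*(?:\)(?!\))[^)]*)*\)\))?'  # optional "* ((...))" block to skip
    r'\s*'
    r'([^*]*(?:\*(?!\s*\(\()[^*]*)*)'             # group 1: the text we want to keep
    r'\s*'
)

# Hypothetical, shortened sample in the same shape as the question's string.
sample = ("Name one stuff* ((Bla (Bla) A40 & A41)) "
          "Name two stuff2* ((Bla (Bla) A42 & A43)) "
          "Name three stuff3*")

# With a single capturing group, findall returns a list of group-1 strings.
matches = pattern.findall(sample)

# Assumed clean-up: skip empty captures and drop the dangling '*' on the last record.
cleaned = [m.rstrip('*').strip() for m in matches if m.strip()]
print(cleaned)
```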


I added a pattern to use with 're.findall'. I think the question is answered now. –


Thanks again, Wiktor. I went ahead and ran the re.findall suggestion, and it works too. The only thing I noticed is that it ends up capturing an empty string. – user1457123


Please share the input string and I will take a look. –