2016-12-02

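Regex combined with a list of numbers written as words

I want to extract information about wounded people from several articles. The problem is that news language conveys this information in different ways, because the count can be written in digits or in words.

For example: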

`Security forces had *wounded two* gunmen inside the museum but that two or three accomplices might still be at large.` 

`The suicide bomber has wounded *four men* last night.` 

`*Dozens* were wounded in a terrorist attack.` 

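I noticed that, most of the time, the numbers 1-10 are written as words rather than digits. I would like to know how to extract them without producing convoluted code, just by listing the words for 1-10 in the regular expression.

Should I use a list? How would it be included in the pattern?

This is the pattern I have used so far to extract the number of wounded people when it is written in digits: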

import re

# Read the whole article file; the context manager closes it afterwards
with open("News") as text_open:
    text_read = text_open.read()
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) injured|(\d+) people were wounded|wounding (\d+)|wounding at least (\d+)"
result = re.findall(pattern, text_read)
print(result)

Answers

Try this:

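import re

# Alternative 1: a word followed by "were" (lookahead), e.g. "Dozens were".
# Alternative 2: a word of 3+ characters right after "wounded" or "injured" (lookbehind).
regex = r"\w+\s(?=were)|(?<=wounded|injured)\s\w{3,}"

test_str = ("`Security forces had wounded two gunmen inside the museum but that two or three accomplices might still be at large.`\n\n"
    "`The suicide bomber has wounded four men last night.`\n\n"
    "`Dozens were wounded in a terrorist attack.`")

matches = re.finditer(regex, test_str)

for match in matches:
    # strip() removes the whitespace that is part of each match
    print(match.group().strip())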

Output:

two 
four 
Dozens 

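`\w+\s(?=were)`: `?=` is a lookahead for `were`; the word in front of it is matched with `\w+`.

`|`: or

`(?<=wounded|injured)\s\w{3,}`: `?<=` is a lookbehind that requires `wounded` or `injured` to occur immediately before the match, and `{3,}` means the word must be 3 or more characters long, just to avoid capturing short words such as `in`. Every number word has a minimum length of 3, so this restriction is safe to use. Python's `re` module only accepts fixed-width lookbehinds; this one works because `wounded` and `injured` are both 7 characters long.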
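The pattern above accepts any word of three or more letters, so a phrase such as `wounded the` would also match. As for the list the question asks about, one possible sketch is to join the number words with `|` and drop the result into the pattern wherever a count may appear. The names `NUMBER_WORDS`, `COUNT`, `PATTERN` and `extract_counts` are made up for this illustration, and the word list and the verb forms covered are assumptions that would need extending for real news text.

```python
import re

# Number words 1-10 plus a few vague quantities (an assumed list; extend as needed).
NUMBER_WORDS = ["one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "dozens", "hundreds"]

# A count is either a run of digits or one of the listed words.
COUNT = r"(?:\d+|" + "|".join(NUMBER_WORDS) + r")"

# Two shapes: "<verb> [at least] <count>" and "<count> [word] [were] <verb>".
PATTERN = re.compile(
    rf"\b(?:wounded|wounding|injured|injuring)\s+(?:at\s+least\s+)?(?P<after>{COUNT})\b"
    rf"|\b(?P<before>{COUNT})\s+(?:\w+\s+)?(?:were\s+)?(?:wounded|injured)\b",
    re.IGNORECASE,
)

def extract_counts(text):
    """Return each count found next to a wounded/injured verb, as written in the text."""
    return [m.group("after") or m.group("before") for m in PATTERN.finditer(text)]

sample = ("Security forces had wounded two gunmen inside the museum. "
          "The suicide bomber has wounded four men last night. "
          "Dozens were wounded in a terrorist attack.")
print(extract_counts(sample))  # → ['two', 'four', 'Dozens']
```

Because digits and words share the same `COUNT` fragment, the six alternatives from the question collapse into two, and restricting the words to a list avoids matching arbitrary words after `wounded`.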