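Using regex to find all noun phrases in a paragraph after a specific phrase occurs

I have a data frame of paragraphs, which I have split (*or can split) into word tokens and sentence tokens, and I am looking to find all noun phrases that follow any occurrence of the phrase "contribution" or "donate to".
Or really any form of those, so: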
"Contributions are welcome to be made to the charity of your choice."
---> would return: "the charity of your choice"
and
"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"
---> would return: "ABC Foundation"
I have created a regex workaround that grabs the correct phrase about 90% of the time; see below:
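    import nltk
    from nltk.text import TokenSearcher

    def find_donation_targets(x):  # wrapper name assumed; the original snippet shows only the body
        text = nltk.Text(nltk.word_tokenize(x))
        donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
        donation = [' '.join(tokens) for tokens in donation]
        return donation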
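I would like to clean up the regex to get rid of the "{,15}" requirement, since it is missing some of the values I need. However, I am not very polished with "greedy" expressions and cannot get it to work properly.
So this passage: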
While she lived a full life , had many achievements and made many
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName
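would return: "visit brother FirstName LastName" because of the earlier mention of contributions, even though the word "to" comes more than 15 words later.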
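"even though the word 'to' comes more than 15 words later." Well, that is exactly what the ".*" does: it explicitly matches any number of characters. – Draco18s
I thought the "{,15}" after it would limit it to at most 15 words.
That occurs before the "contrib.*" match. – Draco18s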
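
The point made in the comments can be checked with a small sketch. It assumes the text is already tokenized and separated by spaces, and it uses a hypothetical helper, token_findall, that approximates with plain `re` how TokenSearcher.findall rewrites a token pattern, so that it runs without NLTK data. BOUNDED is only one possible variant, not a confirmed fix: it moves the "{,15}" so that it sits between the keyword and "to". The leading and trailing "." tokens in the sample are added here so that the original pattern has something to anchor on.

```python
import re

def token_findall(tokens, pattern):
    """Hypothetical stand-in that approximates TokenSearcher.findall using `re`."""
    raw = "".join("<" + tok + ">" for tok in tokens)
    pattern = re.sub(r"\s", "", pattern)             # whitespace in the pattern is ignored
    pattern = re.sub(r"<", "(?:<(?:", pattern)       # each <...> becomes one token group
    pattern = re.sub(r">", ")>)", pattern)
    pattern = re.sub(r"(?<!\\)\.", "[^>]", pattern)  # an unescaped '.' cannot cross a token
    hits = re.findall(pattern, raw)
    return [" ".join(hit[1:-1].split("><")) for hit in hits]

# Pattern from the question: {,15} limits the tokens BEFORE the keyword only.
ORIGINAL = r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;>"
# Assumed variant: at most 15 tokens between the keyword and "to".
BOUNDED = r"<donat.*|contrib.*> <.*>{,15}? <to> (<.*>+?) <\.|\,|\;>"

obituary = (
    ". While she lived a full life , had many achievements and made many "
    "contributions , FirstName is remembered by most for her cheerful smile , "
    "colorful track suits , and beautiful necklaces hand made by daughter FirstName . "
    "FirstName always cherished her annual visit home for Thanksgiving to visit "
    "brother FirstName LastName ."
).split()
appeal = ("blah blah blah donations , in honor of Firstname Lastname , "
          "can be made to ABC Foundation .").split()

original_hits = token_findall(obituary, ORIGINAL)
bounded_hits = token_findall(obituary, BOUNDED)
appeal_hits = token_findall(appeal, BOUNDED)
print(original_hits, bounded_hits, appeal_hits)
```

With the bound placed after the keyword, the unrelated "to visit brother ..." is no longer captured, while the "can be made to ABC Foundation" case still is. Note that both patterns are case-sensitive, so a sentence starting with a capitalized "Contributions" would need separate handling.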