Python regular expression to find words

Here is what I have so far:

import re

text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}\s{1,}\w{3,}', text)
#['Hello world', 'nice day', 'you think']

The desired output would be ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']

Can this be done with a simple regular expression pattern?


2010-10-26 tomfmason

What are you trying to achieve? – helpermethod 2010-10-26 22:18:16

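I want to group all pairs (2 in this case) of adjacent words that are 3 or more characters long, as in the desired output in the example above – tomfmason 2010-10-26 22:32:56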

import itertools as it 
import re 

three_pat=re.compile(r'\w{3}') 
text = "Hello world. It is a nice day today. Don't you think so?" 
for key, group in it.groupby(text.split(), lambda x: bool(three_pat.match(x))):
    if key:
        group = list(group)
        for i in range(0, len(group) - 1):
            print(' '.join(group[i:i+2]))

# Hello world. 
# nice day 
# day today. 
# today. Don't 
# Don't you 
# you think

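It is not clear to me what you want done with the punctuation. On the one hand, it looks like you want periods removed, but single quotes kept. Removing the periods would be easy to implement, but before that, could you clarify what should happen to all the punctuation?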
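If the intent is what the desired output in the question suggests (periods and question marks dropped, apostrophes kept), one possible sketch strips punctuation from both ends of each token before pairing. The helper name `word_pairs`, the `min_len` parameter, and the use of `string.punctuation` are assumptions for illustration, not something the asker confirmed.

```python
import itertools as it
import string

def word_pairs(text, min_len=3):
    """Return adjacent pairs of words that are both at least min_len long."""
    # Strip punctuation from both ends of each token; inner apostrophes stay.
    words = [w.strip(string.punctuation) for w in text.split()]
    pairs = []
    # Group consecutive words by whether they are long enough.
    for is_long, group in it.groupby(words, lambda w: len(w) >= min_len):
        if is_long:
            group = list(group)
            # zip pairs each word with its right-hand neighbour in the run.
            pairs.extend(" ".join(p) for p in zip(group, group[1:]))
    return pairs

text = "Hello world. It is a nice day today. Don't you think so?"
print(word_pairs(text))
```

Note that this pairs words across sentence boundaries ("today Don't"), as the desired output in the question does.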


2010-10-26 22:54:25 unutbu

-1

This is a good example of when not to use regular expressions for parsing.


2010-10-26 22:13:59 anthony

This is a good example of when not to post an answer. – SilentGhost 2010-10-26 22:17:15

OK, is there a simple alternative? – tomfmason 2010-10-26 22:30:33

list(map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)))

You can probably rewrite the lambda to be shorter (e.g. as just "+"). And by the way, ' is not matched by \w or by \s.
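To see why the lambda concatenates x[0] and x[1]: when a pattern has two capturing groups, re.findall returns one tuple per match. Here the outer group holds the first word and the inner group, inside the lookahead, holds the whitespace plus the second word. A small sketch (the variable names `raw` and `pairs` are only illustrative):

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"
pattern = r"(\w{3,}(?=(\s{1,}\w{3,})))"

# One (first_word, whitespace_plus_second_word) tuple per match.
raw = re.findall(pattern, text)
print(raw)

# Concatenating the two parts rebuilds each pair.
pairs = [first + rest for first, rest in raw]
print(pairs)
```

Because the apostrophe and the period are not matched by \w or \s, the pairs around "today." and "Don't" are missing from the result.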


2010-10-26 22:40:28 Lachezar

OK, the super way: map("".join, re.findall('(\w{3,}(?=(\s{1,}\w{3,})))', text)) – Lachezar 2010-10-26 22:46:42

Nice, but your example confirms to me that regular expressions will make your Python look like Perl. – pyfunc 2010-10-26 22:51:26

Yes, everything that uses regexps looks "very much like" Perl, because Perl is the basis of today's regular expressions: PCRE (Perl Compatible Reg Exp) - http://en.wikipedia.org/wiki/Regular_expression – Lachezar 2010-10-26 22:57:42

Something like this, with an additional check for list boundaries, should do it:

>>> text = "Hello world. It is a nice day today. Don't you think so?" 
>>> k = text.split() 
>>> k 
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?'] 
>>> z = [x for x in k if len(x) > 2] 
>>> z 
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?'] 

>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)] 
['Hello world.', 'nice day', "today. Don't", 'you think'] 
>>>


2010-10-26 22:42:30 pyfunc

Sometimes regular expressions are more trouble than they are worth. +1 – jkerian 2010-10-26 22:54:01

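There are two problems with your approach:

Neither \w nor \s matches punctuation.
When you use findall to match a regex against a string, the matched part of the string is consumed. The search for the next match starts immediately after the end of the previous match. Because of this, a word cannot be included in two separate matches.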
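Both problems are easy to reproduce on small inputs. A minimal sketch (the sample strings and variable names are made up for illustration):

```python
import re

# Problem 1: \w stops at punctuation, so "Don't" is split at the apostrophe
# and the period after "today" is not part of any word.
tokens = re.findall(r"\w+", "today. Don't")
print(tokens)  # ['today', 'Don', 't']

# Problem 2: "day" is consumed by the first match, so it cannot also
# start a second match "day today".
consumed = re.findall(r"\w{3,}\s+\w{3,}", "nice day today")
print(consumed)  # ['nice day']
```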

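To solve the first problem, you need to decide what counts as a word. Regular expressions are not well suited to this kind of parsing. You may want to look at a natural language parsing library.

However, assuming you can come up with a regex that fits your needs, you can solve the second problem by using a lookahead assertion to check for the second word. This will not return the whole match, but you can at least use this method to find the first word of each word pair.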

re.findall(r'\w{3,}(?=\s{1,}\w{3,})', text)
                   ^^^            ^
                   lookahead assertion
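As a rough sketch of how this could be pushed further, the whole pair can be captured inside the lookahead, so that nothing is consumed and pairs may overlap. The word definition used here (a run of \w characters or apostrophes, with any punctuation directly after the first word skipped) is an assumption chosen to reproduce the desired output from the question, and the names `WORD` and `pair_pat` are illustrative.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# The plain lookahead from the answer: only the first word of each pair.
first_words = re.findall(r"\w{3,}(?=\s{1,}\w{3,})", text)
print(first_words)

# Assumption: a "word" is 3+ characters from \w or apostrophe, and any
# punctuation between the first word and the whitespace is skipped.
WORD = r"[\w']{3,}"
# The lookbehind keeps a match from starting in the middle of a word;
# both words are captured inside the lookahead, so matches can overlap.
pair_pat = re.compile(rf"(?<![\w'])(?=({WORD})[^\w\s]*\s+({WORD}))")
pairs = [" ".join(m) for m in pair_pat.findall(text)]
print(pairs)
```

Whether skipping punctuation across a sentence boundary is acceptable depends on how "word pair" is defined, which is exactly the first problem described above.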


2010-10-26 22:42:39
