2013-02-27 109 views
1

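Substring search for multi-word strings - Python

I want to check a set of sentences and see whether certain seed words appear in them. But I want to avoid using for seed in line, because that would report the seed word 'ring' as appearing in the word 'bring'.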

I also want to check whether multi-word expressions (MWEs) such as 'word with spaces' appear in the documents.

I tried this, but it is super slow. Is there a faster way to do it?

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

docs_seed = []
for d in docs:
    toAdd = False
    for s in seed:
        # Multi-word seeds: plain substring test.
        if " " in s:
            if s in d:
                toAdd = True
        # Single-word seeds: compare against the space-separated tokens.
        if s in d.split(" "):
            toAdd = True
        if toAdd == True:
            docs_seed.append((s, d))
            break
print(docs_seed)

The desired output should look like this:

[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]
+2

The second line of the output doesn't make sense. 'then a bar bar black sheep' has no 'foo' in it – 2013-02-27 07:59:28

+0

Thanks, I had missed that typo – alvas 2013-02-27 08:05:49

Answers

3

Consider using a regular expression:

import re 

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b') 
pattern.findall(line) 

\b matches at the start or end of a "word" (a sequence of word characters).
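A quick check of what `\b` does with the `ring`/`bring` case from the question. This is only a small sketch; the second sample string is invented for illustration.

```python
import re

ring = re.compile(r'\bring\b')

# Inside 'bring' there is no word boundary before the 'r', so nothing matches.
inside_word = ring.search('i forgot to bring my telephone')

# As a standalone word it matches, even with punctuation right after it.
standalone = ring.search('she lost her ring, sadly')

print(inside_word)
print(standalone.group(0))
```

Because a boundary is defined by word characters rather than by spaces, punctuation and string ends are handled without extra code.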

Example:

>>> for line in docs: 
...     print(pattern.findall(line))
... 
['words with spaces', 'bar'] 
['foo', 'bar'] 
['bar', 'bar'] 
[] 
[] 
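To get from `findall` to the `(seed, doc)` pairs the question asks for, one option is to keep only the first match per document with `pattern.search`. A minimal sketch, assuming the `seed` and `docs` lists from the question; the helper name `seeded_docs` is made up for illustration. It reports the leftmost match in each document, which is not necessarily the first matching entry in `seed`.

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']

docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]


def seeded_docs(seed, docs):
    """Return (matched_seed, doc) for each doc that contains a seed as whole words."""
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
    pairs = []
    for d in docs:
        m = pattern.search(d)  # leftmost match only; None if no seed occurs
        if m:
            pairs.append((m.group(0), d))
    return pairs


docs_seed = seeded_docs(seed, docs)
print(docs_seed)
```

Alternatives are tried in list order at each position, so a seed such as 'bar bar' can never win over 'bar' listed before it. If the longest expression should be reported, sorting `seed` by length in descending order before joining is one way to get that.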
0

This should work, and be somewhat faster than your current approach:

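docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        end = pos + len(s)
        # Require a space, or the edge of the string, on both sides of the match.
        # Only the first occurrence of s in d is examined.
        if (pos != -1
                and (pos == 0 or d[pos - 1] == " ")
                and (end == len(d) or d[end] == " ")):
            docs_seed.append((s, d))
            break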

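find gives us the position of the seed value in the doc (or -1 if it is not found); we then check whether the characters before and after the value are spaces (or whether the string ends right after the substring). This also fixes a bug in the original code, where multi-word expressions did not need to start or end on a word boundary: for input like "swords with spaces", your original code would match "words with spaces".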
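One caveat with a single `find` call is that it only inspects the first occurrence: for `ring` in `bring the ring`, the check fails at `bring` and the later standalone `ring` is never looked at. Below is a sketch of one way around that, scanning every occurrence. `find_whole` is a hypothetical helper name, and boundaries are still spaces or string edges only, as in the answer.

```python
def find_whole(s, d):
    """Return the index of the first space-delimited occurrence of s in d, or -1."""
    pos = d.find(s)
    while pos != -1:
        end = pos + len(s)
        starts_ok = pos == 0 or d[pos - 1] == " "
        ends_ok = end == len(d) or d[end] == " "
        if starts_ok and ends_ok:
            return pos
        # This occurrence is not on a boundary: search again just past its start.
        pos = d.find(s, pos + 1)
    return -1


print(find_whole('ring', 'bring the ring'))
print(find_whole('words with spaces', 'swords with spaces'))
```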

+0

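Sometimes a regular expression *is* worth the trouble. Using '\b' to detect word breaks replaces several lines of code, and also handles many other word-boundary cases. (What if the input string has tabs? Or punctuation? "I don't like words with spaces, but..." would not match with this code.) – PaulMcG 2013-02-27 08:30:54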
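The difference this comment describes shows up when both checks are run on the same inputs. A rough sketch: `space_delimited` is a hypothetical helper that mirrors the space-based check (with guards for the string edges added), `word_bounded` is the regex equivalent, and the sample sentences are invented.

```python
import re


def space_delimited(s, d):
    """True if the first occurrence of s in d is bounded by spaces or string edges."""
    pos = d.find(s)
    if pos == -1:
        return False
    end = pos + len(s)
    return (pos == 0 or d[pos - 1] == " ") and (end == len(d) or d[end] == " ")


def word_bounded(s, d):
    """True if s occurs in d between regex word boundaries."""
    return re.search(r'\b' + re.escape(s) + r'\b', d) is not None


s = 'words with spaces'
samples = [
    "i dont like words with spaces, but here we are",  # comma after the match
    "tab before\twords with spaces here",              # tab before the match
    "plain words with spaces here",                    # spaces on both sides
]

# Each pair is (space-based result, regex result) for one sample.
results = [(space_delimited(s, d), word_bounded(s, d)) for d in samples]
print(results)
```

Only the last sample passes the space-based check, while the regex accepts all three.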