How to use re.sub to add tags around certain strings in Python?

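I am trying to add tags for some given query strings, and the tags should wrap around all the matching strings. For example, I want to wrap tags around all the words in the sentence I love downloading iPhone games from my mac. that match the query iphone games mac, so that the result is I love downloading <em>iPhone games</em> from my <em>mac</em>.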

Currently, I have tried

import re
sentence = "I love downloading iPhone games from my mac."
query = r'((iphone|games|mac)\s*)+'
regex = re.compile(query, re.I)
sentence = regex.sub(r'<em>\1</em> ', sentence)

The resulting sentence is

I love downloading <em>games </em> from my <em>mac</em> .

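Here \1 is replaced by only one word (games instead of iPhone games), and there are some unnecessary spaces after the words. How can I write the regular expression to get the desired output? Thanks!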
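The behaviour described above comes from how repeated capturing groups work: when a group is repeated with +, Python's re module keeps only the text from the group's last iteration. A minimal sketch reusing the pattern from the question (the variable name m is just for illustration):

```python
import re

sentence = "I love downloading iPhone games from my mac."
regex = re.compile(r'((iphone|games|mac)\s*)+', re.I)

m = regex.search(sentence)
# The whole match spans both words plus the trailing space...
print(repr(m.group(0)))  # 'iPhone games '
# ...but the repeated group only remembers its final iteration.
print(repr(m.group(1)))  # 'games '
```

So the substitution has to refer to a group that encloses the whole repetition, not to the repeated group itself.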

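Edit: I just realized that both Fred's and Chris's solutions have a problem when a query word occurs inside another word. For example, if my query is game, it will turn games into <em>game</em>s, whereas I want it not to be highlighted at all. Another example: the the in either should not be highlighted.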

Edit 2: I took Chris's new solution, and it works.

2010-11-19 Sean

First, to get the spacing you want, replace \s* with \s*? so that it is non-greedy.

First attempt:

>>> re.compile(r'(((iphone|games|mac)\s*?)+)', re.I).sub(r'<em>\1</em>', sentence) 
'I love downloading <em>iPhone</em> <em>games</em> from my <em>mac</em>.'

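Unfortunately, once \s* is non-greedy, it splits the phrase, as you can see. With the greedy version it looks like this, grouping the two words together: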

>>> re.compile(r'(((iPhone|games|mac)\s*)+)').sub(r'<em>\1</em>', sentence) 
'I love downloading <em>iPhone games </em>from my <em>mac</em>.'

I can't yet think of how to fix that.

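Note that in these I have put an extra pair of parentheses around the + repetition so that the whole run of matches is captured; that is the difference from your version.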

Further update: actually, I can think of a way to fix it. You decide whether you want it like this.

>>> regex = re.compile(r'((iphone|games|mac)(\s*(iphone|games|mac))*)', re.I) 
>>> regex.sub(r'<em>\1</em>', sentence) 
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

Update: taking your point about word boundaries into account, we just need to add \b, the word boundary assertion, in a few places.

>>> regex = re.compile(r'(\b(iphone|games|mac)\b(\s*(iphone|games|mac)\b)*)', re.I) 
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone games from my mac') 
'I love downloading <em>iPhone games</em> from my <em>mac</em>' 
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone gameses from my mac') 
'I love downloading <em>iPhone</em> gameses from my <em>mac</em>' 
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney games from my mac') 
'I love downloading iPhoney <em>games</em> from my <em>mac</em>' 
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney gameses from my mac') 
'I love downloading iPhoney gameses from my <em>mac</em>' 
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone gameses from my mac') 
'I love downloading miPhone gameses from my <em>mac</em>' 
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone games from my mac') 
'I love downloading miPhone <em>games</em> from my <em>mac</em>' 
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone igames from my mac') 
'I love downloading <em>iPhone</em> igames from my <em>mac</em>'
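The pattern above hard-codes the three query words. As a rough sketch of how it could be built from an arbitrary query string, the helper below (the name highlight and its parameters are hypothetical, not part of the answer) escapes each word and joins them into the same "word, then more words separated by whitespace" shape. It assumes the query words consist of word characters, since \b behaves differently next to punctuation, and it uses \s+ instead of \s* between words.

```python
import re

def highlight(query, sentence, tag="em"):
    """Wrap runs of whole-word query matches in <tag>...</tag>."""
    terms = query.split()
    if not terms:
        # An empty alternation would match at every word boundary.
        return sentence
    # Escape each word so characters such as '+' or '.' are matched literally.
    words = "|".join(re.escape(w) for w in terms)
    # One whole word, then any number of further whole words
    # separated by whitespace.
    pattern = r'\b(?:%s)\b(?:\s+(?:%s)\b)*' % (words, words)
    # \g<0> inserts the entire match into the replacement.
    repl = r'<%s>\g<0></%s>' % (tag, tag)
    return re.sub(pattern, repl, sentence, flags=re.I)

print(highlight("iphone games mac",
                "I love downloading iPhone games from my mac."))
# I love downloading <em>iPhone games</em> from my <em>mac</em>.
```

Using \g<0> avoids the extra outer capturing group, because the replacement refers to the whole match directly.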

2010-11-19 02:25:22

He has already taken care of case sensitivity with 're.I'. – snapshoe 2010-11-19 02:29:54

True, I missed that. I guess that's why he's using re.compile rather than re.sub: it seems 'flags' can only be passed to re.sub in newer versions of Python. – 2010-11-19 02:31:09

Thanks! The last one is perfect. – Sean 2010-11-19 03:19:01
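Regarding the comment above about flags: on Python versions where re.sub accepts a flags argument (2.7 and later), a sketch of the one-call form looks like this. The simplified single-word pattern is only for illustration. Passing flags by keyword matters, because the fourth positional argument of re.sub is count.

```python
import re

sentence = "I love downloading iPhone games from my mac."
pattern = r'\b(iphone|games|mac)\b'

# Pass flags by keyword: the fourth positional argument is count.
result = re.sub(pattern, r'<em>\1</em>', sentence, flags=re.I)
print(result)

# An inline flag at the start of the pattern has the same effect.
inline = re.sub(r'(?i)' + pattern, r'<em>\1</em>', sentence)
print(inline == result)
```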

>>> r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I) 
>>> r.sub(r'\1<em>\2</em>', sentence) 
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

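Wrapping the whole plus repetition in an additional group avoids losing words, and moving the whitespace to before each word (while first capturing the leading whitespace separately) takes care of the spacing problem. The word boundary assertions force whole-word matches on the 3 words between them. However, NLP is hard, and there will still be cases where this doesn't work as expected.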
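To make the roles of the two groups concrete, the following sketch lists what each one captures for every match of the pattern above. The variable name groups is only for illustration.

```python
import re

sentence = "I love downloading iPhone games from my mac."
r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)

# group 1: leading whitespace, re-emitted outside the tag
# group 2: the run of matched words, placed inside the tag
groups = [m.groups() for m in r.finditer(sentence)]
print(groups)  # [(' ', 'iPhone games'), (' ', 'mac')]
```

Because the leading whitespace lands in group 1, the replacement r'\1<em>\2</em>' puts it back in front of the opening tag instead of inside it.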

2010-11-19 03:17:47
