re.sub replaces too much text


['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&amp;emc=rss" rel="standout"></atom:link>', 
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>', 
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&amp;emc=rss', 
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>', 
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&amp;emc=rss']

I'm trying to iterate over them and remove everything that comes after html. So I have:

import re

cleanitems = []

for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))

This returns:

['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.', 
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.', 
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.', 
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.', 
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']

I'm confused as to why the html is removed along with the capture group. Thanks for your help.

Source

2017-06-20 snapcrack

You are also removing 'html'. Put 'html' in the replacement string to keep it. –

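html is part of the matched text too, not just the (...) group. re.sub() replaces the entire matched text, not only what the group captured.

Either include the literal html text in the replacement: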

cleanitems.append(re.sub(r'html(.*)', 'html', item))

or put that part in a capture group and refer back to it in the replacement:

cleanitems.append(re.sub(r'(html).*', r'\1', item))
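To make the difference between the whole match and the capture group concrete, here is a minimal sketch using one of the URLs from the question. The variable names (`whole_match`, `captured`, `fixed_literal`, `fixed_backref`) are only illustrative and not part of the original code.

```python
import re

item = 'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&amp;emc=rss'

match = re.search(r'html(.*)', item)
whole_match = match.group(0)  # everything the pattern matched, 'html' included
captured = match.group(1)     # only the text matched inside the parentheses

# re.sub() replaces group(0), so 'html' has to be put back explicitly.
fixed_literal = re.sub(r'html(.*)', 'html', item)
fixed_backref = re.sub(r'(html).*', r'\1', item)

print(whole_match)
print(captured)
print(fixed_literal)
```

Both fixes should give the same result here, since `\1` simply expands to the literal `html` that the group captured.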

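You may want to consider including the . dot, so that you really only match the .html extension, together with a non-greedy match and a $ end-of-string anchor. Bear in mind that the match still begins at the first .html in the string, so a URL whose path contains .html more than once is cut at the first occurrence: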

cleanitems.append(re.sub(r'\.html.*?$', r'.html', item))
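As a rough check of how this pattern behaves when .html appears more than once, the sketch below uses a made-up URL (`example.com` and its path are assumptions, not taken from the question). The second pattern is an alternative that is not part of the original answer: a greedy group that keeps everything up to the last .html.

```python
import re

# Hypothetical URL with '.html' twice in the path (not from the question).
item = 'http://example.com/archive.html/story.html?partner=rss'

# The regex engine tries the leftmost starting position first, so the
# non-greedy '.*?' followed by '$' still consumes the rest of the string.
first_cut = re.sub(r'\.html.*?$', r'.html', item)

# Alternative sketch: a greedy '.*' inside the group backtracks only as far
# as the last '.html', so the text up to that point is kept.
last_cut = re.sub(r'^(.*\.html).*$', r'\1', item)

print(first_cut)
print(last_cut)
```

Which behavior is wanted depends on the data; for the URLs in the question, which contain .html only once, both patterns give the same result.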

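However, if your goal is to remove the query string from a URL, consider parsing the URL with urllib.parse.urlparse() and rebuilding it without the query string or fragment identifier: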

from urllib.parse import urlparse

cleanitems.append(urlparse(item)._replace(query='', fragment='').geturl())
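Put together as a runnable sketch, using two of the URLs from the question that carry a query string (the `links` list here is just that two-item subset):

```python
from urllib.parse import urlparse

links = [
    'http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&amp;emc=rss" rel="standout"></atom:link>',
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&amp;emc=rss',
]

cleanitems = []
for item in links:
    parsed = urlparse(item)
    # Everything after '?' lands in the query component, so blanking the
    # query (and the fragment) drops it when the URL is rebuilt.
    cleanitems.append(parsed._replace(query='', fragment='').geturl())

for url in cleanitems:
    print(url)
```

In the first URL the trailing markup comes after the `?`, so it is discarded together with the query string.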

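However, this won't remove the stray HTML fragments, such as a trailing </guid> that ends up in the path. If you are extracting these URLs from an HTML document, consider using a real HTML parser rather than regular expressions.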
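The strings in the question look as if they come from an RSS/Atom feed rather than an HTML page, so the sketch below uses the standard-library XML parser, xml.etree.ElementTree, instead of an HTML parser. That is an assumption: the real feed is not shown in the question, and the `feed` snippet here is a hypothetical, heavily trimmed stand-in built around two of the URLs.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical feed snippet; the structure of the real feed is an assumption.
feed = """<rss version="2.0">
  <channel>
    <item>
      <link>http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&amp;emc=rss</link>
      <guid>http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed)

cleanitems = []
for element in root.iter():
    # The parser hands back only the element text, so no closing tags or
    # attributes are left attached to the URL.
    if element.tag in ('link', 'guid') and element.text:
        url = element.text.strip()
        cleanitems.append(urlparse(url)._replace(query='', fragment='').geturl())

for url in cleanitems:
    print(url)
```

With the markup handled by the parser, the URL cleanup only has to deal with the query string.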

Source

2017-06-20 09:02:47

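Interesting, I thought it only captured the group in parentheses. Also, I guess I haven't kicked my bad habits yet. SO will let me accept the answer in 9 minutes. Thanks. – snapcrack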

Just adding to Martijn's answer.

You can also use a lookbehind assertion so that the match covers only what follows the html text:

cleanitems.append(re.sub(r'(?<=html).*', '', item))

Or use a replacement string to keep the initial part:

cleanitems.append(re.sub(r'(html).*', r'\1', item))
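A small side-by-side sketch of the two variants, run on two of the URLs from the question. The names `with_lookbehind` and `with_backref` are illustrative only.

```python
import re

links = [
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
    'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&amp;emc=rss',
]

# Lookbehind: 'html' must precede the match but is not part of it,
# so replacing the match with '' leaves 'html' in place.
with_lookbehind = [re.sub(r'(?<=html).*', '', item) for item in links]

# Backreference: 'html' is matched, then written back via \1.
with_backref = [re.sub(r'(html).*', r'\1', item) for item in links]

for url in with_lookbehind:
    print(url)
```

Python's re module requires lookbehind patterns to have a fixed width, which a literal like `html` satisfies.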

But as Martijn already said, you are better off using the urllib module to parse the URLs properly.

Source

2017-06-20 09:11:18

Great suggestion, thanks a lot – snapcrack
