Applying re.sub replaces too much text
I have a set of links like these:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.html?partner=rss&emc=rss" rel="standout"></atom:link>',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.html</guid>',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.html?partner=rss&emc=rss']
I am trying to iterate over them and remove everything that comes after html. So I have:
import re

cleanitems = []
for item in links:
    cleanitems.append(re.sub(r'html(.*)', '', item))
This returns:
['http://www.nytimes.com/2016/12/31/us/politics/house-republicans-health-care-suit.',
'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.',
'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.',
'http://www.nytimes.com/2016/12/30/obituaries/among-deaths-in-2016-a-heavy-toll-in-pop-music.',
'http://www.nytimes.com/video/world/100000004830728/daybreak-around-the-world.']
I am confused as to why html is removed along with the capture group. Thanks for your help.
You are also removing 'html', because it is part of the match. Put 'html' in the replacement string to keep it.
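To illustrate the comment: re.sub replaces the entire match, not just the capture group, so the literal 'html' in the pattern is removed along with everything after it. Below is a minimal sketch of two ways to keep it, using two of the links from the question. The variable names kept_by_replacement and kept_by_lookbehind are made up for this example.

```python
import re

# Two sample links taken from the question.
links = [
    'http://www.nytimes.com/2016/12/31/nyregion/bronx-murder-40th-precinct-police-residents.html</guid>',
    'http://www.nytimes.com/2016/12/30/movies/tyrus-wong-dies-bambi-disney.html?partner=rss&emc=rss',
]

# Option 1: match 'html' plus the tail, then put 'html' back via the replacement string.
kept_by_replacement = [re.sub(r'html.*', 'html', item) for item in links]

# Option 2: a lookbehind requires 'html' before the match without consuming it,
# so only the tail is replaced with the empty string.
kept_by_lookbehind = [re.sub(r'(?<=html).*', '', item) for item in links]

for url in kept_by_replacement:
    print(url)
```

Both options cut at the first occurrence of 'html' in the string. If a URL could contain 'html' earlier in its path, a stricter pattern such as r'\.html.*' (with '.html' as the replacement) would probably be safer.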