Removing HTML tags and the text inside them from a string in Python

I am very new to regular expressions. Basically, I want to use a regular expression to remove <sup> ... </sup> from a string.

Input:

<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>

Output:

<b>something here</b>, another here

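Is there a simple way to do this, ideally with an explanation?

Note: this question may be a duplicate. I searched but could not find a solution.

Source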

2016-08-19 titipata

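Regular expressions are not the way to handle HTML; use an HTML parser. HTML is not a simple string, it is structured data. The easiest one to use is BeautifulSoup, but it is only a wrapper around more efficient libraries, which you can also use directly. –

I have a list of short strings like the one above. I would like to use a regular expression rather than an HTML parser. – titipata

The hard part is knowing how to make a minimal match, rather than a maximal one, between the tags. This works.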
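For comparison, the parser-based route suggested in the first comment could be sketched roughly as follows. This is only an illustration using the standard library's html.parser rather than BeautifulSoup; the names SupStripper and strip_sup are made up for this example, and the sketch silently drops comments and doctype declarations because it does not handle them.

```python
from html.parser import HTMLParser


class SupStripper(HTMLParser):
    """Hypothetical helper: rebuild the input, skipping <sup>...</sup>."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.parts = []  # pieces of the output string
        self.depth = 0   # number of <sup> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.depth += 1
        elif self.depth == 0:
            # re-emit the start tag exactly as it appeared in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/>
        if self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "sup":
            self.depth = max(self.depth - 1, 0)
        elif self.depth == 0:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append("&#%s;" % name)


def strip_sup(html):
    parser = SupStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
print(strip_sup(s))
```

Unlike a non-greedy regular expression, the depth counter also copes with a <sup> nested inside another <sup>, at the cost of considerably more code.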

import re
s0 = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"
# .*? is non-greedy: it stops at the first </sup> it can reach
prog = re.compile(r'<sup>.*?</sup>')
s1 = prog.sub('', s0)
print(s1)
# <b>something here</b>, another here
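Since the question mentions a list of short strings, the same idea can be applied to every item with one compiled pattern. The following is a sketch under a few assumptions that go beyond the original answer: the name remove_sup and the sample strings are hypothetical, and the pattern is loosened to tolerate attributes, upper-case tags, and line breaks inside the tag body.

```python
import re

# Compile once and reuse for every string.
# \b[^>]* tolerates attributes such as <sup class="ref">
# DOTALL lets "." cross line breaks; IGNORECASE also matches <SUP>
SUP_RE = re.compile(r"<sup\b[^>]*>.*?</sup>", re.DOTALL | re.IGNORECASE)


def remove_sup(text):
    """Return text with every <sup>...</sup> element removed."""
    return SUP_RE.sub("", text)


# hypothetical sample data
snippets = [
    "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>",
    'x<sup class="ref">2</sup> squared',
    "no superscripts at all",
]

cleaned = [remove_sup(s) for s in snippets]
print(cleaned)
```

A non-greedy match stops at the first </sup> it finds, so this would leave debris behind if one <sup> were nested inside another; for input like that, a parser is the safer choice.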

Source

2016-08-19 19:52:47

Ryan beat me to it with the same answer. –

Thanks @Terry. This is very nice :) – titipata

You can do something like this:

import re
s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# non-greedy: each <sup>...</sup> pair is matched and removed separately
s2 = re.sub(r'<sup>(.*?)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>, another here

Remember to use (.*?), because (.*) uses what is called a greedy quantifier, and you would get a different result:

s2 = re.sub(r'<sup>(.*)</sup>', "", s)

print(s2)
# Prints: <b>something here</b>
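To see why the two patterns behave differently, it can help to look at what each one actually matches before anything is removed. A small sketch using re.findall on the same string (the variable names lazy and greedy are just for illustration):

```python
import re

s = "<b>something here</b><sup>1</sup><sup>,3</sup>, another here<sup>1</sup>"

# Non-greedy: ".*?" takes as few characters as possible,
# so every <sup>...</sup> pair becomes its own match.
lazy = re.findall(r"<sup>.*?</sup>", s)

# Greedy: ".*" takes as many characters as possible,
# so one match runs from the first <sup> to the last </sup>.
greedy = re.findall(r"<sup>.*</sup>", s)

print(lazy)
# ['<sup>1</sup>', '<sup>,3</sup>', '<sup>1</sup>']
print(greedy)
# ['<sup>1</sup><sup>,3</sup>, another here<sup>1</sup>']
```

The greedy match swallows ", another here" along with the tags, which is why that text disappears from the second result above.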

Source

2016-08-19 19:48:43 Ryan

Thanks @Ryan, this is exactly what I was looking for. – titipata
