Remove specific HTML tags with Python

I have some HTML tables inside HTML cells, like this:

miniTable = '''<table style="width: 100%%" bgcolor="%s">
       <tr><td><font color="%s"><b>%s</b></td></tr>
      </table>''' % (bgcolor, fontColor, floatNumber)

html += '<td>' + miniTable + '</td>'

Is there a way to remove the HTML tags that belong to this miniTable, and only those tags?
I would like to somehow remove these tags:

<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b> 
and 
</b></td></tr></table>

to get this:

floatNumber

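where floatNumber is the string representation of a floating-point number. I don't want any other HTML tags to be modified in any way. I was thinking of using string.replace or a regular expression, but I'm stumped.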

2012-07-13 jh314

If you can't install and use Beautiful Soup (otherwise BS is preferred, as @Otto Allmendinger suggested):

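import re

# Example mini-table; the %s placeholders are left unformatted here
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# Strip the table, tr, td, font and b tags, then parse what remains
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|<font[^>]+>|</?b>", "", s))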
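Note that the pattern above removes every table, tr, td, font and b tag in the string it is given, so it is only safe on the mini-table fragment by itself. If the whole document has to be processed and the surrounding tags must stay untouched, one option is to match the complete wrapper and keep only the captured text. This is a minimal sketch, assuming the mini-table always follows the question's template exactly; `MINI_TABLE`, `strip_mini_tables` and the sample colors are made-up names and values for illustration:

```python
import re

# Matches only the mini-table wrapper from the question and captures its text.
# Assumption: the wrapper always has exactly this tag sequence. The question's
# template leaves <font> unclosed, so </font> is treated as optional.
MINI_TABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>(.*?)</b>(?:</font>)?</td></tr>\s*'
    r'</table>',
    re.DOTALL,
)


def strip_mini_tables(html):
    """Replace each mini-table with the text inside its <b> element."""
    return MINI_TABLE.sub(r"\1", html)


# Hypothetical document: an outer table whose cell holds one mini-table.
html = (
    '<table><tr><td>'
    '<table style="width: 100%" bgcolor="#ff0000"> '
    '<tr><td><font color="#ffffff"><b>1.23</b></td></tr> '
    '</table>'
    '</td></tr></table>'
)

print(strip_mini_tables(html))
# <table><tr><td>1.23</td></tr></table>
```

The outer table survives because it lacks the style and bgcolor attributes the pattern requires. Any change to the template (attribute order, extra whitespace inside the row) would need a matching change to the pattern.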

2012-07-13 14:43:20 fedosov

For my application, this works great! If I could use Beautiful Soup, Otto's solution would be great too. – jh314 2012-07-13 15:22:29

Do not use str.replace or regex.

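Use an HTML parsing library such as Beautiful Soup, find the elements you want, and take the text they contain.

The final code should look something like this: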

from bs4 import BeautifulSoup

# html_doc is the HTML string you built; name the parser explicitly
soup = BeautifulSoup(html_doc, "html.parser")

for t in soup.find_all("table"):  # the actual selection depends on your specific code
    content = t.get_text()
    # content should be the float number
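Where Beautiful Soup cannot be installed, the standard library's `html.parser` can do a similar text extraction. The following is only a sketch, assuming Python 3 and that the mini-table fragment is available as a string on its own; `TextExtractor`, `text_of` and the sample colors are hypothetical names and values, not part of any library:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a fragment and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags, including whitespace.
        self.parts.append(data)


def text_of(fragment):
    """Return the concatenated text of an HTML fragment, stripped."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


# Same shape as the question's template, including the unclosed <font>.
mini_table = '''<table style="width: 100%" bgcolor="#ff0000">
       <tr><td><font color="#ffffff"><b>1.23</b></td></tr>
      </table>'''

print(float(text_of(mini_table)))
# 1.23
```

Unlike an XML parser, `html.parser` does not complain about the unclosed font tag, which makes it a reasonable fit for the template as written.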

2012-07-13 14:40:06

Thanks for the quick reply! I'm using a proprietary development environment, so I can't install and use Beautiful Soup. – jh314 2012-07-13 14:43:30

If the HTML is well-formed, you could also try Python's built-in XML parser. – 2012-07-13 14:46:47
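A rough illustration of the built-in XML parser route, using `xml.etree.ElementTree`. It assumes the fragment really is well-formed, which the question's template is not (its font tag is never closed), so the sample below closes it explicitly; `table_text` and the sample colors are hypothetical:

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the mini-table: unlike the question's template,
# <font> is closed here. Without that, the parser raises ParseError.
mini_table = (
    '<table style="width: 100%" bgcolor="#ff0000">'
    '<tr><td><font color="#ffffff"><b>1.23</b></font></td></tr>'
    '</table>'
)


def table_text(fragment):
    """Parse a well-formed fragment and return all of its text, stripped."""
    root = ET.fromstring(fragment)
    return "".join(root.itertext()).strip()


value = float(table_text(mini_table))
print(value)
# 1.23
```

Feeding the original template with the unclosed font tag to `ET.fromstring` fails with `ET.ParseError`, so this approach only applies if you control the generated HTML and can close the tag.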

Funny, but [BS4 uses `re` to parse XHTML](http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L482). Don't use regex? OK. – fedosov 2012-07-13 14:51:58
