Removing HTML from a string in Python

How do I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

Source

2009-02-28 user29772

This may be overkill for your purposes, but if your strings contain more complex or malformed HTML, try BeautifulSoup. Caveat: I don't think it works with Python 3.0 yet. – bernie 2009-02-28 22:51:17

You can use a regular expression to strip all the tags:

>>> import re 
>>> s = 'blah blah <a href="blah">link</a>' 
>>> re.sub('<[^>]*>', '', s) 
'blah blah link'

Source

2009-02-28 22:43:17

You could simplify your regex to '<.*?>', which produces the same result, but like the pattern above it assumes well-formed HTML. – UnkwnTech 2009-02-28 22:45:00

Do you need to check for quotes? Or are those not allowed? 2009-02-28 22:45:42

@Unkwntech: I prefer <[^>]*> over <.*?> because the former doesn't need to keep backtracking to find the end of the tag. – 2009-02-28 22:50:19
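The two patterns discussed in these comments are not fully interchangeable. As a small sketch (the multi-line sample string is made up for illustration): `.` does not match a newline by default, so '<.*?>' skips a tag that spans two lines unless re.DOTALL is passed, while the character class in '<[^>]*>' matches newlines as well.

```python
import re

# Hypothetical sample: a tag whose attributes continue on the next line.
s = 'blah blah <a\n href="blah">link</a>'

# [^>] matches any character except '>', including '\n'.
negated = re.sub(r'<[^>]*>', '', s)

# '.' does not match '\n' by default, so the opening tag is left behind.
lazy = re.sub(r'<.*?>', '', s)

# With re.DOTALL, '.' matches '\n' and both patterns agree again.
lazy_dotall = re.sub(r'<.*?>', '', s, flags=re.DOTALL)

print(repr(negated))      # 'blah blah link'
print(repr(lazy))         # 'blah blah <a\n href="blah">link'
print(repr(lazy_dotall))  # 'blah blah link'
```

Both patterns still assume that '>' never appears inside an attribute value.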

Try Beautiful Soup. Throw away everything except the text.

Source

2009-02-28 22:52:16

>>> import re 
>>> s = 'blah blah <a href="blah">link</a>' 
>>> q = re.compile(r'<.*?>') 
>>> q.sub('', s) 
'blah blah link'

Source

2009-02-28 23:23:36 riza

When your regex solution hits a wall, try BeautifulSoup, which is super simple (and reliable).

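# BeautifulSoup 4; version 3 used `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup

html = "<a> Keep me </a>"
soup = BeautifulSoup(html, "html.parser")

# Collect every text node and join them into one string.
text_parts = soup.find_all(string=True)
text = ''.join(text_parts)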
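If installing a third-party package is not an option, a similar "keep only the text nodes" idea can be sketched with the standard library's html.parser module. This is only a rough stand-in and not how BeautifulSoup works internally; the names TextExtractor and strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(repr(strip_tags('<a> Keep me </a>')))
print(repr(strip_tags('blah blah <a href="blah">link</a>')))
```

Note that this sketch also keeps the contents of script and style elements, since the parser reports them as data.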

Source

2009-03-01 02:00:18 Triptych

There is also a small library called stripogram, which can be used to strip some or all HTML tags. You can use it like this:

from stripogram import html2text, html2safehtml 
# Only allow <b>, <a>, <i>, <br>, and <p> tags 
clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p")) 
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces 
# and a page that's 80 characters wide. 
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)

So if you simply want to strip out all HTML, pass valid_tags=() to the first function.

You can find the documentation here.

Source

2009-03-01 14:45:46 MrTopf

html2text will do something like this.

Source

2009-03-01 18:38:03 RexE

Regexes, BeautifulSoup, and html2text don't work if an attribute has a '>' in it. See Is ">" (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

A solution based on an HTML/XML parser may help in such cases; for example, stripogram, suggested by @MrTopf, does work.

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
# Wrap the fragment in a single root element so it parses as XML.
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
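To make the attribute problem concrete, here is a small comparison using only the standard library. The input string is invented for illustration, and the XML parser only accepts it because the fragment happens to be well-formed.

```python
import re
from xml.etree import ElementTree as etree  # stdlib

# Hypothetical input with '>' inside a quoted attribute value.
str_ = 'blah <a title="a > b" href="blah">link</a> END'

# The regex ends the "tag" at the first '>', which is inside the attribute,
# so part of the attribute text leaks into the result.
regex_result = re.sub(r'<[^>]*>', '', str_)

# The XML parser understands quoting and returns only the text nodes.
root = etree.fromstring('<html>%s</html>' % str_)
parser_result = ''.join(root.itertext())

print(repr(regex_result))
print(repr(parser_result))
```

The trade-off is that an XML parser raises an error on markup that is not well-formed, such as unclosed tags, which is common in real-world HTML.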

Source

2009-03-01 20:42:41 jfs

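I just wrote this because I needed it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text; print it, store it, feed it to your pet canary.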

import html2text


class TextFromHtml2Text:

    def __init__(self, url=''):
        # Despite the name, `url` is a path to a local HTML file.
        if url == '':
            raise TypeError("Needs a URL")
        self.text = ""
        self.url = url
        self.html = ""
        self.gethtmlfile()
        self.maytheswartzbewithyou()

    def gethtmlfile(self):
        # Read the whole file into self.html.
        with open(self.url) as f:
            self.html = f.read()

    def maytheswartzbewithyou(self):
        # Convert the HTML to plain text with html2text.
        self.text = html2text.html2text(self.html)

Source

2012-06-29 17:41:43

Here is a simple way to do it:

def remove_html_markup(s): 
    tag = False 
    quote = False 
    out = "" 

    for c in s: 
        if c == '<' and not quote: 
            tag = True 
        elif c == '>' and not quote: 
            tag = False 
        elif (c == '"' or c == "'") and tag: 
            quote = not quote 
        elif not tag: 
            out = out + c 

    return out
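One limitation of this loop is that any quote character toggles the quote flag, so an attribute such as title="it's" leaves the scanner stuck in quoted mode. A possible variant, sketched here under the assumption that an attribute value is closed by the same character that opened it, remembers which quote character is currently open. The name remove_html_markup_quoted is made up for this sketch.

```python
def remove_html_markup_quoted(s):
    tag = False
    quote = None  # the quote character currently open inside a tag, if any
    out = []

    for c in s:
        if quote is not None:
            # Inside a quoted attribute value: only the matching quote ends it.
            if c == quote:
                quote = None
        elif tag:
            if c == '>':
                tag = False
            elif c in ('"', "'"):
                quote = c
        elif c == '<':
            tag = True
        else:
            # Plain text outside any tag is kept, quotes included.
            out.append(c)

    return ''.join(out)


print(remove_html_markup_quoted('<a title="it\'s > here">link</a> END'))
```

Collecting characters in a list and joining once at the end avoids repeated string concatenation, but the scanning logic is otherwise the same idea.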

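The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (on smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)

Source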

2013-01-22 17:31:08 Medeiros
