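How can I get BeautifulSoup to parse the contents of textarea tags as HTML?

Prior to version 3.0.5, BeautifulSoup treated the contents of a textarea as HTML. It now treats them as text. The document I am parsing has HTML inside its textarea tags, and I am trying to process it. How can I get BeautifulSoup to parse the contents of textarea tags as HTML?

I have tried: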

for textarea in soup.findAll('textarea'):
    contents = BeautifulSoup.BeautifulSoup(textarea.contents)
    textarea.replaceWith(contents.html(text=True))

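but I get an error. I can't find anything about this in the documentation, and the alternative parsers don't help. Does anyone know how I can parse textareas as HTML?

Edit:

The sample HTML is: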

<textarea class="ks-lazyload-custom"> 
    <div class="product-view product-view-rug"> 
    Foobar Womble 
    <div class="product-view-head"> 
     <img src="tps/i1/fo-25.gif" /> 
    </div> 
    </div> 
</textarea>

The error is:

File "D:\src\cross\tserver\src\tools\sitecrawl\BeautifulSoup.py", line 1913, in _detectEncoding
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer

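What I'm looking for is a way to take the element, extract its contents, parse them with BeautifulSoup, collapse the result to text, and then replace the original element's contents (or the whole element) with that text.

As for real world versus the spec, that isn't particularly relevant here. The data needs to be parsed, and I'm looking for a way to do it.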


2010-04-19 brofield

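Could you include a small HTML snippet? – 2010-04-19 05:54:13

You are trying to get an HTML parser to support a data structure that the HTML specification forbids. You should take a step back and find another way to solve the problem (i.e. one that does not depend on textareas containing anything other than CDATA). – Quentin 2010-04-19 05:54:29

Could you post the error output? Without it we don't have much to go on. – 2010-04-19 05:54:37

This seems to work reasonably well (if I understand correctly what you want):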

for textarea in soup.findAll('textarea'): 
    contents = BeautifulSoup.BeautifulSoup(textarea.contents[0]).renderContents() 
    textarea.replaceWith(contents)
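
A likely reading of the traceback in the question: `textarea.contents` is a list of child nodes, and the encoding detection runs a regular expression over whatever the constructor receives, so passing the list fails while passing `contents[0]` (the single text node) works. The sketch below reproduces only that regex step with the standard library; the `contents` list and the `try_match` helper are stand-ins invented for illustration, not part of BeautifulSoup.

```python
import re

# The pattern quoted in the traceback from the question.
xml_declaration = re.compile(r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')

# Hypothetical stand-in for textarea.contents: a list holding one text node.
contents = ['<div class="product-view">Foobar Womble</div>']


def try_match(value):
    """Return 'ok' if the regex accepts the value, else the exception name."""
    try:
        xml_declaration.match(value)
        return "ok"
    except TypeError as exc:
        return type(exc).__name__


list_result = try_match(contents)        # whole list, as in the question
string_result = try_match(contents[0])   # first child, as in the answer
print(list_result, string_result)
```

The regex only accepts a string (or buffer), which is why indexing into `contents` makes the difference.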


2010-04-19 17:45:30

Thanks, that does seem to do what I'm after. – brofield 2010-04-20 02:54:23

I'm now using the following code, which mostly works. Your mileage may vary.

def _extractText(self, data, encoding):
    if self.isDebug:
        self._output("_extractText")
    soup = BeautifulSoup.BeautifulSoup(data, fromEncoding=encoding)
    # Remove comments, scripts and stylesheets so that only visible text is left.
    comments = soup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
    for comment in comments:
        comment.extract()
    for script in soup.findAll('script'):
        script.extract()
    for css in soup.findAll('style'):
        css.extract()
    # Re-parse each textarea's raw contents as HTML and keep only the resulting text.
    for textarea in soup.findAll('textarea'):
        textarea.string = self._extractText(textarea.renderContents(), 'UTF-8')
    # Collect the non-empty text nodes, one per line, skipping the doctype.
    text = unicode('')
    for line in soup.findAll(text=True):
        line = line.replace('&nbsp;', ' ').strip()
        if line == '':
            continue
        if line.startswith('doctype') or line.startswith('DOCTYPE'):
            continue
        text = text + line + '\n'
    return text
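
For readers on Python 3, where BeautifulSoup 3 does not run, roughly the same idea can be sketched with only the standard library's `html.parser`. This is not the code above, just an illustration under stated assumptions: the names `TextExtractor` and `extract_text` are made up here, and `html.parser` appears to differ between Python releases in how it treats a textarea body (some hand it back as raw text, others tokenize the tags inside it), so the sketch re-parses any textarea data that still contains a `<`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; textarea bodies are parsed again as HTML."""

    SKIPPED = {"script", "style"}  # elements whose text is dropped

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._skip_depth = 0
        self._in_textarea = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "textarea":
            self._in_textarea = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "textarea":
            self._in_textarea = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_textarea and "<" in data:
            # The body arrived as raw markup: parse it a second time.
            self.lines.extend(extract_text(data).splitlines())
            return
        line = data.replace("\xa0", " ").strip()  # &nbsp; decodes to \xa0
        if line:
            self.lines.append(line)


def extract_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = """<html><head><style>p {color: red}</style></head><body>
<!-- a comment -->
<p>Intro&nbsp;text</p>
<textarea class="ks-lazyload-custom">
    <div class="product-view product-view-rug">
    Foobar Womble
    <div class="product-view-head">
     <img src="tps/i1/fo-25.gif" />
    </div>
    </div>
</textarea>
<script>var x = "<b>ignored</b>";</script>
</body></html>"""

print(extract_text(sample))
```

Comments and the doctype need no special handling in this sketch, because `HTMLParser` reports them through separate callbacks that are left at their do-nothing defaults.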


2010-04-19 08:01:37 brofield
