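Problems parsing with BeautifulSoup

I am trying to parse an HTML page with BeautifulSoup, but it looks like BeautifulSoup doesn't like the HTML, or that page, at all. When I run the code below, the prettify() method returns only the script block of the page (see below). Does anybody have an idea why this happens?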

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x on Python 2

url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = urllib2.urlopen(url).read()  # fetch the whole page as one string
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()

Here is the output generated by BeautifulSoup.

-- BeautifulSoup --------------------------------- 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
<script language="JavaScript"> 
<!-- 
    function highlight(img) { 
     document[img].src = "/marketing/sony/images/en/" + img + "_on.gif"; 
    } 

    function unhighlight(img) { 
     document[img].src = "/marketing/sony/images/en/" + img + "_off.gif"; 
    } 
//--> 
</script>

Thanks!

Update: I am using the following version, which appears to be the latest one.

__author__ = "Leonard Richardson ([email protected])" 
__version__ = "3.1.0.1" 
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson" 
__license__ = "New-style BSD"

2009-03-02 Martin

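Try version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so the parser had to be changed from SGMLParser to HTMLParser, and HTMLParser seems to be more vulnerable to bad HTML.

From the changelog for BeautifulSoup 3.1:

"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"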
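
For anyone unsure which series an installed copy belongs to, here is a minimal sketch that classifies a version string. The helper name `uses_htmlparser` is made up for illustration, and the rule it encodes is an assumption drawn from the changelog: 3.1.x releases are HTMLParser-based, while 3.0.x releases such as 3.0.7a are SGMLParser-based.

```python
def uses_htmlparser(version):
    """Return True if a BeautifulSoup 3 version string belongs to the 3.1 series.

    Assumption: only 3.1.x releases are built on HTMLParser; 3.0.x
    releases such as 3.0.7a are built on SGMLParser.
    """
    parts = version.split(".")
    return parts[:2] == ["3", "1"]


print(uses_htmlparser("3.1.0.1"))  # the version from the question
print(uses_htmlparser("3.0.7a"))   # the version recommended in this answer
```

In practice the string to pass in would be the module's `__version__` attribute, the same value shown in the question's update.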

2009-03-02 09:16:27 miles82

More details on this here: http://www.crummy.com/software/BeautifulSoup/3.1-problems.html – FeatureCreep 2009-11-21 19:13:21

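BeautifulSoup isn't magic: if the incoming HTML is too horrible, it isn't going to work.

In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance, it contains markup like:

<SCRIPT TYPE=""JavaScript"">

(Notice the double quoting.)

The BeautifulSoup docs contain a section on what you can do if BeautifulSoup can't parse your markup. You'll need to investigate those alternatives.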
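
One such alternative is to repair the markup before handing it to the parser. The following is only a sketch: `fix_doubled_quotes` and the regular expression are my own illustration, not something from the BeautifulSoup docs, and the sample string assumes the doubled-quote attribute shown above is the only kind of damage being repaired.

```python
import re

# Matches attr=""value"": a non-empty value wrapped in doubled double quotes.
# The value may not contain whitespace or "=", so legitimate empty
# attributes such as alt="" title="" are left untouched.
DOUBLED_QUOTES = re.compile(r'=\s*""([^"<>\s=]+)""')


def fix_doubled_quotes(html):
    """Rewrite attr=""value"" as attr="value" before parsing."""
    return DOUBLED_QUOTES.sub(r'="\1"', html)


broken = '<SCRIPT TYPE=""JavaScript"">var x = 1;</SCRIPT>'
print(fix_doubled_quotes(broken))
```

The cleaned string can then be passed to BeautifulSoup in place of the raw page. Other kinds of broken markup on the page would need their own fixes.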

2009-03-02 04:09:28 Justus

I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1', but give it a try.

2009-03-02 08:31:44

>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()

In my case, executing the statements above returns the entire HTML page.

2009-03-06 07:31:58 aatifh

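Please give a proper reason before downvoting anyone. That would be a bit more ethical. Oh! And if you couldn't understand my answer, then God help you – aatifh 2009-03-09 07:02:35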

I had problems parsing the following code too:

<script> 
     function show_ads() { 
      document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>"); 
     } 
</script>

HTMLParseError: bad end tag: u'', at line 26, column 127

Sam

2009-04-20 11:39:53

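Try lxml. Despite its name, it is also meant for parsing and scraping HTML. It's much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does, so it might work better for you. It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or some other platform where anything that isn't pure Python is not allowed.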

2009-08-03 15:39:32 aehlke

Samj: If I get something like HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>" I just remove the culprit from the markup before I hand it to BeautifulSoup, and all is dandy:

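import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen(url).read()  # url: the page address, as in the question
html = html.replace("</scr' + 'ipt>", "")  # drop the split-up end tag the parser rejects
soup = BeautifulSoup(html)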
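
If the exact culprit string varies from page to page, a broader variant of the same idea is to drop every script element before parsing. This is a sketch rather than a tested fix for the page in the question: `strip_scripts`, the sample page, and the `ads.js` file name are all made up for illustration, and it assumes you don't need the script contents and that no script body contains a literal, unsplit closing script tag.

```python
import re

# Non-greedy match of a whole <script ...>...</script> element,
# case-insensitive and spanning multiple lines.
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Remove every <script> element so split-up end tags never reach the parser."""
    return SCRIPT_BLOCK.sub("", html)


# Hypothetical sample page containing the kind of split tag that trips HTMLParser.
page = """<html><body>
<script>
document.write("<div><sc"+"ript src='ads.js'></scr"+"ipt></div>");
</script>
<p>Product list</p>
</body></html>"""
print(strip_scripts(page))
```

The result of `strip_scripts(html)` would then go to BeautifulSoup in place of the raw page.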

2010-07-13 20:00:35
