Downloading HTML in Python without Unicode errors

I want to save page_source to a file. However, every time I try, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 (or something else) in 
position 8304: ordinal not in range(128)

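I have tried value.encode('utf-8'), but it throws the same exception every time. The only other thing I can think of is manually replacing every non-ASCII character. Is there a way to "preprocess" the HTML to get it into a "writable" format?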


2012-01-09 David542

What is the actual encoding of the file? – 2012-01-09 03:11:08

Use UTF8 _instead of_ ASCII. – SLaks 2012-01-09 03:15:09

Third-party libraries such as BeautifulSoup and lxml can handle encoding issues automatically. But here is a bare-bones example that uses only urllib2:

First, download a web page that contains non-ASCII characters:

>>> import urllib2 
>>> response = urllib2.urlopen('http://www.ltg.ed.ac.uk/~richard/unicode-sample.html') 
>>> data = response.read()

Now look at the "charset" declared at the top of the page:

>>> data[:200] 
'<html>\n<head>\n<title>Unicode 2.0 test page</title>\n<meta 
content="text/html; charset=UTF-8" http-equiv="Content-type"/>\n 
</head>\n<body>\n<p>This page contains characters from each of the 
Unicode\ncharact'

If there is no obvious charset, "UTF-8" is usually a good guess anyway.
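The manual inspection above could be automated roughly as follows. This is only a sketch: `guess_charset` is a hypothetical helper that is not part of the original answer, and it assumes the charset is declared in the first 1024 bytes of the page. It uses a simple regular expression rather than a real HTML parser, so it can be fooled by unusual markup.

```python
import re

# Hypothetical helper for illustration; matches charset=UTF-8 or charset="UTF-8".
_CHARSET_RE = re.compile(br'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
                         re.IGNORECASE)


def guess_charset(data, default='utf-8'):
    """Return the charset declared near the top of `data` (bytes), else `default`."""
    match = _CHARSET_RE.search(data[:1024])
    if match:
        return match.group(1).decode('ascii').lower()
    return default  # no declaration found: fall back to the usual good guess


sample = (b'<html>\n<head>\n<meta content="text/html; charset=UTF-8" '
          b'http-equiv="Content-type"/>\n</head>')
declared = guess_charset(sample)
fallback = guess_charset(b'<html><body>no declaration</body></html>')
```

The returned name can then be passed straight to `data.decode(...)`.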

Finally, convert the web page to Unicode text:

>>> text = data.decode('utf-8')
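To finish the task from the question, the decoded text still has to be written to a file. The sketch below is one way to do that. `save_html`, the sample bytes, and the temporary path are assumptions made for illustration, not part of the original answer. It uses `io.open`, which takes an `encoding` argument in both Python 2.6+ and Python 3, so no implicit ASCII conversion happens on write.

```python
import io
import os
import tempfile


def save_html(data, path, source_encoding='utf-8'):
    """Decode raw page bytes and write them to `path` as UTF-8 text."""
    text = data.decode(source_encoding)             # bytes -> Unicode text
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)                                # encoded to UTF-8 by the file object
    return text


# Stand-in for response.read(): UTF-8 bytes containing non-ASCII characters.
data = b'<p>caf\xc3\xa9 \xc2\xa9</p>'
path = os.path.join(tempfile.mkdtemp(), 'page.html')
text = save_html(data, path)
```

If the page is simply being copied to disk unchanged, opening the file in binary mode (`'wb'`) and writing `data` directly avoids decoding altogether.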


2012-01-09 05:24:17 ekhumoro

Thanks, this solved my problem. When I downloaded the page with a basic Python script, I got an HTML page full of \xce\xbf\xb9 and so on. – 2016-12-12 21:38:54

I'm not sure, but http://www.crummy.com/software/BeautifulSoup/ has a function, .prettify(), that returns well-formatted HTML. You could try using it for the "preprocessing".


2012-01-09 03:11:04

The problem may be that you are trying to go str -> utf-8 when you need to go str -> unicode -> utf-8. In other words, try unicode(s, 'utf-8').encode('utf-8').

For more information, see http://farmdev.com/talks/unicode/.
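The str -> unicode -> utf-8 chain can be illustrated with a short sketch. In Python 2, calling .encode('utf-8') on a byte str first decodes it implicitly with the ASCII codec, and that hidden step is what raises the UnicodeDecodeError. The example below makes the step explicit using bytes literals, so it runs on Python 2.7 and Python 3 alike. The sample bytes and variable names are assumptions made for illustration.

```python
raw = b'\xc2\xa9 2012'  # UTF-8 bytes for a copyright sign followed by " 2012"

# Step that Python 2 performs implicitly inside str.encode(): an ASCII decode.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)  # same kind of message as in the question

# The working route: bytes -> Unicode text -> UTF-8 bytes.
text = raw.decode('utf-8')             # equivalent to unicode(raw, 'utf-8') in Python 2
round_tripped = text.encode('utf-8')
```

Decoding with the correct codec first means the later encode has real Unicode text to work with, so the round trip gives back the original bytes.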


2012-01-09 03:29:08
