Character set problem when scraping a web page with Python and requests

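I'm trying to download a Chinese page (gb2312 according to its meta tag). After running the code below and opening the resulting file in gEdit as gb2312, I get garbled symbols such as ê×××（ò） where there should be Chinese characters.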

Here is the source of the page in question: https://gist.github.com/anonymous/27663069655db7fd7a19 - the actual site is only accessible to educational institutions.

My code:

r = requests.post("http://example.com", data=payload, cookies=cookies) 
f = open('myfile.txt', 'w') 
f.write(r.text.encode('gb2312',errors="ignore")) 
f.close()

The response headers for this page:

{'content-length': '6164', 'x-powered-by': 'ASP.NET', 'date': 'Mon, 11 Mar 2013 05:11:24 GMT', 'cache-control': 'private', 'content-type': 'text/html', 'server': 'Microsoft-IIS/6.0'}

If I try to decode instead of encode, I get this error in Python:

f.write(r.text.decode('gb2312',errors="ignore")) 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2017-2018: ordinal not in range(128)

Source

2013-03-11 user570649

$ python
Python 2.7.3 (default, Jun 18 2012, 09:39:59) 
[GCC 4.5.3] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import urllib 
>>> rsp = urllib.urlopen('https://gist.github.com/anonymous/27663069655db7fd7a19/raw/836a5c55d0f87a2fa5edcc9a14097c945452f520/chinese.html').read() 
>>> import chardet 
>>> chardet.detect(rsp) 
{'confidence': 0.99, 'encoding': 'utf-8'} 
>>> rsp.decode('utf-8') 
u'\n<HTML><HEAD>(snip)</BODY></HTML>\n'

So, don't trust the charset header, I guess?
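
To make that concrete, here is a minimal sketch of decoding the raw bytes yourself instead of relying on `r.text`. The response headers in the question carry no charset, and as far as I know requests falls back to ISO-8859-1 for `text/*` responses in that case, which would explain the garbled output. The helper name `decode_body` and the candidate list are my own assumptions, not part of requests, and the sample bytes stand in for `r.content`. It is written to run on Python 3.

```python
def decode_body(raw, candidates=("utf-8", "gb2312")):
    """Return (text, encoding) for the first candidate that decodes raw strictly.

    Hypothetical helper: `raw` is meant to be the undecoded body (r.content).
    UTF-8 is tried first because strict UTF-8 decoding seldom succeeds by
    accident on bytes that are really in another encoding.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the first candidate and mark bad bytes.
    return raw.decode(candidates[0], errors="replace"), candidates[0]


# Stand-in for r.content: the bytes of a page that is really UTF-8.
raw = u"<title>\u4e2d\u6587</title>".encode("utf-8")
text, used = decode_body(raw)

# The same characters served as gb2312 fail the UTF-8 attempt and use the second candidate.
gb_raw = u"\u4e2d\u6587".encode("gb2312")
gb_text, gb_used = decode_body(gb_raw)

# What the text looks like if the UTF-8 bytes are read as ISO-8859-1 instead.
mojibake = raw.decode("iso-8859-1")
```

With requests you would pass `r.content` (bytes) rather than `r.text`, then write the result with something like `io.open('myfile.txt', 'w', encoding='utf-8')` so the file's encoding is explicit. If chardet is installed, its guess could be put at the front of the candidate list.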

Source

2013-03-11 10:09:34 djc
