bytes to str conversion fails in Python 3

The code is self-explanatory...

$ python3 
Python 3.4.0 (default, Apr 11 2014, 13:05:18) 
[GCC 4.8.2] on linux 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import urllib.request as req 
>>> url = 'http://bangladeshbrands.com/342560550782-44083.html' 
>>> res = req.urlopen(url) 
>>> html = res.read() 
>>> type(html) 
<class 'bytes'> 
>>> html = html.decode('utf-8') # bytes -> str 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66081: invalid start byte


2014-10-30 Dewsworld

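Why not use a module that knows how to properly handle HTML over HTTP? – 2014-10-30 05:08:57

@IgnacioVazquez-Abrams, could you explain? The read() method works for most URLs. – Dewsworld 2014-10-30 05:10:20

The read() method tells you nothing about the charset the server declares for the HTML. – 2014-10-30 05:10:59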

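There seem to be some bad Unicode characters in the data you get from the URL, so some kind of error handling is needed. Why not use requests, i.e. "an HTTP library, written in Python, for human beings", and let it take care of the details: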

$ python3 
Python 3.4.2 (default, Oct 15 2014, 22:01:37) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import requests 
>>> url = 'http://bangladeshbrands.com/342560550782-44083.html' 
>>> r = requests.get(url) 
>>> html_as_text = r.text 
>>> print(html_as_text[66070:66090]) 
ml">Toddler�s items< 
>>>
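
The replacement character (U+FFFD) in that output is consistent with the undecodable byte being substituted instead of raising an error. A minimal standard-library sketch of the same effect, assuming a short made-up byte string (`raw`, hypothetical) in place of the real page:

```python
# Hypothetical stand-in for the bytes around position 66081 of the page.
raw = b'ml">Toddler\x92s items<'

# Strict decoding raises UnicodeDecodeError, as in the question.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD for each undecodable byte.
lenient = raw.decode('utf-8', errors='replace')

print(strict_ok)
print(lenient)
```

This avoids the exception but loses the original character, so it is only a fit when an occasional U+FFFD in the text is acceptable.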


2014-10-30 10:50:45 FredrikHedman

The HTML page may have inconsistent encodings. The Content-Type HTTP header (res.headers.get_content_charset()) says it is 'utf-8'. The <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> inside the HTML document confirms it. But html.decode('utf-8') fails.
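
The header check can be tried without a network connection, since the headers object returned by urlopen() is an http.client.HTTPMessage, a subclass of email.message.Message. A rough sketch, assuming a hand-built message (`headers`, `no_header` are illustrative names) in place of a real response:

```python
from email.message import Message

# Hand-built stand-in for res.headers; the header value is taken from
# what the answer reports for this page.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'
declared = headers.get_content_charset()

# With no Content-Type header at all, the failobj default is returned.
no_header = Message()
guessed = no_header.get_content_charset(failobj='utf-8')

print(declared)
print(guessed)
```

The declared charset is only what the server claims; as this page shows, the bytes may still not be valid in that encoding.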

The problem appears to be the smart quote "’" (U+2019 RIGHT SINGLE QUOTATION MARK). It is encoded with cp1252 as b'\x92' (the byte from the UnicodeDecodeError message). To work around it, you could use UnicodeDammit.detwingle():

from bs4 import UnicodeDammit # $ pip install beautifulsoup4 

text = UnicodeDammit.detwingle(html).decode('utf-8')

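For this particular file, though, html.decode('cp1252') produces the same result, i.e., it may simply be an incorrect character encoding declaration by the HTTP server and the HTML authoring tool.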
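
If pulling in bs4 is not an option, a cruder fallback can be sketched with the standard library alone. The helper name decode_html and its default arguments are assumptions for illustration, not part of any library:

```python
def decode_html(raw, declared='utf-8', fallback='cp1252'):
    """Decode raw bytes with the declared charset, else with a fallback.

    Hypothetical helper: the fallback re-decodes the WHOLE document, so
    any genuine multi-byte UTF-8 sequences would come out as mojibake.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined, hence errors='replace'.
        return raw.decode(fallback, errors='replace')


sample = b'Toddler\x92s items'  # made-up excerpt containing the 0x92 byte
text = decode_html(sample)
print(text)
```

Unlike UnicodeDammit.detwingle(), which targets documents mixing UTF-8 and Windows-1252, this all-or-nothing approach only suits pages that are really cp1252 throughout, as this one appears to be.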


2014-10-30 16:24:18 jfs
