Encoding error when parsing RSS with lxml
I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.
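import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()  # raw bytes of the feed
encd = chardet.detect(response)['encoding']  # encoding guessed by chardet
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)  # this call raises the error shown below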
But I get an error:
tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
When I run the script without the encoding parameter... ;/ I still get the same error. Why does etree.XMLParser fail even though I pass the correct encoding? – domi 2011-04-28 00:45:50
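It is working now, but I had to upgrade lxml to version 2.2.8, because with 2.2.4 I could not parse remote URLs. Also, the code from my question works once I change this line: tree = etree.parse(StringIO.StringIO(response), parser) – domi 2011-04-28 20:46:39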
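For anyone hitting the same traceback: it passes through `_parseDocumentFromURL`, which suggests that `etree.parse()` treated the downloaded string as a file name or URL rather than as the document itself. The fix reported above wraps the bytes in a file-like object. Below is a minimal sketch of that idea, not a definitive implementation. It assumes a small in-memory feed (`SAMPLE_RSS`, a made-up stand-in for the downloaded bytes) instead of a network request, and the helper name `parse_feed` is hypothetical. It uses `io.BytesIO`, which exists in both Python 2.6+ and Python 3, in place of `StringIO.StringIO`.

```python
import io

from lxml import etree

# Hypothetical stand-in for the downloaded feed: UTF-8 bytes that
# contain non-ASCII (Polish) characters, as the real feed does.
SAMPLE_RSS = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<rss version="2.0"><channel>'
    u'<title>Wiadomo\u015bci</title>'
    u'<item><title>Zd\u0105\u017cy\u0107</title></item>'
    u'</channel></rss>'
).encode('utf-8')


def parse_feed(data):
    """Parse raw feed bytes; `data` is the document, not a path or URL."""
    parser = etree.XMLParser(ns_clean=True, recover=True)
    # Wrapping the bytes in a file-like object makes etree.parse()
    # read them as content instead of trying to open them as a file.
    return etree.parse(io.BytesIO(data), parser)


tree = parse_feed(SAMPLE_RSS)
titles = [t.text for t in tree.iter('title')]

# etree.fromstring() also accepts the bytes directly and returns
# the root element rather than an ElementTree.
root = etree.fromstring(SAMPLE_RSS, etree.XMLParser(recover=True))
```

In this sketch the parser picks up the encoding from the XML declaration, so no `encoding=` argument or chardet guess is passed. Whether that is enough for the real feed depends on the feed declaring its encoding correctly.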