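I'll start from the question: "Is there any way that I can use a different parser which might be less strict and allow UTF-8 characters?"
All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.
An XML document may start with a declaration like this: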
`<?xml version="1.0" encoding="UTF-8"?>`
or like this: `<?xml version="1.0"?>`
or have no declaration at all. In each case the parser will decode the document as UTF-8.
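The three cases above can be checked with a short Python 3 sketch. It assumes the standard-library `xml.etree.ElementTree` module; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: U+2019 (right single quotation mark) encoded as UTF-8.
body = "<root>can\u2019t</root>".encode("utf-8")

variants = [
    b'<?xml version="1.0" encoding="UTF-8"?>' + body,  # explicit UTF-8 declaration
    b'<?xml version="1.0"?>' + body,                   # declaration without an encoding
    body,                                              # no declaration at all
]

# Each variant is decoded as UTF-8, so all three yield the same text.
texts = [ET.fromstring(doc).text for doc in variants]
print(texts)
```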
However, your data is NOT encoded in UTF-8; it's probably Windows-1252, also known as cp1252.
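One way to see why such data fails as UTF-8 yet reads cleanly as cp1252 is to decode the suspicious bytes directly. This is a minimal Python 3 sketch; the byte string is a hypothetical sample, and a successful cp1252 decode only suggests the encoding, it does not prove it.

```python
import unicodedata

data = b"can\x92t"  # hypothetical sample bytes

try:
    data.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    # 0x92 is a continuation byte and cannot start a UTF-8 sequence.
    is_utf8 = False

# In cp1252, byte 0x92 maps to a typographic apostrophe.
as_cp1252 = data.decode("cp1252")
char_name = unicodedata.name(as_cp1252[3])
print(is_utf8, repr(as_cp1252), char_name)
```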
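If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one), or the recipient can transcode the data to UTF-8. The following shows what works and what doesn't: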
    # Python 2 interactive session
    >>> import xml.etree.ElementTree as ET
    >>> from StringIO import StringIO as sio
    >>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration
    >>> t = ET.parse(sio(raw_text))
    [tracebacks omitted]
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
    # parser is expecting UTF-8
    >>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
    # parser is expecting UTF-8 again
    >>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
    >>> t.getroot().text
    u'can\u2019t'
    # parser was told to expect cp1252; it works
    >>> import unicodedata
    >>> unicodedata.name(u'\u2019')
    'RIGHT SINGLE QUOTATION MARK'
    # not quite an apostrophe, but better than an exception
    >>> fixed_text = raw_text.decode('cp1252').encode('utf8')
    # alternative: we transcode the data to UTF-8
    >>> t = ET.parse(sio(fixed_text))
    >>> t.getroot().text
    u'can\u2019t'
    # UTF-8 is the default; no declaration needed
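The session above is Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the input arrives as `bytes`, uses `ET.fromstring` in place of `ET.parse` on a `StringIO`, and relies on the fact that recent versions raise `ET.ParseError` rather than a bare `ExpatError`. The helper `try_parse` is a name introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

raw = b"<root>can\x92t</root>"  # cp1252 bytes, no XML declaration


def try_parse(data):
    """Return the root element's text, or None if the parser rejects the data."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# No declaration, or a UTF-8 declaration: the parser expects UTF-8 and fails.
bare = try_parse(raw)
wrong_decl = try_parse(b'<?xml version="1.0" encoding="UTF-8"?>' + raw)

# Option 1: prepend a declaration that names the real encoding.
declared = try_parse(b'<?xml version="1.0" encoding="cp1252"?>' + raw)

# Option 2: transcode the bytes to UTF-8 before parsing.
transcoded = try_parse(raw.decode("cp1252").encode("utf-8"))

print(bare, wrong_decl, repr(declared), repr(transcoded))
```

Prepending a declaration leaves the bytes untouched, while transcoding produces data that any downstream consumer can treat as plain UTF-8; which one fits better depends on who else reads the file.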
Not European, we're definitely in the US. I didn't do it, I swear :) – Kekoa 2009-07-16 21:37:35