2016-07-28

XMLParser in Pharo claims U+00A0 is "Invalid UTF-8"

Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?> 
<sms body=". what" /> 

where the character after the . in the body attribute of the sms tag is U+00A0,

I get the error:

XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of that character is 0xC2 0xA0, per Wikipedia. Sure enough, input bytes 72 and 73 are 194 and 160 respectively.
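
For reference, the encoding can be checked independently of Pharo. A minimal sketch in Python (not part of the original setup; the variable names are only illustrative):

```python
# U+00A0 (NO-BREAK SPACE) encoded as UTF-8 should be the two bytes 0xC2 0xA0.
nbsp = "\u00a0"
encoded = nbsp.encode("utf-8")

# Show the bytes in hex and in decimal (194 and 160, as seen in the input file).
print([hex(b) for b in encoded])
print(list(encoded))
```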

This looks like a bug in XMLParser, or am I missing something?


Cannot reproduce: XMLDOMParser parse: '<?xml version=''1.0'' encoding=''UTF-8'' standalone=''yes''?> '

Answers


Thanks to monty, who came to the rescue on the Pharo User's list:

You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.

Longer explanation:

The class #on:/#parse: messages take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up at that point.

File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

The input starts with a BOM, or its encoding can be inferred from null bytes before or after the first non-null byte.

There is an encoding declaration with a non-UTF-8 encoding.

There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
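
The double decoding described above can be modelled outside Pharo. The following Python sketch is only an illustration of the mechanism, not XMLParser's actual code: it assumes the second decoder treats each already-decoded character as if it were a raw byte, which is one plausible way for the decoded value of U+00A0 to be rejected as invalid UTF-8.

```python
# Bytes on disk for U+00A0 in a UTF-8 file.
raw = b"\xc2\xa0"

# First decode: what a decoding file stream hands to the parser.
once = raw.decode("utf-8")  # the single character U+00A0

# Second decode (assumed model): each decoded character is treated as a byte.
# Code point 0xA0 becomes the lone byte 0xA0, which in UTF-8 is a
# continuation byte and cannot start a sequence.
as_bytes = bytes(ord(ch) for ch in once)

try:
    as_bytes.decode("utf-8")
    double_decode_failed = False
except UnicodeDecodeError as err:
    double_decode_failed = True
    print("second decode failed:", err.reason)
```

Decoding only once, either in the file stream or in the parser but not both, avoids the problem, which is what the two suggested fixes amount to.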