"Linguistic processing" in Unicode code point conversion?

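The MSDN documentation for Char.ConvertFromUtf32 states:

"A valid code point outside the Basic Multilingual Plane (BMP) always yields a valid surrogate pair. However, a valid code point within the BMP might not yield a valid result according to the Unicode standard because no linguistic processing is used in the conversion. For that reason, use the System.Text::UTF32Encoding class to convert bulk UTF-32 data into bulk UTF-16 data."

What is the "linguistic processing" mentioned above? Is there any case where Char.ConvertFromUtf32(i)[0] could give a different result from (char)i for a character in the BMP?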


2016-03-08 Douglas

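using System;
using System.Text;

// Compare three ways of turning a BMP code point into a UTF-16 char.
// The bound is inclusive so that U+FFFF is checked as well.
for (int i = 0; i <= 0xFFFF; i++)
{
    char ch1 = (char)i;

    // char.ConvertFromUtf32 throws for surrogate code points
    // (U+D800 to U+DFFF), so that range is skipped here.
    if (i < 0xD800 || i > 0xDFFF)
    {
        string str1 = char.ConvertFromUtf32(i);

        if (str1.Length != 1)
        {
            Console.WriteLine("\\u+{0:x4}: char.ConvertFromUtf32(i).Length = {1}", i, str1.Length);
        }

        char ch2 = str1[0];

        if (ch1 != ch2)
        {
            Console.WriteLine("\\u+{0:x4}: (char)i = 0x{1:x4}, char.ConvertFromUtf32(i)[0] = 0x{2:x4}", i, (int)ch1, (int)ch2);
        }
    }

    // BitConverter uses the machine's byte order; Encoding.UTF32 is
    // little-endian, so this assumes a little-endian machine.
    byte[] bytes = BitConverter.GetBytes(i);
    string str2 = Encoding.UTF32.GetString(bytes);

    if (str2.Length != 1)
    {
        Console.WriteLine("\\u+{0:x4}: Encoding.UTF32.GetString(bytes).Length = {1}", i, str2.Length);
    }

    char ch3 = str2[0];

    if (ch1 != ch3)
    {
        Console.WriteLine("\\u+{0:x4}: (char)i = 0x{1:x4}, Encoding.UTF32.GetString(bytes)[0] = 0x{2:x4}", i, (int)ch1, (int)ch3);
    }
}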

The only difference appears to be in the 0xd800 to 0xdfff range, where char.ConvertFromUtf32() throws an exception, while Encoding.UTF32.GetString() returns 0xfffd for the invalid character.
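To see that difference in isolation, here is a minimal sketch that feeds a single surrogate code point to both APIs. TryConvert and DecodeUtf32 are hypothetical helper names introduced only for this illustration; the bytes are built explicitly in little-endian order so the result does not depend on the machine's byte order.

```csharp
using System;
using System.Text;

// Hypothetical helper: returns the UTF-16 string for a code point,
// or null when char.ConvertFromUtf32 rejects it with an exception.
static string TryConvert(int codePoint)
{
    try
    {
        return char.ConvertFromUtf32(codePoint);
    }
    catch (ArgumentOutOfRangeException)
    {
        return null;
    }
}

// Hypothetical helper: decodes one UTF-32 value with Encoding.UTF32,
// writing the four bytes explicitly in little-endian order.
static string DecodeUtf32(int codePoint)
{
    byte[] bytes =
    {
        (byte)codePoint,
        (byte)(codePoint >> 8),
        (byte)(codePoint >> 16),
        (byte)(codePoint >> 24)
    };
    return Encoding.UTF32.GetString(bytes);
}

// U+D800 is a lone high surrogate, not a valid Unicode scalar value.
Console.WriteLine(TryConvert(0xD800) == null);       // rejected with an exception
Console.WriteLine(DecodeUtf32(0xD800) == "\uFFFD");  // replaced by U+FFFD

// An ordinary BMP code point gives the same result either way.
Console.WriteLine(TryConvert(0x00E9) == DecodeUtf32(0x00E9));
```

Neither behaviour involves anything language-specific: one API rejects the invalid input, the other substitutes the replacement character.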

In the reference source we can clearly see that there is no "special handling" of UTF32 characters.

if (iChar >= 0x10000) 
{ 
    *(chars++) = GetHighSurrogate(iChar); 
    iChar = GetLowSurrogate(iChar); 
} 

// Add the rest of the surrogate or our normal character 
*(chars++) = (char)iChar;

(I have omitted various lines of code that are irrelevant to the question here.)
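For reference, the arithmetic behind that excerpt can be sketched as follows. This is not the reference source itself: ToUtf16 is a hypothetical name, and the surrogate computation is the standard UTF-16 formula, which I am assuming is what GetHighSurrogate and GetLowSurrogate implement.

```csharp
using System;

// Hypothetical helper (not the reference source): encodes one Unicode
// code point as a UTF-16 string using the standard surrogate formula.
static string ToUtf16(int codePoint)
{
    bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
    if (codePoint < 0 || codePoint > 0x10FFFF || isSurrogate)
    {
        throw new ArgumentOutOfRangeException(nameof(codePoint));
    }

    if (codePoint < 0x10000)
    {
        // BMP code point: the UTF-16 code unit is simply the value itself.
        return ((char)codePoint).ToString();
    }

    // Supplementary code point: split the 20-bit offset into two halves.
    int offset = codePoint - 0x10000;
    char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
    char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
    return new string(new[] { high, low });
}

// Compare against the framework for one BMP and one supplementary code point.
Console.WriteLine(ToUtf16(0x00E9) == char.ConvertFromUtf32(0x00E9));
Console.WriteLine(ToUtf16(0x1F600) == char.ConvertFromUtf32(0x1F600));
```

For a BMP code point the whole conversion reduces to a cast, which matches the observation that (char)i and char.ConvertFromUtf32(i)[0] never differed in the test loop.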


2016-03-08 09:42:12 xanatos

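Thanks for writing the code to check! The range [U+D800 to U+DFFF](https://en.wikipedia.org/wiki/UTF-16#U.2BD800_to_U.2BDFFF) is reserved for surrogates, which are not valid as code points outside UTF-16, so the exception or fallback character is expected. If that is the only difference, I assume the MSDN documentation is wrong, presumably referring to some kind of Unicode normalization, which should not be part of code point conversion. – Douglas

@Douglas I even checked the reference source of `UTF32Encoding`, and there is no "special handling" there either. – xanatos

It's a pity that MSDN has removed the facility for adding comments to its documentation. Errors like this should be pointed out. – Douglas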
