How do I enter a 4-byte UTF-8 character?

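I am writing a small application that I need to test with UTF-8 characters of different byte lengths. How do I enter a character that takes 4 bytes in UTF-8?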

I can enter Unicode characters that are encoded in UTF-8 with 1, 2 and 3 bytes just fine by doing, for example:

string in = "pi = \u3a0";

But how do I get a Unicode character that is encoded with 4 bytes? I have tried:

string in = "aegan check mark = \u10102";

As far as I understand, that should output U+10102 (the Aegean check mark). But when I print it, I get ᴶ0

What am I missing?

Edit:

I got it to work by adding leading zeros:

string in = "\U00010102";

I wish I had thought of that sooner :)

Source

2008-10-15 Cactuar

What method are you using to print it? Is it Unicode-aware? – luke 2008-10-15 13:26:06

I am just using the terminal. The application's cout works fine with Unicode. – Cactuar 2008-10-15 14:02:27

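There is a longer form of the escape: \U followed by eight hex digits, rather than \u followed by four. The same form is also used in Python, among other languages: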

>>> '\xf0\x90\x84\x82'.decode("UTF-8") 
u'\U00010102'

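However, if you are using byte strings, why not escape each byte as above, rather than relying on the compiler to convert the escape into a UTF-8 string? That also seems to be more portable. If I compile the following program: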

#include <iostream> 
#include <string> 

int main() 
{ 
    std::cout << "narrow: " << std::string("\uFF0E").length() << 
     " utf8: " << std::string("\xEF\xBC\x8E").length() << 
     " wide: " << std::wstring(L"\uFF0E").length() << std::endl; 

    std::cout << "narrow: " << std::string("\U00010102").length() << 
     " utf8: " << std::string("\xF0\x90\x84\x82").length() << 
     " wide: " << std::wstring(L"\U00010102").length() << std::endl; 
}

On Win32, with my current options, cl gives:

warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)

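The compiler tries to convert every Unicode escape in a byte string to the system code page, which, unlike UTF-8, cannot represent all Unicode characters. Oddly, it has understood that \U00010102 is \uD800\uDD02 in UTF-16 (its internal Unicode representation) and mangled the escape in the warning message...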

When run, the program prints:

narrow: 2 utf8: 3 wide: 1 
narrow: 2 utf8: 4 wide: 2

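Note that the UTF-8 byte strings and the wide strings are correct, but the compiler failed to convert "\U00010102" and produced the byte string "??", which is an incorrect result.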

Source

2008-10-15 14:53:41
