ICU: iterating over code points

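My goal is to iterate over a Unicode text string character by character, but the code below iterates over code units instead of code points, even though I am using next32PostInc(), which is supposed to iterate over code points: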

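#include <iostream>
#include <string>
#include <unicode/unistr.h>   // UnicodeString
#include <unicode/uchriter.h> // UCharCharacterIterator
#include <unicode/ustring.h>  // u_strlen

void iterate_codepoints(UCharCharacterIterator &it, std::string &str) { 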
    UChar32 c; 
    while (it.hasNext()) { 
     c = it.next32PostInc(); 
     str += c; 
    } 
} 

void my_test() { 
    const char testChars[] = "\xE6\x96\xAF"; // Chinese character 斯 in UTF-8 
    UnicodeString testString(testChars, ""); 
    const UChar *testText = testString.getTerminatedBuffer(); 

    UCharCharacterIterator iter(testText, u_strlen(testText)); 

    std::string str; 
    iterate_codepoints(iter, str); 
    std::cout << str; // outputs 斯 in UTF-8 format 
} 


int main() { 
    my_test(); 
    return 0; 
}

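The code above produces the correct output, the Chinese character 斯, but 3 iterations occur for this single character instead of just 1. Can someone explain what I am doing wrong?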

In short, I just want to iterate over the characters in a loop, and I am happy to use whichever ICU iterator class is needed.

Still trying to solve this...

I have also observed some bad behavior when using UnicodeString, as shown below. I am using VC++ 2013.

void test_02() { 
    // UnicodeString us = "abc 123 ñ";  // results in good UTF-8: 61 62 63 20 31 32 33 20 c3 b1 
    // UnicodeString us = "斯";    // results in bad UTF-8: 3f 
    // UnicodeString us = "abc 123 ñ 斯"; // results in bad UTF-8: 61 62 63 20 31 32 33 20 c3 b1 20 3f (only the last part '3f' is corrupt) 
    // UnicodeString us = "\xE6\x96\xAF"; // results in bad UTF-8: 00 55 24 04 c4 00 24 
    // UnicodeString us = "\x61";   // results in good UTF-8: 61 
    // UnicodeString us = "\x61\x62\x63"; // results in good UTF-8: 61 62 63 
    // UnicodeString us = "\xC3\xB1";  // results in bad UTF-8: c3 83 c2 b1 
    UnicodeString us = "ñ";     // results in good UTF-8: c3 b1  
    std::string cs; 
    us.toUTF8String(cs); 
    std::cout << cs; // output result to file, i.e.: main >output.txt

}



2014-10-19 Caroline Beltran

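Passing a 'char*' by itself to the 'UnicodeString' constructor is subject to the platform's default codepage. 'ñ' is within the limits of the source code's charset, but '斯' cannot be represented in 8 bits. Is your source code UTF-8? That would explain your bad conversions. You will have to use a 'UnicodeString' constructor that lets you specify that the source data is UTF-8, so that it can be converted correctly. – 2014-10-20 22:45:12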

Yes, my source is in UTF-8 format. – 2014-10-21 00:19:23

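Since the source data is UTF-8, you need to tell UnicodeString that. Its constructor has a codepage parameter for that purpose, but you are setting it to an empty string: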

UnicodeString testString(testChars, "");

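This tells UnicodeString to perform an invariant conversion, which is not what you want. You end up with 3 code points (U+00E6 U+0096 U+00AF) instead of 1 code point (U+65AF), which is why your loop iterates three times.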
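To make the three-versus-one count concrete, here is a small standalone sketch in plain standard C++ (it does not use ICU). It contrasts widening each byte to its own code point with decoding the same bytes as UTF-8. The function names bytes_as_code_points and decode_utf8 are hypothetical, and the decoder assumes well-formed input and does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Widen every byte to its own code point. This mimics what happens
// when UTF-8 bytes are NOT decoded as UTF-8.
std::vector<char32_t> bytes_as_code_points(const std::string &bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Minimal UTF-8 decoder (assumption: input is well-formed UTF-8).
std::vector<char32_t> decode_utf8(const std::string &bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)             { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x6) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0xE) { cp = lead & 0x0F; extra = 2; }
        else                         { cp = lead & 0x07; extra = 3; }
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For the bytes E6 96 AF, bytes_as_code_points returns the three values U+00E6, U+0096, U+00AF, while decode_utf8 returns the single value U+65AF.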

You need to change your constructor call to let UnicodeString know that the data is UTF-8, for example:

UnicodeString testString(testChars, "utf-8");
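One follow-on effect to expect: once the string is decoded as UTF-8, next32PostInc() yields U+65AF, and the `str += c` line in the question's loop appends only a single char to the std::string, so the output is no longer valid UTF-8. Each code point has to be re-encoded. Within ICU, toUTF8String() (used in the question's second snippet) converts a whole UnicodeString. The sketch below shows the per-code-point encoding in plain standard C++. The helper name append_utf8 is hypothetical, and it assumes the value passed in is a valid Unicode scalar value.

```cpp
#include <string>

// Append one code point to `out` as UTF-8.
// Assumption: cp is a valid Unicode scalar value (no surrogate or range checks).
void append_utf8(std::string &out, char32_t cp) {
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

In the question's loop, calling a helper like this in place of `str += c` would write the three bytes E6 96 AF for the single code point U+65AF.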


2014-10-20 23:05:43

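Wow, thanks Remy, this is something I had not even considered. I will experiment with your suggestion and hopefully solve my problem before accepting. – 2014-10-21 00:17:46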
