Comparing Unicode strings in C returns different values than C#

I am trying to write a comparison function in C that takes UTF-8 encoded Unicode strings and uses the Windows CompareStringEx() function, and I expect it to behave just like .NET's CultureInfo.CompareInfo.Compare().

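The function I wrote in C works some of the time, but not in every case, and I am trying to figure out why. Here is a case that fails (it passes in C#, but not in C):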

CultureInfo cultureInfo = new CultureInfo("en-US"); 
CompareOptions compareOptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth; 

string stringA = "คนอ้วน ๆ"; 
string stringB = "はじめまして"; 
//Result is -1 which is expected 
int result = cultureInfo.CompareInfo.Compare(stringA, stringB, compareOptions);

Here is what I wrote in C. Keep in mind that it is supposed to take UTF-8 encoded strings and use the Windows CompareStringEx() function, so a conversion is necessary.

// Compare flags for the string comparison 
#define COMPARE_STRING_FLAGS (NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH) 

int CompareStrings(int lenA, const void *strA, int lenB, const void *strB) 
{ 
    LCID ENGLISH_LCID = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT); 
    int compareValue = -1; 

    // Get the size of the strings as UTF-16 encoded Unicode strings. 
    // Note: Passing 0 as the last parameter forces the MultiByteToWideChar function 
    // to give us the required buffer size to convert the given string to UTF-16 
    int strAWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, NULL, 0); 
    int strBWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, NULL, 0); 

    // Malloc the strings to store the converted UTF-16 values 
    LPWSTR utf16StrA = (LPWSTR) GlobalAlloc(GMEM_FIXED, strAWStrBufferSize * sizeof(WCHAR)); 
    LPWSTR utf16StrB = (LPWSTR) GlobalAlloc(GMEM_FIXED, strBWStrBufferSize * sizeof(WCHAR)); 

    // Convert the UTF-8 strings (SQLite will pass them as UTF-8 to us) to standard 
    // Windows WCHAR (UTF-16/UCS-2) encoding for Unicode so they can be used in the 
    // Windows CompareStringEx() function. 
    if(strAWStrBufferSize != 0) 
    { 
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, utf16StrA, strAWStrBufferSize); 
    } 
    if(strBWStrBufferSize != 0) 
    { 
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, utf16StrB, strBWStrBufferSize); 
    } 

    // Compare the strings using the windows compare function. 
    // Note: We subtract 1 from the size since we don't want to include the null termination character 
    if(NULL != utf16StrA && NULL != utf16StrB) 
    { 
        compareValue = CompareStringEx(L"en-US", COMPARE_STRING_FLAGS, utf16StrA, strAWStrBufferSize - 1, utf16StrB, strBWStrBufferSize - 1, NULL, NULL, 0); 
    } 

    // In the Windows CompareStringEx() function, 0 indicates an error, 1 indicates less than, 
    // 2 indicates equal to, 3 indicates greater than so subtract 2 to maintain C convention 
    if(compareValue > 0) 
    { 
        compareValue -= 2; 
    } 

    return compareValue; 
}

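Now, if I run the following code, I expect the result to be -1 based on the .NET implementation (see above), but I get 1, indicating that the first string is greater: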

char strA[50] = "คนอ้วน ๆ"; 
char strB[50] = "はじめまして"; 

// Will be 1 when we expect it to be -1 
int result = CompareStrings(strlen(strA), strA, strlen(strB), strB);

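Any ideas why I am getting different results? I am using the same LCID/CultureInfo and compare options in both implementations, and as far as I can tell the conversions succeed.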

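FYI: this function will be used as a custom collation in SQLite. That is not relevant to the question, but it explains the function signature in case anyone is wondering.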

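Update: I have also determined that when the same code runs on .NET 4, I see the same behavior as in the native code. So the discrepancy is actually between .NET versions. See my answer below for the reason behind it.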

2011-09-07 Ian Dallas

This is unrelated to your question, but you are leaking memory. – dalle

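Could it be that 'CompareStrings' treats the bytes as characters in some 8-bit code page and applies collation rules instead of comparing byte values? I would expect that kind of broken behavior from Windows... –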

@dalle: How am I leaking memory? Any enhancements to this code would be appreciated. –
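Regarding the leak dalle mentions: the two GlobalAlloc buffers in CompareStrings are never released. A minimal sketch of one possible cleanup follows, assuming the same locale and flags as the question. The names cstr_to_order, Utf8ToUtf16 and CompareStringsFixed are hypothetical, as is the choice to return -1 when a conversion or the comparison fails. The sketch also passes the full converted length to CompareStringEx: when MultiByteToWideChar is given an explicit byte count that excludes the terminator, the count it returns excludes it too, so subtracting 1 would drop the last code unit. The Windows-only part is guarded by _WIN32 so the file still compiles elsewhere.

```c
#include <stddef.h>

/* Map CompareStringEx's result (1 = less, 2 = equal, 3 = greater) to the
   usual -1/0/1 convention. Any other value (0 means failure) yields the
   caller-supplied fallback. The fallback policy is an assumption. */
int cstr_to_order(int cstr, int fallback)
{
    return (cstr >= 1 && cstr <= 3) ? cstr - 2 : fallback;
}

#ifdef _WIN32
#include <windows.h>

#define COMPARE_STRING_FLAGS (NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH)

/* Convert len bytes of UTF-8 into a newly allocated UTF-16 buffer.
   Returns NULL on failure. *outLen receives the number of UTF-16 code
   units, which does not include a terminator because len does not. */
static LPWSTR Utf8ToUtf16(const void *str, int len, int *outLen)
{
    LPWSTR buf;
    *outLen = 0;
    if (len > 0) {
        *outLen = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)str, len, NULL, 0);
        if (*outLen == 0) {
            return NULL; /* conversion failed */
        }
    }
    /* +1 so that an empty input still gets a non-empty allocation */
    buf = (LPWSTR)GlobalAlloc(GMEM_FIXED, ((SIZE_T)*outLen + 1) * sizeof(WCHAR));
    if (buf != NULL && *outLen > 0) {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)str, len, buf, *outLen);
    }
    return buf;
}

int CompareStringsFixed(int lenA, const void *strA, int lenB, const void *strB)
{
    int wideLenA, wideLenB, cstr = 0;
    LPWSTR a = Utf8ToUtf16(strA, lenA, &wideLenA);
    LPWSTR b = Utf8ToUtf16(strB, lenB, &wideLenB);

    if (a != NULL && b != NULL) {
        /* Pass the full converted lengths; nothing is subtracted. */
        cstr = CompareStringEx(L"en-US", COMPARE_STRING_FLAGS,
                               a, wideLenA, b, wideLenB, NULL, NULL, 0);
    }

    /* Release both buffers on every path; this is the missing step. */
    if (a != NULL) GlobalFree((HGLOBAL)a);
    if (b != NULL) GlobalFree((HGLOBAL)b);

    return cstr_to_order(cstr, -1);
}
#endif
```

This does not change which sorting rules Windows applies, so it would not by itself make the result match .NET 3.5.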

I finally found the cause after contacting Microsoft support. Here is what they had to say about the issue:

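The reason for the issue you are seeing, namely running CompareInfo.Compare against the same strings with the same compare options but getting different return values under different versions of the .NET Framework, is that the sorting rules are tied to the Unicode specification, which evolves over time. Historically, .NET has snapped the data for side-by-side releases to correspond with the newest version of Windows and the corresponding version of Unicode implemented at that time, so 2.0, 3.0 and 3.5 correspond to the versions in Windows XP or Server 2003, whereas v4.0 matches the Vista sorting rules. As a result, the sorting rules for the various versions of the .NET Framework have changed over time.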

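This also means that when I ran the native code I was calling sorting methods that adhere to the Vista sorting rules, and when I ran on .NET 3.5 I was calling sorting methods that use the Windows XP sorting rules. It seems odd to me that the Unicode specification would change in a way that causes such a drastic difference, but apparently that is the case here. It seems to me that changing the Unicode specification this drastically is a great way to break backwards compatibility.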

2011-10-17 22:13:41

Well, your code performs several steps here, and it is not clear that the comparison step is the one that is failing.

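As a first step, I would write out, in both the .NET code and the C code, the exact UTF-16 code units you have in utf16StrA, utf16StrB, stringA and stringB. I would not be at all surprised to find that the problem lies in the input data used in the C code.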
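One way to do that dump on the C side is a small helper that formats a UTF-16 buffer as hex, so the output can be compared against the same dump from .NET. This is only a sketch: format_utf16_hex is a hypothetical name, and it takes uint16_t so it stays portable (on Windows a WCHAR buffer such as utf16StrA holds 16-bit units and could be passed with a cast).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write the code units as space-separated 4-digit hex values into out.
   Returns the number of characters written (excluding the terminator),
   or -1 if out is too small to hold the whole dump. */
int format_utf16_hex(const uint16_t *units, size_t count, char *out, size_t outSize)
{
    size_t used = 0;
    size_t i;

    if (outSize == 0) {
        return -1;
    }
    out[0] = '\0';

    for (i = 0; i < count; i++) {
        /* 4 hex digits per unit, plus a separating space after the first */
        size_t need = (i == 0) ? 4 : 5;
        if (used + need + 1 > outSize) {
            return -1;
        }
        snprintf(out + used, outSize - used,
                 (i == 0) ? "%04X" : " %04X", (unsigned)units[i]);
        used += need;
    }
    return (int)used;
}
```

If the C dump does not match the .NET dump unit for unit, the problem is in the input or the conversion rather than in the comparison.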

2011-09-07 19:05:38

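Thanks for the reply. Is there any way to explicitly declare UTF-8 strings in C? I believe you can get wchar_t strings (which are UCS-2 or UTF-16) by prefixing the string with an L. –

@tkeE2036: I don't know, to be honest. I wouldn't expect it to be part of the type you use, but rather a compiler switch that determines which encoding you are using. –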

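No, strings in C cannot be declared as UTF-8. C strings are just null-terminated sequences of bytes. The encoding of those sequences is up to the program or library. Typically a string literal contains the literal bytes from the source code, so the string will be in the source encoding. C++11 does have a 'u8' prefix, which tells the compiler to convert the text from the source encoding to UTF-8. – bames53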
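Because a C string is just bytes, one way to take the source encoding out of the picture is to spell the UTF-8 bytes out with hex escapes. The sketch below does this for strB; the byte values are my own encoding of the code points listed in the comment, and kStrB and utf8_code_points are hypothetical names.

```c
#include <stddef.h>

/* strB from the question as explicit UTF-8 bytes:
   U+306F U+3058 U+3081 U+307E U+3057 U+3066.
   Each character is its own literal so no hex escape runs into the next. */
static const char kStrB[] =
    "\xE3\x81\xAF" "\xE3\x81\x98" "\xE3\x82\x81"
    "\xE3\x81\xBE" "\xE3\x81\x97" "\xE3\x81\xA6";

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

The literal is 18 bytes long but holds 6 code points, which is the distinction that matters when passing strlen(kStrB) as the byte count to CompareStrings.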

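What you are hoping for here is that your text editor saves the source file in UTF-8, and that the compiler will then somehow not interpret the source code as UTF-8. That is too much to hope for, at least with my compiler: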

warning C4566: character represented by universal-character-name '\u0E04' cannot be represented in the current code page (1252)

Fix:

const wchar_t* strA = L"คนอ้วน ๆ"; 
const wchar_t* strB = L"はじめまして";

and remove the conversion code.
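If the source file's encoding is itself in doubt, the same wide literals can be written with universal character names so the file stays pure ASCII. This is a sketch only: the code points are my reading of the two strings in the question (assuming the space in strA is a plain U+0020), and kWideA and kWideB are hypothetical names.

```c
#include <wchar.h>

/* strA: U+0E04 U+0E19 U+0E2D U+0E49 U+0E27 U+0E19, a space, U+0E46 */
static const wchar_t kWideA[] = L"\u0E04\u0E19\u0E2D\u0E49\u0E27\u0E19 \u0E46";

/* strB: U+306F U+3058 U+3081 U+307E U+3057 U+3066 */
static const wchar_t kWideB[] = L"\u306F\u3058\u3081\u307E\u3057\u3066";
```

All of these code points are in the Basic Multilingual Plane, so each one occupies a single wchar_t whether wchar_t is 16 or 32 bits wide.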

2011-09-07 19:18:39

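When I saved the file with the new Unicode characters, Visual Studio asked me which code page I wanted to associate with the document, so I may be fine there. I also need strA and strB to be represented as UTF-8 and then converted to wchar_t, because that is what SQLite works with. –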

@tkeE2036: Which code page did you choose? Visual Studio has problems with UTF-8 encoded source files. – dalle

@dalle: It was the UTF-8 code page. Isn't that what I want? –
