Sorting UTF-8 strings?

My std::strings are encoded in UTF-8, so std::string's operator< doesn't cut it. How can I compare two UTF-8 encoded std::strings?

Where it doesn't cut it is accents: É is sorted after Z, and it shouldn't be.

Thanks

2011-01-06 jmasterx

Why doesn't the standard `operator<` "cut it"? What kind of order do you want? – 2011-01-06 02:45:52

UTF-8 encoded strings sort in the same order as the equivalent UTF-32 encoded strings. – dan04 2011-01-06 02:46:32

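If you don't want a lexicographic sort (which is what comparing UTF-8 encoded strings byte by byte gives you), then you will need to decode your UTF-8 encoded strings to UCS-2 or UCS-4, as appropriate, and apply a suitable comparison function of your choosing.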

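To reiterate the point: the UTF-8 encoding is cleverly designed so that if you sort by the numeric value of each 8-bit encoded byte, you get the same result as if you had first decoded the strings into Unicode and compared the numeric values of the code points.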
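As a rough illustration of that property, here is a small sketch that compares the same texts once as UTF-8 bytes and once as UTF-32 code points. The helper names `same_order` and `z_before_e_acute` are made up for this example, and the UTF-8 bytes are written out as hex escapes so the result does not depend on the source file encoding.

```cpp
#include <string>

// Hypothetical helper: true when the byte-wise order of two UTF-8 strings
// agrees with the code-point order of the same texts given as UTF-32.
bool same_order(const std::string& a8, const std::string& b8,
                const std::u32string& a32, const std::u32string& b32) {
    // std::string::operator< compares bytes as unsigned char values;
    // std::u32string::operator< compares whole code points.
    return (a8 < b8) == (a32 < b32) && (b8 < a8) == (b32 < a32);
}

// "z" (U+007A) sorts before "é" (U+00E9) under plain operator<, which is
// exactly the behaviour the question complains about.
bool z_before_e_acute() {
    const std::string z = "z";
    const std::string e_acute = "\xC3\xA9";  // UTF-8 bytes of U+00E9
    return z < e_acute;
}
```

For instance, `same_order("z", "\xC3\xA9", U"z", U"\u00E9")` holds, and so does the same check for longer sequences such as U+20AC versus U+1F600. So plain `operator<` is already a consistent code-point sort; it just isn't a linguistic one.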

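Update: Your updated question indicates that you want a more sophisticated comparison function than a purely lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.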
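To make "decode and compare" a bit more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `decode_utf8` expects well-formed input and does no validation, and `base_letter` is a tiny hand-written table covering only a few accented letters. It is a toy stand-in for real collation data, not a replacement for a locale or a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into code points. Sketch only: assumes well-formed input.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical toy table: fold a few accented letters onto their base letter.
char32_t base_letter(char32_t c) {
    switch (c) {
        case U'\u00E0': case U'\u00E1': case U'\u00E2': return U'a';
        case U'\u00E8': case U'\u00E9': case U'\u00EA': case U'\u00EB': return U'e';
        default: return c;
    }
}

// Compare by base letters first, then break ties by raw code points.
bool toy_less(const std::string& a, const std::string& b) {
    const std::u32string da = decode_utf8(a), db = decode_utf8(b);
    std::u32string ka, kb;
    for (char32_t c : da) ka.push_back(base_letter(c));
    for (char32_t c : db) kb.push_back(base_letter(c));
    if (ka != kb) return ka < kb;
    return da < db;
}
```

Passing `toy_less` as the comparator to `std::sort` would place "é" between "e" and "f" instead of after "z". Anything beyond a demo should get its ordering rules from proper collation data rather than a hand-rolled table like this.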

2011-01-06 02:52:11

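The encoding (UTF-8, UTF-16, etc.) isn't the problem; what matters is whether the container itself treats the string as a Unicode string or as an 8-bit (ASCII or Latin-1) string.

I found "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which may help you.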

2011-01-06 02:59:59

The standard has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>

// Less-than comparator that orders strings by a locale's collation rules.
// (std::binary_function was removed in C++17 and is not needed here.)
class collate_in {
    // Keep a copy of the locale so the facet reference below stays valid.
    std::locale loc;
    const std::collate<char> &coll;
public:
    explicit collate_in(const std::locale &l)
        : loc(l), coll(std::use_facet<std::collate<char> >(loc)) {}
    bool operator()(const std::string &a, const std::string &b) const {
        // std::collate::compare() takes C-style string (begin, end)s and
        // returns values like strcmp or strcoll. Compare to 0 for results
        // expected for a less<>-style comparator.
        return coll.compare(a.c_str(), a.c_str() + a.size(),
                            b.c_str(), b.c_str() + b.size()) < 0;
    }
};

int main() {
    std::vector<std::string> v;
    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    // std::locale("") is the locale from the environment. One could also
    // std::locale::global(std::locale("")) to set up this program's global
    // first, and then use locale() to get the global locale, or choose a
    // specific locale instead of using the environment's.
    std::sort(v.begin(), v.end(), collate_in(std::locale("")));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

 
$ cat >file
f
é
e
d
^D
$ LC_COLLATE=C ./a.out <file
d
e
f
é
$ LC_COLLATE=en_US.utf8 ./a.out <file
d
e
é
f

It has come to my attention that std::locale::operator()(a, b) exists, which avoids the std::collate<>::compare(a, b) < 0 wrapper I wrote above.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> v;
    std::copy(std::istream_iterator<std::string>(std::cin),
              std::istream_iterator<std::string>(), std::back_inserter(v));
    // A std::locale object is itself usable as a less-than comparator.
    std::sort(v.begin(), v.end(), std::locale(""));
    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

2011-01-06 06:16:24 ephemient

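One option is to use the ICU collator (http://userguide.icu-project.org/collation/api), which provides a properly internationalized "compare" method that you can then use for sorting.

Chromium has a small wrapper around it that should be easy to copy-paste and reuse: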

https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs

2015-11-04 09:08:43
