C++: std::string problem

I have this simple code:

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main(void)
{
    ifstream in("file.txt");
    string line;
    while (getline(in, line))
    {
        // at(0) returns the first char (byte) of the line, not the first displayed character
        cout << line << " starts with char: " << line.at(0) << " " << (int) line.at(0) << endl;
    }
    in.close();
    return 0;
}

It prints:

0.000000 0.000000 0.010909 0.200000 starts with char: 32 
A 0.023636 0.000000 0.014545 0.200000 starts with char: A 65 
B 0.050909 0.000000 0.014545 0.200000 starts with char: B 66 
C 0.078182 0.000000 0.014545 0.200000 starts with char: C 67 

... 

, 0.152727 0.400000 0.003636 0.200000 starts with char: , 44 
< 0.169091 0.400000 0.005455 0.200000 starts with char: < 60 
. 0.187273 0.400000 0.003636 0.200000 starts with char: . 46 
> 0.203636 0.400000 0.005455 0.200000 starts with char: > 62 
/ 0.221818 0.400000 0.010909 0.200000 starts with char: / 47
? 0.245455 0.400000 0.009091 0.200000 starts with char: ? 63 
¡ 0.267273 0.400000 0.005455 0.200000 starts with char: � -62 
£ 0.285455 0.400000 0.012727 0.200000 starts with char: � -62 
¥ 0.310909 0.400000 0.012727 0.200000 starts with char: � -62 
§ 0.336364 0.400000 0.009091 0.200000 starts with char: � -62 
© 0.358182 0.400000 0.016364 0.200000 starts with char: � -62 
® 0.387273 0.400000 0.018182 0.200000 starts with char: � -62 
¿ 0.418182 0.400000 0.009091 0.200000 starts with char: � -62 
À 0.440000 0.400000 0.012727 0.200000 starts with char: � -61 
Á 0.465455 0.400000 0.014545 0.200000 starts with char: � -61

Weird... How can I get the real first character of the string?

Thanks in advance!


2010-08-14 Martijn Courteaux

@Martjn: When you *tag* your question with C++, there is no need to put pseudo-tags such as "C++:" in the title. – dmckee 2010-08-16 15:37:04

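You are getting the first character in the string.

But it looks like the string is a UTF-8 string (or possibly some other multi-byte character format).

This means that each symbol (glyph) that gets printed is made up of one or more `char`s.
If it is UTF-8, then any character outside the ASCII range (0-127) is actually made up of two or more `char`s, and the string printing code interprets this correctly. The character printing code, however, cannot possibly decode a single `char` whose value is above 127. On a platform where `char` is signed, that is also why the cast to `int` shows -62: it is the lead byte 0xC2 (194) read as a signed value.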
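To make the byte-level picture concrete, here is a minimal sketch (not part of the original answer) that reads the length of the first UTF-8 sequence from its lead byte and returns those bytes as a substring. The helper names `utf8SequenceLength` and `firstUtf8Char` are made up for illustration. The code only checks bit patterns and assumes otherwise well-formed UTF-8; it does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// judged by the bit pattern of the lead byte only.
// Returns 0 for a continuation byte (10xxxxxx) or an unrecognised pattern.
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Bytes of the first encoded character of `s`. Returns an empty string if
// `s` is empty, starts with an unrecognised lead byte, or is truncated.
std::string firstUtf8Char(const std::string& s) {
    if (s.empty()) return std::string();
    // Cast to unsigned char first: plain char may be signed, which is why
    // the byte 0xC2 showed up as -62 in the question's output.
    const std::size_t n = utf8SequenceLength(static_cast<unsigned char>(s[0]));
    if (n == 0 || n > s.size()) return std::string();
    return s.substr(0, n);
}
```

Called on the line that begins with `¡`, `firstUtf8Char` would return the two bytes 0xC2 0xA1, which a UTF-8 terminal displays as `¡`, whereas `line.at(0)` yields only the first of those bytes.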

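Personally, I think variable-width character formats are not a good idea inside a program (they are fine for transport and storage), because they make string manipulation much more complex. I would suggest converting the string to a fixed-width format for internal processing and then converting it back to UTF-8 for storage. I would use UTF-16 internally (or UTF-32, depending on what wchar_t is). Yes, I know that technically UTF-16 is not fixed width, but for most practical purposes it behaves as if it were; once less common scripts are involved, we may need UTF-32. You just need to imbue the input/output stream with the appropriate codecvt facet for automatic translation. Internally, the code can then work on single characters using the type wchar_t.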
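As a rough illustration of the fixed-width idea, the sketch below decodes UTF-8 into a `std::u32string` (C++11) so that indexing yields whole code points. It uses a hand-written decoder in place of the `codecvt` facet mentioned above, and `char32_t` in place of `wchar_t`. The function name `decodeUtf8` is hypothetical, and malformed input is replaced with U+FFFD rather than rigorously validated (overlong forms and surrogates are not rejected).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-32, one char32_t per code point.
// Malformed or truncated sequences become U+FFFD. This is a sketch,
// not a fully validating decoder.
std::u32string decodeUtf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte gives the sequence length and the top payload bits.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                ok = false;                    // not a continuation byte
            } else {
                cp = (cp << 6) | (c & 0x3F);   // append 6 payload bits
            }
        }
        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(static_cast<char32_t>(0xFFFD));
            i += 1;                            // resynchronise on the next byte
        }
    }
    return out;
}
```

With this, `decodeUtf8(line)[0]` for the line that begins with `¡` is U+00A1 (161), while `line.at(0)` is just the lead byte 0xC2. `char32_t` is used here because the width of `wchar_t` differs between platforms (16 bits on Windows, 32 bits on most Unix-like systems).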


2010-08-14 15:29:20

This may help too: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring – celavek 2010-08-14 15:47:42

Could you please give an example of using the `codecvt` facet to convert from UTF-8 to `wchar_t`? – Philipp 2010-08-14 17:14:40

Found one in Boost, although it looks like test code: http://beta.boost.org/doc/libs/1_35_0/libs/serialization/doc/codecvt.html – 2010-08-14 18:59:23

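I think the last characters belong to the extended ASCII table, something that C++ does not support.

ASCII Table

EDIT1: No. From a quick look, the characters at the bottom do not seem to be in extended ASCII either. Maybe have a look at what Martin York said.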


2010-08-14 15:27:47 Muggen

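A string is a container of char, which is only one byte. It should only be used for ASCII strings or binary data. Anything else should use Unicode, with wstring, a container of wchar_t.

However, the problem of how your Unicode text is encoded remains; for that, see the answers above.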


2010-08-14 16:51:14 user420483

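`std::string` can store Unicode strings if an appropriate encoding (such as UTF-8) is used. Unicode is not an encoding. – Philipp 2010-08-14 16:59:19

Even though it is possible, it is not really great, since you cannot use [0] reliably. What is the point of building abstractions (chars instead of bytes) if you use them incorrectly? – user420483 2010-08-14 17:19:26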

The file is UTF-8 encoded. Use a Unicode library such as ICU to get access to the code points:

#include <fstream>
#include <iostream>
#include <string>
#include <utility>

#include "unicode/utf.h"

using namespace std;

// Decode the first code point of a UTF-8 string.
// Returns the code point (negative if the sequence is ill-formed)
// and the number of bytes that were consumed.
pair<UChar32, int32_t>
getFirstUTF8CodePoint(const string& str) {
    const uint8_t* ptr = reinterpret_cast<const uint8_t*>(str.data());
    const int32_t length = static_cast<int32_t>(str.length());
    int32_t offset = 0;
    UChar32 cp = 0;
    U8_NEXT(ptr, offset, length, cp); // advances offset past the first code point
    return make_pair(cp, offset);
}

int main(void)
{
    ifstream in("file.txt");
    string line;
    while (getline(in, line))
    {
        const pair<UChar32, int32_t> cp = getFirstUTF8CodePoint(line);
        // substr(0, cp.second) is the complete byte sequence of the first character
        cout << line << " starts with char: " << line.substr(0, cp.second) << " " << static_cast<unsigned long>(cp.first) << endl;
    }
    in.close();
    return 0;
}


2010-08-14 16:57:51 Philipp
