2008-10-16 67 views
11

Writing UTF-16 to a file in binary mode

I'm trying to write a wstring to a file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:

ofstream outFile("test.txt", std::ios::out | std::ios::binary); 
wstring hello = L"hello"; 
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t)); 
outFile.close(); 

Opening test.txt in, for example, Firefox with the encoding set to UTF-16, it shows up as:

h�e�l�l�o�

Could anyone tell me why this happens?

EDIT:

Opening the file in a hex editor, I get:

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00 

It looks like I get two extra bytes between every character for some reason?

+0

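Added an answer about the locale facet associated with the stream, which does the conversion from wchar_t to the correct output. See below. – 2008-10-16 13:01:42

Answers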

6

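I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)), but I'm pretty sure that's what's going on.

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, because surrogate pairs come into play.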
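As a rough illustration of that conversion, the sketch below encodes a wstring as UTF-16LE bytes by hand and splits code points above U+FFFF into surrogate pairs. The name `to_utf16le` is made up for this example. It assumes each wchar_t holds either a full code point (4-byte wchar_t) or an already valid UTF-16 code unit (2-byte wchar_t), and it does not validate its input.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not from the original post): encode a wstring as
// UTF-16LE bytes. Assumes each wchar_t is a Unicode code point or a UTF-16
// code unit; no validation is performed.
std::string to_utf16le(const std::wstring& in)
{
    std::string out;

    // Append one 16-bit code unit, low byte first (little-endian).
    auto put16 = [&out](std::uint32_t unit) {
        out.push_back(static_cast<char>(unit & 0xFF));
        out.push_back(static_cast<char>((unit >> 8) & 0xFF));
    };

    for (wchar_t wc : in) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp > 0xFFFF) {
            // Outside the BMP: split into a high and a low surrogate.
            cp -= 0x10000;
            put16(0xD800 + (cp >> 10));
            put16(0xDC00 + (cp & 0x3FF));
        } else {
            put16(cp);
        }
    }
    return out;
}
```

The resulting bytes could then be written with something like outFile.write(bytes.data(), bytes.size()) on a stream opened in binary mode.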

+0

Yes, you're right, wchar_t has size 4; I'm on a Mac. So that explains a lot :) I'm aware of surrogate pairs in UTF-16 and will have to look into this further. – Cactuar 2008-10-16 08:01:10

+0

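From the output you can't tell whether it is UTF-16 or UTF-32; all it shows is that wchar_t is 4 bytes wide. The encoding of the string is not defined by the language (though it is most likely UCS-4). – 2008-10-16 13:10:30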

0

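You should look at the output file in a hex editor such as WinHex so you can see the actual bits and bytes and verify that the output really is UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.

But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes per character (four for characters outside the Basic Multilingual Plane). But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally use just 1 byte per character.

When you say "Firefox with encoding set to UTF16", what do you mean? I'm skeptical that this would work.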
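If no hex editor is at hand, a few lines of C++ can do the same job. This is only a sketch; the helper names `to_hex` and `hex_dump_file` are invented for the example and are not part of any library.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: format raw bytes as space-separated hex pairs.
std::string to_hex(const std::string& bytes)
{
    std::string out;
    char buf[4];  // two hex digits, a space, and the terminating NUL
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02X ", static_cast<unsigned>(b));
        out += buf;
    }
    if (!out.empty())
        out.pop_back();  // drop the trailing space
    return out;
}

// Hypothetical helper: read a whole file in binary mode and return its
// bytes as hex text. An unreadable file yields an empty string.
std::string hex_dump_file(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return to_hex(bytes);
}
```

Printing hex_dump_file("test.txt") should show at a glance whether each character occupies two bytes or four.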

14

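Here we run into the little-used locale facets. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion auto-magically.

N.B. This code does not take the endianness of the wchar_t characters into account.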

#include <locale> 
#include <fstream> 
#include <iostream> 
// See Below for the facet 
#include "UTF16Facet.h" 

int main(int argc,char* argv[]) 
{ 
    // Construct a custom Unicode facet and add it to a locale. 
    UTF16Facet *unicodeFacet = new UTF16Facet(); 
    const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet); 

    // Create a stream and imbue it with the facet 
    std::wofstream saveFile; 
    saveFile.imbue(unicodeLocale); 


    // Now the stream is imbued we can open it. 
    // NB: if you open the file stream first, any attempt to imbue it with a locale will silently fail. 
    saveFile.open("output.uni"); 
    saveFile << L"This is my Data\n"; 


    return(0); 
}  

The file: UTF16Facet.h

#include <locale> 

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> 
{ 
    typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType; 
    typedef MyType::state_type   state_type; 
    typedef MyType::result    result; 


    /* This function deals with converting data from the input stream into the internal stream.*/ 
    /* 
    * from, from_end: Points to the beginning and end of the input that we are converting 'from'. 
    * to, to_limit: Points to where we are writing the conversion 'to' 
    * from_next:  When the function exits this should have been updated to point at the next location 
    *     to read from. (ie the first unconverted input character) 
    * to_next:   When the function exits this should have been updated to point at the next location 
    *     to write to. 
    * 
    * status:   This indicates the status of the conversion. 
    *     possible values are: 
    *     error:  An error occurred the bad file bit will be set. 
    *     ok:   Everything went to plan 
    *     partial: Not enough input data was supplied to complete any conversion. 
    *     nonconv: no conversion was done. 
    */ 
    virtual result do_in(state_type &s, 
          const char *from,const char *from_end,const char* &from_next, 
          wchar_t  *to, wchar_t *to_limit,wchar_t* &to_next) const 
    { 
     // Loop while a whole 16-bit unit of input and one output slot remain. 
     for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to) 
     { 
      /*Input the Data*/ 
      /* As the 16 input bits may not fill the wchar_t object, 
      * initialise it first so that all of its bits are zero. This 
      * is important on systems with 32-bit wchar_t objects. 
      */ 
      (*to)        = L'\0'; 

      /* Next read the data from the input stream into the 
      * wchar_t object. Remember that we need to copy 
      * into the bottom 16 bits no matter what size 
      * the wchar_t object is. 
      */ 
      reinterpret_cast<char*>(to)[0] = from[0]; 
      reinterpret_cast<char*>(to)[1] = from[1]; 
     } 
     from_next = from; 
     to_next  = to; 

     // Report partial if input is left over (no output space, or a dangling byte). 
     return((from < from_end)?partial:ok); 
    } 



    /* This function deals with converting data from the internal stream to a C/C++ file stream.*/ 
    /* 
    * from, from_end: Points to the beginning and end of the input that we are converting 'from'. 
    * to, to_limit: Points to where we are writing the conversion 'to' 
    * from_next:  When the function exits this should have been updated to point at the next location 
    *     to read from. (ie the first unconverted input character) 
    * to_next:   When the function exits this should have been updated to point at the next location 
    *     to write to. 
    * 
    * status:   This indicates the status of the conversion. 
    *     possible values are: 
    *     error:  An error occurred the bad file bit will be set. 
    *     ok:   Everything went to plan 
    *     partial: Not enough input data was supplied to complete any conversion. 
    *     nonconv: no conversion was done. 
    */ 
    virtual result do_out(state_type &state, 
          const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next, 
          char   *to, char   *to_limit, char*   &to_next) const 
    { 
     // Loop while input remains and there is room for two output bytes. 
     for(;(from < from_end) && (to_limit - to >= 2);++from,to += 2) 
     { 
      /* Output the Data */ 
      /* NB I am assuming the characters are encoded as UTF-16. 
      * This means they are 16 bits inside a wchar_t object. 
      * As the size of wchar_t varies between platforms I need 
      * to take this into consideration and only take the bottom 
      * 16 bits of each wchar_t object. 
      */ 
      to[0]  = reinterpret_cast<const char*>(from)[0]; 
      to[1]  = reinterpret_cast<const char*>(from)[1]; 

     } 
     from_next = from; 
     to_next  = to; 

     // Report partial if the output buffer filled up before all input was written. 
     return((from < from_end)?partial:ok); 
    } 
}; 
+0

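Note that your facet implements a conversion to UCS-2, not UTF-16. UTF-16 is a variable-length encoding that uses a device called surrogate pairs. UCS-2 covers only a subset of Unicode, which is why UTF-16 was invented. – 2017-05-04 21:12:29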
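To illustrate the difference the comment points out, here is a minimal output-only sketch of a facet that emits real UTF-16LE, including surrogate pairs for code points above U+FFFF. The class name `Utf16LeOutFacet` is invented for this example. The sketch assumes each wchar_t holds a code point (or, with a 2-byte wchar_t, a UTF-16 code unit), does no validation, and leaves reading (do_in) to the base class. It also overrides do_max_length, on the assumption that the stream implementation uses that value to size its conversion buffer.

```cpp
#include <cwchar>
#include <locale>

// Sketch only: writes UTF-16LE with surrogate pairs. Not the facet from the
// answer above; the class name is hypothetical.
class Utf16LeOutFacet : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    result do_out(state_type&,
                  const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next,
                  char* to, char* to_limit, char*& to_next) const override
    {
        result res = ok;
        for (; from < from_end; ++from) {
            unsigned long cp = static_cast<unsigned long>(*from);
            // Code points above U+FFFF need two 16-bit units (4 bytes).
            const int needed = (cp > 0xFFFF) ? 4 : 2;
            if (to_limit - to < needed) {
                res = partial;  // output buffer is full
                break;
            }
            if (cp > 0xFFFF) {
                cp -= 0x10000;
                put16(to, 0xD800 + (cp >> 10));        // high surrogate
                put16(to + 2, 0xDC00 + (cp & 0x3FF));  // low surrogate
            } else {
                put16(to, cp);
            }
            to += needed;
        }
        from_next = from;
        to_next = to;
        return res;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }    // variable width
    int do_max_length() const noexcept override { return 4; }  // worst case per wchar_t

private:
    // Store one 16-bit unit, low byte first (little-endian).
    static void put16(char* p, unsigned long unit)
    {
        p[0] = static_cast<char>(unit & 0xFF);
        p[1] = static_cast<char>((unit >> 8) & 0xFF);
    }
};
```

It would be installed the same way as UTF16Facet: build a locale with it and imbue the wofstream before opening the file.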

2

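Under Windows, using wofstream with the UTF16 facet defined above fails, because wofstream converts every byte with the value 0A into the two bytes 0D 0A. This happens no matter how you pass the 0A byte in: '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with ofstream (not wofstream) in binary mode and write the output just as it is done in the original post.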
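A minimal sketch of that approach, assuming a C++11 compiler so that the text can be held in a std::u16string: the bytes go through a plain ofstream opened in binary mode, so no newline translation can occur. The function name `write_utf16le_file` is made up for the example.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: write text as UTF-16LE with a BOM through a
// byte-oriented stream opened in binary mode, so no 0A -> 0D 0A translation
// takes place. Returns false if the file could not be written.
bool write_utf16le_file(const std::string& path, const std::u16string& text)
{
    std::ofstream out(path.c_str(), std::ios::out | std::ios::binary);
    if (!out)
        return false;

    out.write("\xFF\xFE", 2);  // byte order mark for little-endian
    for (char16_t unit : text) {
        // Emit each code unit explicitly as low byte, then high byte,
        // independent of the host's endianness.
        const char bytes[2] = { static_cast<char>(unit & 0xFF),
                                static_cast<char>((unit >> 8) & 0xFF) };
        out.write(bytes, 2);
    }
    return static_cast<bool>(out.flush());
}
```

A call such as write_utf16le_file("output.uni", u"line one\nline two\n") should leave each newline as the two bytes 0A 00.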

1

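The provided Utf16Facet didn't work in gcc for big strings; here is the version that worked for me... This way the file will be saved in UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0].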

#include <locale> 


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> 
{ 
    typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType; 
    typedef MyType::state_type   state_type; 
    typedef MyType::result    result; 


    /* This function deals with converting data from the input stream into the internal stream.*/ 
    /* 
    * from, from_end: Points to the beginning and end of the input that we are converting 'from'. 
    * to, to_limit: Points to where we are writing the conversion 'to' 
    * from_next:  When the function exits this should have been updated to point at the next location 
    *     to read from. (ie the first unconverted input character) 
    * to_next:   When the function exits this should have been updated to point at the next location 
    *     to write to. 
    * 
    * status:   This indicates the status of the conversion. 
    *     possible values are: 
    *     error:  An error occurred the bad file bit will be set. 
    *     ok:   Everything went to plan 
    *     partial: Not enough input data was supplied to complete any conversion. 
    *     nonconv: no conversion was done. 
    */ 
    virtual result do_in(state_type &s, 
          const char *from,const char *from_end,const char* &from_next, 
          wchar_t  *to, wchar_t *to_limit,wchar_t* &to_next) const 
    { 

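     // Nothing has been converted yet. 
     from_next = from; 
     to_next  = to; 

     // Convert while a whole 16-bit unit of input and one output slot remain. 
     for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to) 
     { 
      (*to)        = L'\0'; 

      reinterpret_cast<char*>(to)[0] = from[0]; 
      reinterpret_cast<char*>(to)[1] = from[1]; 

      // Point at the first unconverted input byte and the next free output slot. 
      from_next = from + 2; 
      to_next  = to + 1; 
     } 

     return((from_next != from_end)?partial:ok); 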
    } 



    /* This function deals with converting data from the internal stream to a C/C++ file stream.*/ 
    /* 
    * from, from_end: Points to the beginning and end of the input that we are converting 'from'. 
    * to, to_limit: Points to where we are writing the conversion 'to' 
    * from_next:  When the function exits this should have been updated to point at the next location 
    *     to read from. (ie the first unconverted input character) 
    * to_next:   When the function exits this should have been updated to point at the next location 
    *     to write to. 
    * 
    * status:   This indicates the status of the conversion. 
    *     possible values are: 
    *     error:  An error occurred the bad file bit will be set. 
    *     ok:   Everything went to plan 
    *     partial: Not enough input data was supplied to complete any conversion. 
    *     nonconv: no conversion was done. 
    */ 
    virtual result do_out(state_type &state, 
          const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next, 
          char   *to, char   *to_limit, char*   &to_next) const 
    { 

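     // Nothing has been converted yet. 
     from_next = from; 
     to_next  = to; 

     // Convert while input remains and there is room for two output bytes. 
     for(;(from < from_end) && (to_limit - to >= 2);++from, to += 2) 
     { 
      to[0]  = reinterpret_cast<const char*>(from)[0]; 
      to[1]  = reinterpret_cast<const char*>(from)[1]; 

      // Point at the next input character and the next free output byte. 
      from_next = from + 1; 
      to_next  = to + 2; 
     } 

     return((from_next != from_end)?partial:ok); 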
    } 
}; 
6

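If you use the C++11 standard, this is easy (it adds a lot of includes, such as "utf8", that solve this problem for good).

But if you want multi-platform code that works with older standards, you can use this method to write with streams: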
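For instance, the C++11 header <codecvt> provides std::codecvt_utf16, which can be combined with std::wstring_convert. The sketch below is one possible use, not necessarily what the answer had in mind; the function name `wstring_to_utf16le` is made up, and both facilities have been deprecated since C++17, so a recent compiler may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Sketch: convert a wstring to UTF-16LE bytes with the C++11 <codecvt>
// facilities (deprecated since C++17). No BOM is generated here.
std::string wstring_to_utf16le(const std::wstring& text)
{
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10FFFF, std::little_endian> > conv;
    return conv.to_bytes(text);
}
```

The returned bytes can be written to an ofstream opened in binary mode.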

  1. Read the article about UTF converter for streams
  2. Add stxutif.h from the sources above to your project
  3. Open the file in ANSI mode and add the BOM to the start of the file, like this:

    std::ofstream fs; 
    fs.open(filepath, std::ios::out|std::ios::binary); 
    
    unsigned char smarker[3]; 
    smarker[0] = 0xEF; 
    smarker[1] = 0xBB; 
    smarker[2] = 0xBF; 
    
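    // smarker is not null-terminated, so write exactly its 3 bytes 
    fs.write(reinterpret_cast<const char*>(smarker), sizeof(smarker)); 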
    fs.close(); 
    
  4. Then open the file as UTF and write your content there:

    std::wofstream fs; 
    fs.open(filepath, std::ios::out|std::ios::app); 
    
    std::locale utf8_locale(std::locale(), new utf8cvt<false>); 
    fs.imbue(utf8_locale); 
    
    fs << .. // Write anything you want...