2012-07-09 44 views
3

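Loading a file with two separate encodings in Ruby

I have a large file that contains two different encodings. The "main" file is UTF-8, but some characters, such as <9F>, are encoded in ISO-8859-1. I can replace the invalid characters with this: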

string.encode("iso8859-1", "utf-8", {:invalid => :replace, :replace => "-"}).encode("utf-8") 

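The problem is that I need these wrongly encoded characters, so replacing them with "-" is useless to me. How can I repair the wrongly encoded characters in the document with Ruby?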

Edit: I have already tried the :fallback option, but without success (nothing was replaced):

string.encode("iso8859-1", "utf-8", 
    :fallback => {"\x80" => "123"} 
) 
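
A minimal sketch of why :fallback may have no effect here, assuming Ruby 1.9 or later. As far as the String#encode documentation describes it, :fallback is consulted for undefined conversions (a valid character that has no equivalent in the target encoding), whereas a stray ISO-8859-1 byte inside UTF-8 text is an invalid byte sequence, which is a different error. The sample strings are made up for illustration.

```ruby
# Undefined conversion: the euro sign (U+20AC) is valid UTF-8 but has no
# ISO-8859-1 equivalent, so the :fallback hash is consulted.
euro = "\u20AC"
converted = euro.encode("ISO-8859-1", :fallback => { "\u20AC" => "EUR" })

# Invalid byte sequence: a lone 0xE3 byte is not valid UTF-8, so encode
# raises and the :fallback hash is never looked at.
broken = [0xE3].pack("C*").force_encoding("UTF-8")
error = begin
  broken.encode("ISO-8859-1", :fallback => { broken => "?" })
  nil
rescue Encoding::InvalidByteSequenceError => e
  e
end

puts converted
puts error.class
```

Invalid sequences are governed by the :invalid option instead, which is why only the :replace variant in the question changed anything.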
+0

fallback only works when no other options are given. See the link I posted earlier. – phoet 2012-07-10 07:45:32

+0

No, I already tried it without any extra options and it did not work :( – Fu86 2012-07-10 13:28:33

Answers

1

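I use the following code (Ruby 1.8.7). It tests each byte >= 128 to check whether it is the start of a valid UTF-8 sequence. If it is not, the byte is assumed to be ISO-8859-1 and is converted to UTF-8.

Since your file is large, this procedure may be very slow!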

class String
  # Returns a copy in which every character is valid UTF-8.
  # Based on http://php.net/manual/en/function.utf8-encode.php#39986
  def utf8
    ret = [] # output bytes

    # Walk the bytes by index rather than with each_byte,
    # because i has to be advanced inside the loop.
    a = unpack('C*')
    i = 0
    l = a.length
    while i < l
      b = a[i]
      i += 1

      # ASCII bytes are copied unchanged.
      if b < 0x80
        ret << b
        next
      end

      # Check whether b starts a UTF-8 sequence;
      # n is the number of continuation bytes it announces.
      # n == m.length means b is not a lead byte at all.
      m = [0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe]
      n = 1
      n += 1 until n >= m.length || (b & m[n]) == m[n - 1]

      # Do n bytes matching 10bbbbbb follow?
      tail = a[i, n]
      if n < m.length && tail.length == n && tail.all? { |c| (c & 0xc0) == 0x80 }
        # Complete sequence: copy it as it is.
        ret << b
        ret.concat(tail)
        i += n
      else
        # Otherwise treat b alone as ISO-8859-1 and convert it;
        # the bytes after it are examined again on the next pass.
        ret.concat([b].pack('U').unpack('C*'))
      end
    end

    s = ret.pack('C*')
    # Ruby 1.9+ tags strings with an encoding; 1.8 has no force_encoding.
    s.force_encoding('UTF-8') if s.respond_to?(:force_encoding)
    s
  end

  def utf8!
    replace utf8
  end
end

# let s be the string containing your file. 
s2 = s.utf8 

# or 
s.utf8! 
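
For a large file it may help to avoid holding everything in one string. The following is a minimal sketch, not the method above: it assumes Ruby 2.1 or later for String#scrub, and the helper name copy_as_utf8 and the file names in the usage comment are hypothetical. Lines that are already valid UTF-8 are passed through untouched, so only lines containing stray bytes pay for the conversion.

```ruby
# Hypothetical helper: copies +input+ to +output+ line by line and re-encodes
# only the byte runs that are not valid UTF-8. +fallback+ names the encoding
# assumed for those stray bytes.
def copy_as_utf8(input, output, fallback = "ISO-8859-1")
  input.each_line do |raw|
    line = raw.dup.force_encoding("UTF-8")
    unless line.valid_encoding?
      # String#scrub yields each invalid byte run to the block and
      # substitutes the block's return value.
      line = line.scrub { |bytes| bytes.encode("UTF-8", fallback) }
    end
    output << line
  end
end

# Usage with files opened in binary mode (file names are placeholders):
#   File.open("in.txt", "rb") do |src|
#     File.open("out.txt", "wb") { |dst| copy_as_utf8(src, dst) }
#   end
```

Splitting on "\n" is safe here because the newline byte means the same thing in UTF-8 and in ISO-8859-1.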
+0

OK, this may work, but really? Is this the only solution to this problem? That is a lot of "overhead" for fixing a few bad characters. – Fu86 2012-07-10 07:36:42

+0

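Unfortunately, it cannot be done without testing every bad character, because they can be part of a legitimate UTF-8 sequence. By the way, the code above, as first posted, does not work on 1.9.3; I am thinking of fixing it. – 2012-07-11 20:31:48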
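
A small illustration of that ambiguity, assuming Ruby 2.0 or later (the sample characters are chosen for the example): two ISO-8859-1 characters in a row can form a byte pair that is also a perfectly valid UTF-8 character, so no byte-level check can tell which one was meant.

```ruby
# "Ã£" written out in ISO-8859-1 is the two bytes C3 A3 ...
latin1_bytes = "\u00C3\u00A3".encode("ISO-8859-1").bytes

# ... and those same two bytes are exactly the UTF-8 encoding of "ã".
reinterpreted = latin1_bytes.pack("C*").force_encoding("UTF-8")

puts reinterpreted.valid_encoding? # nothing marks these bytes as ISO-8859-1
puts reinterpreted == "\u00E3"
```

Any repair based on UTF-8 validity will therefore leave such pairs as they are.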

1

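This is a much faster version of my previous code, compatible with Ruby 1.8 and 1.9.

I identify the invalid UTF-8 characters with a regular expression and convert only those.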

class String

  # Regexp for invalid UTF-8 chars.
  # $1 will be a run of valid UTF-8 sequences;
  # $3 will be the invalid UTF-8 byte, if any.
  # Written as a literal with the n (binary) flag, because newer Ruby
  # versions no longer accept the third argument of Regexp.new.
  INVALID_UTF8 = /
    (([\xc0-\xdf][\x80-\xbf]{1}|
      [\xe0-\xef][\x80-\xbf]{2}|
      [\xf0-\xf7][\x80-\xbf]{3}|
      [\xf8-\xfb][\x80-\xbf]{4}|
      [\xfc-\xfd][\x80-\xbf]{5})*)
    ([\x80-\xff]?)
  /nx

  if RUBY_VERSION >= '1.9'
    # Ensure each char is UTF-8, assuming that
    # bad characters are in the +encoding+ encoding.
    def utf8_ignore!(encoding)
      # Work on raw bytes to avoid bad-character errors
      # and encoding incompatibilities.
      force_encoding('ascii-8bit')

      # Re-encode only the invalid UTF-8 bytes within the string.
      gsub!(INVALID_UTF8) do
        $1 + $3.force_encoding(encoding).encode('utf-8').force_encoding('ascii-8bit')
      end

      # The final string is in UTF-8.
      force_encoding('utf-8')
    end

  else
    require 'iconv'

    # Ensure each char is UTF-8, assuming that
    # bad characters are in the +encoding+ encoding.
    def utf8_ignore!(encoding)
      # Re-encode only the invalid UTF-8 bytes within the string.
      gsub!(INVALID_UTF8) do
        $1 + Iconv.conv('utf-8', encoding, $3)
      end
    end
  end

end

# "\xe3" = "ã" in iso-8859-1 
# mix valid with invalid utf8 chars, which is in iso-8859-1 
a = "ãb\xe3" 

a.utf8_ignore!('iso-8859-1') 

puts a #=> ãbã 
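
One thing worth checking before choosing the encoding argument, offered as a hedged aside: bytes in the range 0x80 to 0x9F (such as the <9F> and "\x80" mentioned in the question) are control characters in ISO-8859-1 but printable characters in Windows-1252. Which of the two the file really uses cannot be told from this thread, so the sketch below only shows how the result differs (Ruby 1.9 or later assumed).

```ruby
byte_80 = [0x80].pack("C")
byte_9f = [0x9F].pack("C")

# Interpreted as ISO-8859-1, 0x80 becomes the control character U+0080.
as_latin1 = byte_80.encode("UTF-8", "ISO-8859-1")

# Interpreted as Windows-1252, the same byte becomes the euro sign U+20AC,
# and 0x9F becomes U+0178.
as_cp1252    = byte_80.encode("UTF-8", "Windows-1252")
as_cp1252_9f = byte_9f.encode("UTF-8", "Windows-1252")

puts format("U+%04X U+%04X U+%04X", as_latin1.ord, as_cp1252.ord, as_cp1252_9f.ord)
```

If the repaired text shows invisible characters where symbols were expected, passing 'windows-1252' instead of 'iso-8859-1' may be the better guess.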