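How do I convert Unicode code points to hexadecimal HTML entities?

I have a data file (an Apple plist, to be exact) that contains Unicode code points such as \U00e8 and \U2019. I need to turn these into valid hexadecimal HTML entities using PHP.

What I'm doing right now is a long string of: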

$fileContents = str_replace("\U00e8", "&#xe8;", $fileContents); 
$fileContents = str_replace("\U2019", "&#x2019;", $fileContents);

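This is clearly dreadful. I could use a regular expression to convert \U and all the zeros that follow it to &#x, then stick on the trailing ;, but that also seems heavy-handed.

Is there a clean, simple way to take a string and replace all the Unicode code points with HTML entities?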

2010-08-13 Tina Marie

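PCRE regular expressions are very fast and safe; I would use them. (Any other "official" solution would probably also use regular expressions, or a lookup table, which is what you have now.) – MvanGeest 2010-08-13 19:30:29

According to [this page](http://code.google.com/p/networkpx/wiki/PlistSpec), those escape sequences represent UTF-16 code units, not Unicode code points. This means you may have to combine two consecutive code units (if they form a surrogate pair) into a single HTML entity. – Artefacto 2010-08-13 21:30:56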

You can use preg_replace:

$fileContents = preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x\1;', $fileContents);

Testing the RE:

PS> 'some \U00e8 string with \U2019 embedded Unicode' -replace '\\U0*([0-9a-f]{1,5})','&#x$1;' 
some &#xe8; string with &#x2019; embedded Unicode
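
The same check can be run in PHP itself. The following is only a sketch: the wrapper name `plist_escapes_to_entities` is hypothetical and not part of the answer, while the pattern is the one shown above. The second call illustrates a caveat that seems worth knowing: because the pattern accepts up to five hex digits, a hex letter that directly follows an escape is pulled into the entity.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function plist_escapes_to_entities($text) {
    return preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x\1;', $text);
}

// Single quotes keep \U as a literal backslash followed by U.
echo plist_escapes_to_entities('some \U00e8 string with \U2019 embedded Unicode'), "\n";
// prints: some &#xe8; string with &#x2019; embedded Unicode

// Caveat: the trailing "e" is a hex digit, so it is absorbed into the entity.
echo plist_escapes_to_entities('fianc\U00e9e'), "\n";
// prints: fianc&#xe9e;
```

If the escapes in the file are always \U followed by exactly four hex digits, as in the examples from the question, matching exactly four digits avoids this.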

2010-08-13 19:34:03 Joey

Looks like a clear-cut use case for regular expressions. @Tina Marie, if you need more plist processing, take a look at http://code.google.com/p/cfpropertylist/. – 2010-08-13 19:37:12

Yes, use CFPropertyList. It's great! – 2010-08-13 19:52:24

Here is a correct answer that deals with the fact that these are code units, not code points, and that allows decoding supplementary characters.

function unenc_utf16_code_units($string) {
    /* go for possible surrogate pairs first */
    $string = preg_replace_callback(
        '/\\\\U(D[89ab][0-9a-f]{2})\\\\U(D[c-f][0-9a-f]{2})/i',
        function ($matches) {
            $hi_surr = hexdec($matches[1]);
            $lo_surr = hexdec($matches[2]);
            /* low 10 bits of each surrogate, offset by 0x10000 */
            $scalar = 0x10000 + ((($hi_surr & 0x3FF) << 10) |
                ($lo_surr & 0x3FF));
            return "&#x" . dechex($scalar) . ";";
        },
        $string
    );
    /* now the rest */
    $string = preg_replace_callback(
        '/\\\\U([0-9a-f]{4})/i',
        function ($matches) {
            // just to remove leading zeros
            return "&#x" . dechex(hexdec($matches[1])) . ";";
        },
        $string
    );
    return $string;
}
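
To see the surrogate arithmetic in isolation, here is a minimal sketch. The helper name `surrogate_pair_to_code_point` and the range check are assumptions added for illustration; the formula is the same one used in the function above. It takes the two code units as integers and returns the combined code point.

```php
<?php
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
function surrogate_pair_to_code_point($hi, $lo) {
    // High surrogates are D800-DBFF, low surrogates are DC00-DFFF.
    if ($hi < 0xD800 || $hi > 0xDBFF || $lo < 0xDC00 || $lo > 0xDFFF) {
        throw new InvalidArgumentException('not a valid surrogate pair');
    }
    // Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((($hi & 0x3FF) << 10) | ($lo & 0x3FF));
}

// \UD83D\UDE00 in a plist is the surrogate pair for U+1F600.
$codePoint = surrogate_pair_to_code_point(0xD83D, 0xDE00);
echo '&#x' . dechex($codePoint) . ';', "\n"; // prints: &#x1f600;
```

Converting each half separately would instead yield &#xd83d; and &#xde00;, which are not valid character references, which is why the pair pass runs first.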

2010-08-24 00:41:18 Artefacto
