2014-10-28 119 views
3

UTF-16 hex to text

I have a UTF-16 hex representation such as "0633064406270645", which is "سلام" in Arabic.

I would like to convert it to the corresponding text. Is there a direct way to do this in PostgreSQL?

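I can convert UTF-8 hex as shown below; unfortunately, UTF-16 does not seem to be supported. Any ideas on how to do this in PostgreSQL? In the worst case I will write a function.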

SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8'); 

"سلام" 

SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16'); 

ERROR: invalid source encoding name "UTF16" 
********** Error ********** 

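Thank you both, both techniques work well. The function seems to be more accurate than using Unicode escape sequences. My application does not require that accuracy, so either technique is fine. – Adam 2014-10-28 15:02:54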

Answers

2

That's right, Postgres does not support UTF-16.

However, it does support Unicode escape sequences:

SELECT U&'\0633\0644\0627\0645' 
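
The escape syntax only applies to string literals. When the hex digits arrive as a value (a column or a parameter), one possible sketch is to split the string into four-digit groups and pass each group to chr(). This assumes a UTF8 database and input that contains only Basic Multilingual Plane characters; the literal input and the alias names (input, h, i, decoded) are illustrative.

```sql
-- Sketch: decode a hex string of 16-bit code units, BMP characters only.
-- Assumes the database encoding is UTF8; surrogate pairs are NOT handled here.
SELECT string_agg(
         chr(('x' || lpad(substr(h, i, 4), 8, '0'))::bit(32)::int),
         '' ORDER BY i
       ) AS decoded
FROM (SELECT '0633064406270645'::text AS h) AS input,
     generate_series(1, length(h), 4) AS i;
```

For the sample string this should yield the same text as the escape literal above.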

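Keep in mind, however, that Unicode code points and UTF-16 code units are equivalent only within the Basic Multilingual Plane. In other words, if you have any UTF-16 characters that span more than one 16-bit code unit (surrogate pairs), you need to translate them into the corresponding code points yourself.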
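
To make that translation concrete, here is a small sketch of the standard surrogate-pair arithmetic. It uses U+24B62 (UTF-16: D852 DF62) as an illustrative character that does not appear in the question, and it assumes a UTF8 database.

```sql
-- High surrogate D852 and low surrogate DF62 combine into one code point:
--   ((high - D800) << 10) + (low - DC00) + 10000   (all values in hex)
SELECT to_hex(((x'D852'::int - x'D800'::int) << 10)
            + (x'DF62'::int - x'DC00'::int)
            + x'10000'::int) AS code_point_hex;   -- 24b62

-- The same character written with the six-digit escape form:
SELECT U&'\+024B62' = chr(x'24B62'::int) AS same_character;
```

Once the code point is known, the six-digit form `\+XXXXXX` can be used in a U& literal in place of the two four-digit surrogates.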

2

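convert_from (or PostgreSQL in general) does not support UTF-16, but you can use one of the optional procedural languages.

Example with plperlu (creating the function requires database superuser privileges, plus CREATE LANGUAGE plperlu if the language has not been created yet):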

CREATE FUNCTION decode_utf16(text) RETURNS text AS $$ 
    require Encode; 
    return Encode::decode("UTF-16BE", pack("H*", $_[0])); 
$$ immutable language plperlu; 

=> select decode_utf16('0633064406270645'); 

decode_utf16 
-------------- 
سلام 
1

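PostgreSQL does not support UTF-16 natively. I suggest converting your data to UTF-8 before feeding it to the database. If it is too late for that (the badly encoded data is already in your database), you can use these maintenance functions to convert the data from UTF-16 (logic copied from Wikipedia):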

-- convert from bytea, containing UTF-16-BE data 
CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?') 
    RETURNS text 
    LANGUAGE sql 
    IMMUTABLE 
    STRICT 
AS $function$ 
WITH source(i, unit) AS (
    -- combine each byte pair into one 16-bit code unit (high byte first)
    SELECT i, (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1) 
    FROM generate_series(0, octet_length(utf16_data) - 2, 2) i 
), 
codes(i, lag, unit, lead) AS (
    -- expose the neighbouring code units so surrogate pairs can be detected;
    -- ORDER BY i makes the processing order explicit
    SELECT i, lag(unit, 1) OVER (ORDER BY i), unit, lead(unit, 1) OVER (ORDER BY i) 
    FROM source 
) 
SELECT string_agg(CASE 
    WHEN unit >= 56320 AND unit <= 57343 THEN CASE -- low surrogate (DC00..DFFF)
        WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed 
        ELSE invalid_replacement 
    END 
    WHEN unit >= 55296 AND unit <= 56319 THEN CASE -- high surrogate (D800..DBFF)
        WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888) 
        ELSE invalid_replacement 
    END 
    ELSE chr(unit) 
END, '' ORDER BY i) 
FROM codes 
$function$; 

-- convert from bytea, containing UTF-16-LE data 
CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?') 
    RETURNS text 
    LANGUAGE sql 
    IMMUTABLE 
    STRICT 
AS $function$ 
WITH source(i, unit) AS (
    -- combine each byte pair into one 16-bit code unit (low byte first)
    SELECT i, get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8) 
    FROM generate_series(0, octet_length(utf16_data) - 2, 2) i 
), 
codes(i, lag, unit, lead) AS (
    -- expose the neighbouring code units so surrogate pairs can be detected;
    -- ORDER BY i makes the processing order explicit
    SELECT i, lag(unit, 1) OVER (ORDER BY i), unit, lead(unit, 1) OVER (ORDER BY i) 
    FROM source 
) 
SELECT string_agg(CASE 
    WHEN unit >= 56320 AND unit <= 57343 THEN CASE -- low surrogate (DC00..DFFF)
        WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed 
        ELSE invalid_replacement 
    END 
    WHEN unit >= 55296 AND unit <= 56319 THEN CASE -- high surrogate (D800..DBFF)
        WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888) 
        ELSE invalid_replacement 
    END 
    ELSE chr(unit) 
END, '' ORDER BY i) 
FROM codes 
$function$; 

-- convert from bytea, containing UTF-16 data (with or without BOM) 
CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?') 
    RETURNS text 
    LANGUAGE sql 
    IMMUTABLE 
    STRICT 
AS $function$ 
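SELECT CASE COALESCE(octet_length(utf16_data), 0) 
    WHEN 0 THEN '' 
    WHEN 1 THEN invalid_replacement 
    ELSE CASE substring(utf16_data FOR 2) 
        -- BOM bytes FF FE: little-endian, skip the BOM
        WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement) 
        -- BOM bytes FE FF: big-endian, skip the BOM
        WHEN E'\\xFEFF' THEN convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement) 
        -- no BOM: assume big-endian and keep every byte
        ELSE convert_from_utf16be(utf16_data, invalid_replacement) 
    END 
END 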
$function$; 
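
The constant 56613888 in the two endian-specific functions is not explained in the answer. A quick check, offered as a sketch rather than as the author's own derivation, suggests that it folds the three offsets of the surrogate-pair formula into a single number:

```sql
-- (high << 10) + low - 56613888 is equivalent to
-- ((high - D800) << 10) + (low - DC00) + 10000 (hex), because:
SELECT (x'D800'::int << 10) + x'DC00'::int - x'10000'::int AS folded_offset;  -- 56613888

-- Worked example with the pair D852 DF62 (decimal 55378 and 57186):
SELECT to_hex((55378 << 10) + 57186 - 56613888) AS code_point_hex;  -- 24b62
```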

With these functions you can convert from every flavour of UTF-16:

SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')), 
     convert_from_utf16le(decode('330644062706450652D862DF', 'hex')), 
     convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')), 
     convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex')); 

-- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16 
-- ---------------------+----------------------+--------------------+-------------------
-- سلام𤭢               | سلام𤭢               | سلام𤭢             | سلام𤭢