2014-09-30 65 views

XML file: How to convert UTF-16 to ASCII/ANSI using XML/SGML entities?

<?xml version="1.0" encoding="utf-8"?> 
<response> 
<center> 
<b>Need to decode this -> 😉</b> 
</center> 
</response> 

My current code:

procedure TForm1.Button1Click(Sender: TObject); 
var 
    Doc: IXMLDocument; 
    S: AnsiString; 
    SW: WideString; 
    I: Integer; 
begin 
    Doc := TXMLDocument.Create(nil); 
    Doc.LoadFromFile('example.xml'); 
    SW := Doc.DocumentElement.ChildNodes['center'].ChildNodes['b'].NodeValue; 
    S := ''; 
    for I := 1 to Length(SW) do 
    if Ord(SW[I]) > $04FF then 
     S := S + IntToHex(Ord(SW[I]), 4) + ' ' 
    else 
     S := S + SW[I]; 
    Memo1.Text := s; 
end; 

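SW is encoded in UTF-16 (it is a WideString) and contains the character sequence #$D83D#$DE09, but I need it as an XML/SGML entity like '&#128521;'. How can I encode it that way?

The character in question is this one: http://www.fileformat.info/info/unicode/char/1f609/index.htm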

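Not really clear. So basically, you are unhappy with how the XML DOM implementation decodes a character outside the Basic Multilingual Plane and encodes it as two UTF-16 code units? And you want to re-encode it as an SGML character entity? – 2014-09-30 02:29:16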

Really not clear, I forgot to add the XML document... I'll add it now – user3802199 2014-09-30 02:32:04

Added the XML document – user3802199 2014-09-30 02:33:53

Answers

With ANSI Delphi you have to handle UTF-16 surrogate pairs manually (or use a third-party library).

This should work in both ANSI and Unicode Delphi:

uses 
    {$IFDEF UNICODE} 
    Xml.XMLDoc, Xml.XMLIntf, System.AnsiStrings, System.Character; 
    {$ELSE} 
    XMLDoc, XMLIntf; 
    {$ENDIF} 

{$R *.dfm} 

type 
{$IFDEF UNICODE} 
    ValueString = UnicodeString; 
{$ELSE} 
    ValueString = WideString; 
{$ENDIF} 

procedure Check(ATrue: Boolean; const AMessage: string); 
begin 
    if not ATrue then 
    raise Exception.Create(AMessage); 
end; 

function IsHighSurrogate(AChar: WideChar): Boolean; 
begin 
{$IFDEF UNICODE} 
    Result := TCharacter.IsHighSurrogate(AChar); 
{$ELSE} 
    Result := (AChar >= #$D800) and (AChar <= #$DBFF); 
{$ENDIF} 
end; 

function ConvertToUtf32(AHigh, ALow: WideChar): Integer; 
begin 
    {$IFDEF UNICODE} 
    Result := Ord(TCharacter.ConvertToUtf32(AHigh, ALow)); 
    {$ELSE} 
    Check(AHigh >= #$D800, 'Invalid high surrogate code point'); 
    Check(AHigh <= #$DBFF, 'Invalid high surrogate code point'); 
    Check(ALow >= #$DC00, 'Invalid low surrogate code point'); 
    Check(ALow <= #$DFFF, 'Invalid low surrogate code point'); 
    // Combine the pair into one code point: each surrogate carries 10 payload bits, 
    // and the result is offset by $10000 because pairs only encode characters above the BMP 
    Result := $010000 + (((Ord(AHigh) - $D800) shl 10) or (Ord(ALow) - $DC00)); 
    {$ENDIF} 
end; 

function MakeEntity(AValue: Integer): AnsiString; 
begin 
    Result := Format(AnsiString('&#%d;'), [AValue]); 
end; 

function UnicodeToAsciiWithEntities(const AInput: ValueString): AnsiString; 
var 
    C: WideChar; 
    I: Integer; 
begin 
    Result := ''; 
    I := 1; 
    while I <= Length(AInput) do 
    begin 
        C := AInput[I]; 
        if C < #$0080 then 
            // Plain 7-bit ASCII passes through unchanged 
            Result := Result + AnsiChar(C) 
        else if IsHighSurrogate(C) then 
        begin 
            Check((I + 1) <= Length(AInput), 'String truncated after high surrogate'); 
            Result := Result + MakeEntity(ConvertToUtf32(C, AInput[I + 1])); 
            // Skip low surrogate 
            Inc(I); 
        end 
        else 
            // Any other BMP character becomes a numeric entity 
            Result := Result + MakeEntity(Ord(C)); 
        Inc(I); 
    end; 
end; 

procedure TForm1.Button1Click(Sender: TObject); 
begin 
    Memo1.Lines.Text := string(UnicodeToAsciiWithEntities(LoadXMLDocument(
    'example.xml').DocumentElement.ChildNodes['center'].ChildNodes['b'].NodeValue 
)); 
end; 

I don't have Delphi 7 here, so some minor adjustments may be necessary; the code works in XE2 and Delphi 2007.
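
As a sanity check on the surrogate arithmetic in the non-Unicode branch of ConvertToUtf32, here is a small console sketch that decodes the pair from the question by hand. The program and function names (SurrogateCheck, SurrogatesToCodePoint) are made up for illustration, and the FPC mode switch is only there on the assumption that someone might try it in Free Pascal rather than Delphi.

```pascal
program SurrogateCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: combines a UTF-16 surrogate pair into a code point.
// No range checks here; the caller is assumed to pass a valid pair.
function SurrogatesToCodePoint(AHigh, ALow: Word): Integer;
begin
  Result := $10000 + ((AHigh - $D800) shl 10) + (ALow - $DC00);
end;

begin
  // The pair from the question: #$D83D#$DE09
  WriteLn(Format('&#%d;', [SurrogatesToCodePoint($D83D, $DE09)]));
  // prints &#128521;
end.
```

Step by step: $D83D - $D800 = $3D, shifted left by 10 bits gives $F400; $DE09 - $DC00 = $209; and $10000 + $F400 + $209 = $1F609, which is 128521 in decimal.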

The XML document declares its encoding as UTF-8 – 2014-09-30 16:13:31

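Rather than converting the entire XML content to a `UCS4String` and wasting 2-4x the memory, I would leave it as a `UnicodeString` and just loop through it looking for surrogates, converting them to entities when needed. Look at the `System.Character` functions such as `IsSurrogatePair()` and `ConvertToUtf32()`. – 2014-09-30 17:51:12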
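
A rough sketch of the loop this comment describes, assuming the XE2-era `System.Character` API in which `TCharacter.IsSurrogatePair` and `TCharacter.ConvertToUtf32` accept a string plus a 1-based index. The program name and the `EscapeNonAscii` helper are hypothetical. Newer Delphi versions flag `TCharacter` as deprecated in favour of the Char record helper, so treat this as an illustration rather than a drop-in routine.

```pascal
program EntityLoop;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

// Hypothetical helper: keeps 7-bit ASCII as-is and turns everything else
// into a numeric character reference, without building a UCS4String.
function EscapeNonAscii(const AInput: UnicodeString): string;
var
  I: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(AInput) do
  begin
    if Ord(AInput[I]) < $80 then
    begin
      Result := Result + AInput[I];
      Inc(I);
    end
    else if TCharacter.IsSurrogatePair(AInput, I) then
    begin
      // Two UTF-16 units form one code point above the BMP
      Result := Result + Format('&#%d;',
        [Integer(TCharacter.ConvertToUtf32(AInput, I))]);
      Inc(I, 2);
    end
    else
    begin
      // Single BMP unit (an unpaired surrogate would also land here)
      Result := Result + Format('&#%d;', [Ord(AInput[I])]);
      Inc(I);
    end;
  end;
end;

begin
  // Same text as the <b> node in the question, with the pair written as escapes
  WriteLn(EscapeNonAscii('Need to decode this -> ' + #$D83D#$DE09));
end.
```

The string is only walked once and the result is built incrementally, which is the memory saving the comment is after. An unpaired surrogate is emitted as its own entity here, which is not valid XML; a stricter version would raise an exception instead.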

@DavidHeffernan True, but that doesn't matter, because the XML parser converts it to Delphi's internal representation anyway (WideString for Delphi 7), doesn't it? – 2014-10-01 09:02:38