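PHP DOM UTF-8 problem

First of all, my database uses Windows-1250 as its native charset. I output the data as UTF-8. I use the iconv() function all over my website to convert Windows-1250 strings to UTF-8 strings, and it works perfectly.

The problem arises when I use PHP DOM to parse some HTML stored in the database (the HTML is output from a WYSIWYG editor and is not valid; it has no html, head, body tags etc.).

The HTML can look something like this, for example:

<p>Hello</p>

Here is the method I use to parse some HTML from the database: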

private function ParseSlideContent($slideContent)
{
    var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

    $doc = new DOMDocument('1.0', 'UTF-8');

    // hack to preserve UTF-8 characters
    $html = iconv('Windows-1250', 'UTF-8', $slideContent);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $doc->preserveWhiteSpace = false;

    foreach ($doc->getElementsByTagName('img') as $t) {
        $path = trim($t->getAttribute('src'));
        $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
    }
    foreach ($doc->getElementsByTagName('object') as $o) {
        foreach ($o->getElementsByTagName('param') as $p) {
            $path = trim($p->getAttribute('value'));
            $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        }
    }
    foreach ($doc->getElementsByTagName('embed') as $e) {
        if (true === $e->hasAttribute('pluginspage')) {
            $path = trim($e->getAttribute('src'));
            $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        } else {
            $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
            $path = 'data/media/video/' . $path;
            $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
            $width = $e->getAttribute('width') . 'px';
            $height = $e->getAttribute('height') . 'px';
            $a = $doc->createElement('a', '');
            $a->setAttribute('href', $path);
            $a->setAttribute('style', "display:block;width:$width;height:$height;");
            $a->setAttribute('class', 'player');
            $e->parentNode->replaceChild($a, $e);
            $this->slideContainsVideo = true;
        }
    }

    $html = trim($doc->saveHTML());

    $html = explode('<body>', $html);
    $html = explode('</body>', $html[1]);
    return $html[0];
}

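The output of the above method is garbage: all special characters are replaced with weird stuff like ÄÄÄÄÄ.

One more thing. It does work on my development server.

It does not work on the production server, though.

Any suggestions?

PHP version on the production server: PHP Version 5.2.0RC4-dev

PHP version on the development server: PHP Version 5.2.13

UPDATE:

I have been working on a solution myself. I got my inspiration from this PHP bug report (not really a bug, though): http://bugs.php.net/bug.php?id=32547

This is the solution I came up with. I will try it tomorrow and let you know whether it works: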

private function ParseSlideContent($slideContent)
{
    var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

    $doc = new DOMDocument('1.0', 'UTF-8');

    // hack to preserve UTF-8 characters
    $html = iconv('Windows-1250', 'UTF-8', $slideContent);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $doc->preserveWhiteSpace = false;

    // this might work
    // it basically just adds head and meta tags to the document
    $html = $doc->getElementsByTagName('html')->item(0);
    $head = $doc->createElement('head', '');
    $meta = $doc->createElement('meta', '');
    $meta->setAttribute('http-equiv', 'Content-Type');
    $meta->setAttribute('content', 'text/html; charset=utf-8');
    $head->appendChild($meta);
    $body = $doc->getElementsByTagName('body')->item(0);
    $html->removeChild($body);
    $html->appendChild($head);
    $html->appendChild($body);

    foreach ($doc->getElementsByTagName('img') as $t) {
        $path = trim($t->getAttribute('src'));
        $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
    }
    foreach ($doc->getElementsByTagName('object') as $o) {
        foreach ($o->getElementsByTagName('param') as $p) {
            $path = trim($p->getAttribute('value'));
            $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        }
    }
    foreach ($doc->getElementsByTagName('embed') as $e) {
        if (true === $e->hasAttribute('pluginspage')) {
            $path = trim($e->getAttribute('src'));
            $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        } else {
            $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
            $path = 'data/media/video/' . $path;
            $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
            $width = $e->getAttribute('width') . 'px';
            $height = $e->getAttribute('height') . 'px';
            $a = $doc->createElement('a', '');
            $a->setAttribute('href', $path);
            $a->setAttribute('style', "display:block;width:$width;height:$height;");
            $a->setAttribute('class', 'player');
            $e->parentNode->replaceChild($a, $e);
            $this->slideContainsVideo = true;
        }
    }

    $html = trim($doc->saveHTML());

    $html = explode('<body>', $html);
    $html = explode('</body>', $html[1]);
    return $html[0];
}

Source

2010-08-23 Richard Knop

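Are you sure you are sending the appropriate Content-type header? I.e., if you open the page in Firefox, check that View -> Character Encoding is set to UTF-8. – 2010-08-23 15:17:03

Have you tried the save method: $doc->save(); – 2010-08-23 15:54:28

@Cem I will try it. Give me a few minutes. – 2010-08-23 16:33:40

Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file to UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

Takes the charset specified in a meta http-equiv for "content-type".
Otherwise assumes ISO-8859-1.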
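To see what the ISO-8859-1 fallback does to UTF-8 input, here is a minimal sketch that uses only iconv() and no DOM. It assumes the affected text contains Central European letters such as "č"; the sample byte is an illustration, not data from the question. Reading UTF-8 bytes as ISO-8859-1 turns every lead byte 0xC4 into "Ä", which is one plausible explanation for the ÄÄÄÄÄ garbage.

```php
<?php
// "č" (U+010D) is the single byte 0xE8 in Windows-1250.
$win1250 = "\xE8";

// Correct conversion: the UTF-8 form of "č" is the two bytes C4 8D.
$utf8 = iconv('Windows-1250', 'UTF-8', $win1250);
echo bin2hex($utf8), "\n"; // c48d

// Simulate a parser that treats those UTF-8 bytes as ISO-8859-1:
// each byte becomes its own character, 0xC4 => "Ä" and 0x8D => U+008D.
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo bin2hex($misread), "\n"; // c384c28d
```

If the broken production output shows a byte pattern like the second one, the document was most likely parsed without a usable charset declaration.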

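I suggest you instead convert from Windows-1250 to ISO-8859-1 and prepend nothing.

EDIT The suggestion above is not very good, because Windows-1250 has characters that do not exist in ISO-8859-1. Since you are dealing with fragments that have no meta element for the content type, you can add your own to force interpretation as UTF-8: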

<?php 
//script and output are in UTF-8 

/* Simulate HTML fragment in Windows-1250 */ 
$html = <<<XML 
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p> 
XML; 
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert 

/* Append meta header to force UTF-8 interpretation and convert into UTF-8 */ 
$htmlInterm = 
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" . 
    iconv("Windows-1250", "UTF-8", $htmlInterm); 

/* Omit libxml warnings */ 
libxml_use_internal_errors(true); 

/* Build DOM */ 
$d = new domdocument; 
$d->loadHTML($htmlInterm); 
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

Gives:

 
string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
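The same idea can be wrapped in a pair of small helpers. This is only a sketch: loadUtf8Fragment() and bodyInnerHtml() are hypothetical names that do not appear in the thread, and the input is assumed to be already converted to UTF-8 (for example with iconv()). Serializing the children of body with saveXML() is a swapped-in alternative to splitting the saveHTML() output on '<body>'; it emits XML-style markup, so empty elements come out self-closed.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment by prepending a
// meta element so that the parser reads the bytes as UTF-8.
function loadUtf8Fragment($fragment)
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc = new DOMDocument();
    $doc->loadHTML($meta . $fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Hypothetical helper: return the markup inside <body> by serializing
// each child node, instead of exploding the full document on '<body>'.
function bodyInnerHtml(DOMDocument $doc)
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

// Usage: "\xC4\x8D" is the UTF-8 encoding of "č".
$doc = loadUtf8Fragment("<p>\xC4\x8Desky</p>");
echo bodyInnerHtml($doc), "\n";
```

Any DOM manipulation (rewriting src attributes and so on) would go between the two calls.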

Source

2010-08-23 18:43:39 Artefacto

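If you had ever worked with non-English data (cp1250 or others), you would know that this kind of hack is sometimes the only way to make PHP DOM preserve UTF-8 special characters. It is also mentioned in the PHP documentation. You can try creating a cp1250 database, fetching some data from it and parsing it with PHP DOM. It's a real pain. – 2010-08-23 19:43:00

@Rich "It is also mentioned in the PHP documentation." Please link. User notes are not part of the documentation. – Artefacto 2010-08-23 19:55:18

@Artefacto Here it is, in a user comment (http://www.php.net/manual/en/domdocument.loadhtml.php). It is the third comment from the top. I know it's not official, but sometimes it's the only way. This is not the only time the Windows-1250 + PHP DOM combination has given me a headache. Still, I just slept on it for a while and I have an idea how to solve this (no idea whether it will work). I will try it tomorrow. If it doesn't work, I might start a bounty for this question. – 2010-08-23 20:04:22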

Two solutions.

You can set the encoding as a header:

<?php header("Content-Type: text/html; charset=utf-8"); ?>

Or you can set it as a META tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

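Edit: in the event that both of these are set correctly, do the following:

Create a small page that has a UTF-8 character in it.
Write the page using the same method you already use.
Use Fiddler or Wireshark to examine the raw bytes being transferred in both the DEV and PROD environments. You can also double-check the headers with Fiddler/Wireshark.

If you are convinced that the correct headers are being sent, then your best chance of finding the error is to start looking at the raw bytes. Identical bytes sent to identical browsers will yield identical results, so you need to start looking for why they are not identical. Fiddler/Wireshark will help with that.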
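If a packet sniffer is not at hand, the bytes can also be dumped from PHP itself on both servers and compared. A minimal sketch, where hexDump() is a hypothetical helper and not part of the original answer:

```php
<?php
// Hypothetical helper: render a string as space-separated hex bytes so
// the DEV and PROD output can be compared byte for byte.
function hexDump($bytes)
{
    return implode(' ', str_split(bin2hex($bytes), 2));
}

// "\xC4\x8D" is the UTF-8 encoding of "č".
echo hexDump("\xC4\x8D"), "\n"; // c4 8d
```

Calling it on the string right before it is echoed on each server shows whether the two environments already differ before anything is sent to the browser.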

Source

2010-08-23 15:21:32 riwalk

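I don't think this will solve the problem, if it really works with var_dump – 2010-08-23 15:53:47

He mentioned that it works on his development server, which means the correct bytes are most likely being written. From there, the most likely problem is that the bytes are not being read correctly, and this should solve that. – riwalk 2010-08-23 16:00:37

The headers are sent correctly. The correct meta tag is there as well. – 2010-08-23 16:14:41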

I had the same problem. My fix was to use Notepad++ and set the encoding of the PHP file to "UTF-8 without BOM". Hope this helps someone else.
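Whether a file or string starts with a UTF-8 byte order mark can also be checked in code. A small sketch, with hasUtf8Bom() and stripUtf8Bom() as hypothetical helper names:

```php
<?php
// The UTF-8 byte order mark is the three bytes EF BB BF.
function hasUtf8Bom($bytes)
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Remove a leading BOM if present; otherwise return the input unchanged.
function stripUtf8Bom($bytes)
{
    // The cast covers older PHP versions, where substr() can return false.
    return hasUtf8Bom($bytes) ? (string) substr($bytes, 3) : $bytes;
}

var_dump(hasUtf8Bom("\xEF\xBB\xBF<?php")); // bool(true)
var_dump(hasUtf8Bom("<?php"));             // bool(false)
```

Running hasUtf8Bom(file_get_contents($path)) over the project's PHP files is one way to find the ones an editor saved with a BOM; $path here stands for whichever file is being checked.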

Source

2013-08-18 21:03:16 user2494874
