PHP's html_entity_decode and the HTML <a> tag

I want to use MediaWiki's API to fetch articles in XML format and include them on my page. I wrote some simple code that uses a ?action=parse&page=Page_Name&format=xml request to get the XML representation of an article. The code is as follows:

if (!isset($_GET["page"]) || $_GET["page"] == '') die("Page not specified (possibly direct call)");
$pagename = $_GET["page"];
$handle = @fopen("mediawiki/api.php?action=parse&page=".$pagename."&format=xml", "r");
if ($handle) {
    // Read the whole API response into a string
    $buffer = '';
    while (!feof($handle)) {
        $buffer .= fgets($handle);
    }
    // Decoding here turns the escaped article HTML into real XML elements
    $buffer = html_entity_decode($buffer);
    /*
    echo $buffer;
    */
    $xml = simplexml_load_string($buffer);
    foreach ($xml->parse->children() as $child) {
        switch ($child->getName()) {
            case "text":
                echo $child->asXML()."<br/>";
                break;
            case "categories":
                echo "<h3>Categories this project is related to: </h3><br/>";
                foreach ($child->children() as $grandChild) {
                    echo $grandChild." | ";
                }
                break;
        }
    }
    fclose($handle);
}

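The problem is that I am getting strange output. Every <a name="" href=""></a> becomes <a name="" href=""/>, which turns all of the following text into a link (I assume because there is no closing </a> tag). This can be seen in both Mozilla Firefox and Google Chrome. I suspect that $buffer = html_entity_decode($buffer); is causing the problem. Is there a parameter of html_entity_decode() that I should specify to avoid this? Is it caused by some other mistake in my code, or by a bug in html_entity_decode()?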

(To see the XML output of the wiki's API, you can try http://en.wikipedia.org/w/api.php?action=parse&page=No_Such_Page&format=xml with a different page parameter.)

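Possible solution: I did not want to switch to JSON, as Jordan suggested, so I came up with this instead. I simply moved html_entity_decode into the case "text": block, so that block now contains echo html_entity_decode($child->asXML())."<br/>";. Do you think this is workable enough?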
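
The change described above can be sketched roughly as follows. This is only an illustration: the $buffer string is a hypothetical stand-in for the API response, assumed to carry the article HTML entity-escaped inside <text>, and the variable names are made up for the example.

```php
<?php
// Hypothetical stand-in for the API response (assumption: the article HTML
// arrives entity-escaped inside the <text> element).
$buffer = '<api><parse><text>&lt;a name="top" href="#top"&gt;&lt;/a&gt;Intro</text></parse></api>';

// Parse first, without decoding the whole document, so the escaped HTML
// stays plain text and SimpleXML never sees the <a> element as XML.
$xml = simplexml_load_string($buffer);
$text = $xml->parse->text;

// asXML() returns the element with its text content still escaped,
// so the entities are decoded only at output time.
$decoded = html_entity_decode($text->asXML());
echo $decoded, "<br/>";
```

Because the empty <a></a> pair is never an XML element in this version, the serializer has nothing to collapse.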


2009-12-11 Azimuth

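@Azimuth, there you go! Paste it into the textarea, select it and press Ctrl-K to indent it all by 4 spaces (or, in the case of this code, the relevant parts were already indented by 4 spaces, so I just copied and pasted it) – 2009-12-11 16:55:44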

<a></a> is an empty element, which in XML can be self-closed to <a />. – 2009-12-11 16:56:47

@Dominic is that the browsers' problem then? Because as I wrote both FF and Chrome output it so that all text becomes a link... Thanks for putting the code – Azimuth 2009-12-11 16:58:25

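The problem is not html_entity_decode(). The problem is that SimpleXML treats the content of the <text> element as XML rather than as text. By default, SimpleXML collapses empty elements (<a></a> becomes <a />). One way around this is to import the SimpleXML object into a DOM object and use the LIBXML_NOEMPTYTAG option when saving the output. The drawback of this option is that any <br /> element will be output as <br></br>.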
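
A minimal sketch of the DOM route described above, assuming a small hand-written fragment in place of the real API output (the sample markup and variable names are illustrative, not taken from the question):

```php
<?php
// Illustrative fragment: an empty <a> element and a <br/> inside <text>.
$xml = simplexml_load_string(
    '<api><text><a name="top" href="#top"></a>Intro<br/></text></api>'
);

// Default SimpleXML serialization collapses the empty <a> element.
$collapsed = $xml->text->asXML();

// Import the same node into DOM and serialize it with LIBXML_NOEMPTYTAG,
// which writes empty elements as explicit open/close pairs.
$node = dom_import_simplexml($xml->text);
$expanded = $node->ownerDocument->saveXML($node, LIBXML_NOEMPTYTAG);

echo $collapsed, "\n";
echo $expanded, "\n";
```

The second line of output keeps the </a> closing tag, but the <br/> is expanded as well, which is the trade-off mentioned above.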

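A simpler option is to use a different response format from the API. I would suggest using the json response format and parsing the response with the json_decode() function.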
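
A rough sketch of the JSON approach, using a hard-coded sample string instead of a live request. The payload shape is an assumption (the layout where the HTML sits under parse.text["*"] and each category name under a "*" key); the exact structure can differ depending on the API version and options, so check a real response before relying on it.

```php
<?php
// Sample payload standing in for api.php?action=parse&page=...&format=json.
// Assumed shape: HTML under parse.text["*"], category names under "*".
$response = '{"parse":{"text":{"*":"<p><a name=\"top\" href=\"#top\"></a>Intro</p>"},'
          . '"categories":[{"*":"PHP"},{"*":"XML"}]}}';

// Decode into associative arrays; json_decode() returns null on invalid JSON.
$data = json_decode($response, true);
if (!is_array($data) || !isset($data['parse'])) {
    die("Unexpected API response");
}

// The HTML is an ordinary string here, so no XML serializer rewrites it.
$html = $data['parse']['text']['*'] ?? '';
$categories = array_column($data['parse']['categories'] ?? [], '*');

echo $html, "<br/>";
echo "<h3>Categories this project is related to: </h3>";
echo implode(" | ", $categories);
```

Since the article HTML never passes through an XML parser, the empty <a></a> pair reaches the browser unchanged.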


2009-12-11 17:02:08

Thanks for your answer. I think you are right. – Azimuth 2009-12-11 17:17:46

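This is not strange output, it is valid XML. When you have an empty tag, XML allows you to use a short closing syntax that is not always valid in HTML or XHTML: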

<foo></foo> 
<foo />

The html_entity_decode() function converts HTML entities, for example:

&gt; converts to 
>

You need to post-process your XML fragment and convert it into proper HTML. The easiest way to do that is with the DomDocument API.

// Load the fragment with the HTML parser, then serialize it back as HTML
$foo = new DOMDocument();
$foo->loadHTML('<p> Testing <a href="" /> </p>');
echo $foo->saveHTML();

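This takes an XML fragment and converts it into an HTML document, which includes fixing all of the self-closing tags. You will still need to parse out the contents of <body/>, but that is much easier than fixing all of the self-closing tags yourself.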
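
One possible way to pull out just the contents of <body>, sketched under the assumption that the fragment is the same test string as above (the variable names are illustrative). Exactly where the parser places the closing </a> may vary with the libxml version, so treat the output as indicative only.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<p> Testing <a href="" /> </p>');
libxml_clear_errors();

// loadHTML() wraps the fragment in <html><body>; serialize only the
// children of <body> to get the repaired fragment back.
$body = $doc->getElementsByTagName('body')->item(0);
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner;
```

Passing a node to saveHTML() requires PHP 5.3.6 or later; on older versions the whole document has to be serialized and trimmed by hand.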


2009-12-11 17:03:59

@Alan, please read my comment on the first answer – Azimuth 2009-12-11 17:09:04
