PHP - Handling HTML entities that are missing a semicolon

I am trying to write a script that parses a remote RSS feed and outputs the result as JSON.

The original RSS feed contains HTML entities such as &#8211; (–) and &#8230; (…).

My original code runs html_entity_decode first so that json_encode produces the correct output:

$rss = new DOMDocument();
$rss->load('https://www.example.com/feed');

// Collect title, description, link and date from every <item> in the feed.
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    $item = array(
        // Decode HTML entities so json_encode emits the real characters.
        'title' => html_entity_decode($node->getElementsByTagName('title')->item(0)->nodeValue, ENT_COMPAT, 'UTF-8'),
        'desc'  => html_entity_decode($node->getElementsByTagName('description')->item(0)->nodeValue, ENT_COMPAT, 'UTF-8'),
        'link'  => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'date'  => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
    );
    $feed[] = $item;
}

// Reshape each item and format the date for output.
$data = array();
foreach ($feed as $item) {
    $data[] = array(
        'url'   => $item['link'],
        'date'  => date('l, F d, Y g:i A', strtotime($item['date'])),
        'title' => $item['title'],
        'desc'  => $item['desc'],
    );
}
echo json_encode($data);

It works well, except that some of the HTML entities are missing their trailing semicolon, and html_entity_decode does not recognize those.
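The behavior described above can be reproduced with a minimal sketch. The sample strings are made up for illustration and are not taken from the actual feed:

```php
<?php
// Illustrative inputs (assumed, not from the real feed): the same numeric
// entity with and without its terminating semicolon.
$complete = 'A &#8211; B';
$broken   = 'A &#8211 B';

$decodedComplete = html_entity_decode($complete, ENT_COMPAT, 'UTF-8');
$decodedBroken   = html_entity_decode($broken, ENT_COMPAT, 'UTF-8');

// The complete entity becomes an en dash (U+2013, bytes E2 80 93 in UTF-8).
var_dump($decodedComplete === "A \xE2\x80\x93 B");

// The entity without ';' is left exactly as it was.
var_dump($decodedBroken === $broken);
```

Both comparisons should evaluate to true, which is why the unterminated entities end up verbatim in the JSON output.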

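I was thinking that I could use a regular expression to find and fix the entities that have no semicolon, but I don't know how to write one. Any ideas?

Or is there another way to solve this problem?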

Source

2016-09-28 Shawn

Some samples would help!

So far I have seen '&#8211' (–) and '&#8230' (…). Sometimes they have the semicolon, sometimes they don't. – Shawn

It looks like you just want to match &# followed by 4 digits that are not followed by a ;. Use

'~&#\d{4}(?!;)~'

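and replace with $0;. See the regex demo.

Details:

&# - the literal character sequence &#
\d{4} - exactly 4 digits
(?!;) - a negative lookahead that fails the match if a ; immediately follows the 4 digits.

The $0 in the replacement pattern is a backreference to the whole match value.

PHP code snippet: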

$re = '~&#\d{4}(?!;)~'; 
$str = '&#8211&#8210&#8211;&#8211;'; 
$subst = '$0;'; 
$result = preg_replace($re, $subst, $str);
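
One caveat, offered as an aside rather than as part of the original answer: because the quantifier is fixed at {4}, an entity with more digits (for example &#128512;) would have a semicolon inserted after its fourth digit. A possible variant, sketched below under the assumption that only decimal entities need repair, matches any number of digits and also refuses to match when the next character is another digit. The function name add_missing_semicolons is hypothetical.

```php
<?php
// Hypothetical helper (not from the thread): append a missing ';' to
// decimal numeric entities of any length.
function add_missing_semicolons(string $text): string
{
    // &#         literal "&#"
    // \d+        one or more digits
    // (?![\d;])  not followed by another digit (so the engine cannot
    //            backtrack to a shorter number) and not followed by ';'
    return preg_replace('~&#\d+(?![\d;])~', '$0;', $text) ?? $text;
}

// Same sample string as the snippet above.
$fixed = add_missing_semicolons('&#8211&#8210&#8211;&#8211;');

// A six-digit entity that is already terminated stays untouched...
$long = add_missing_semicolons('&#128512;');

// ...whereas the fixed-width pattern splits it after four digits.
$fixedWidth = preg_replace('~&#\d{4}(?!;)~', '$0;', '&#128512;');

var_dump($fixed, $long, $fixedWidth);
```

For feeds that only ever contain 4-digit entities such as &#8211 and &#8230, the two patterns behave the same.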

Source

2016-09-28 18:56:12

Works perfectly! – Shawn

preg_replace("/&#(\d{4})(?!;)/i", "&#$1;", $item['desc']);
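
As a hedged sketch of how this call might slot into the decoding step from the question: the helper name decode_feed_text and the sample title are assumptions made for illustration. Note that preg_replace returns the modified string rather than changing its argument, so the result has to be captured.

```php
<?php
// Hypothetical helper: repair 4-digit numeric entities that lack ';'
// (same pattern as the one-liner above), then decode all entities.
function decode_feed_text(string $raw): string
{
    // preg_replace returns a new string (or null on error); it does not
    // modify $raw in place, so keep the return value.
    $repaired = preg_replace('/&#(\d{4})(?!;)/', '&#$1;', $raw);

    return html_entity_decode($repaired ?? $raw, ENT_COMPAT, 'UTF-8');
}

// Made-up sample title with two unterminated entities.
$title = decode_feed_text('Monday &#8211 Friday&#8230');

// JSON_UNESCAPED_UNICODE keeps the decoded characters readable in the output.
echo json_encode(array('title' => $title), JSON_UNESCAPED_UNICODE), "\n";
```

In the question's loop, the two html_entity_decode calls for 'title' and 'desc' could be swapped for a call to such a helper.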

Source

2016-09-28 19:04:25 Mark

Please add some text or explanation to the answer to make it easier to understand.
