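Link problem when trying to convert HTML to XML
I want to convert an HTML file to XML. It is mostly working. The problem I am having is with links: right now the conversion seems to completely ignore the link in my test file.
Here is the conversion code: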
<?php
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
function convertToXML()
{
    $titleLength = 35;
    $output = "";
    $date = date("D, j M Y G:i:s T");
    $fi = fopen("../newsTEST.htm", "r");
    $fo = fopen("../newsfeed.xml", "w");

    //This is the first part of the XML
    $output .= "<?xml version=\"1.0\"?>\n";
    $output .= "<rss version=\"2.0\">\n";
    $output .= "<channel>\n";
    $output .= "\t<title>Wiggle 100 News</title>\n";
    $output .= "\t<link>http://www.wiggle100.com/news.php</link>\n";
    $output .= "\t<description>Wiggle 100 Daily News</description>\n";
    $output .= "\t<language>en-us</language>\n";
    $output .= "\t<pubDate>". $date ."</pubDate>\n";
    $output .= "\t<managingEditor>[email protected]</managingEditor>\n";
    $output .= "\t<webMaster>[email protected]</webMaster>\n";

    $article = "";
    $skip = true; //if false will continue to put lines into output until </p>
    $newArticle = false;

    while(!feof($fi))
    {
        $line = fgets($fi);
        $link = "";

        if(strpos($line, "<p") !== false)
        {
            $pos = strpos($line, "<p");
            $line = substr($line, $pos);
            $pos = strpos($line, ">");
            $line = substr($line, $pos + 1);
            $skip = false;
        }

        if(strpos($line, "</p>") !== false)
        {
            $pos = strpos($line, "</p>");
            $line = substr($line, 0, $pos - 1);
            $newArticle = true;
        }

        //This adds the line to the article
        if(!$skip)
        {
            $article .= $line;
        }

        //This mixes the article, title, link, and date with
        // XML and puts it into the output
        if($newArticle)
        {
            //This if is to get rid of stuff like <p> </p>
            if((strlen($article) > 10))
            {
                $link = findLink($article);
                //$article = strip_tags($article);
                $title = substr($article, 0, $titleLength) . "...";
                $output .= "\t<item>\n";
                $output .= "\t\t<title>". $title ."</title>\n";
                $output .= "\t\t<link>". $link ."</link>\n";
                $output .= "\t\t<description>". $article . "</description>\n";
                $output .= "\t\t<pubDate>". $date . "</pubDate>\n";
                $output .= "\t</item>\n\n";
            }
            $article = "";
            $line = "";
            $skip = true;
        }
    }

    $output .= "</channel>\n";
    $output .= "</rss>\n";
    fwrite($fo, $output);
    fclose($fi);
    fclose($fo);
    echo "<br /><br /> News converted to XML";
}
//*****************************************************************************
//*****************************************************************************
//Find and return a link in the input.
//Otherwise use the default link.
function findLink($input)
{
    $link = "http://www.wiggle100.com/news.php";
    if(strpos($input, "<a") !== false)
    {
        $startpos = strpos($input, "href");
        $link = substr($input, $startpos + 5);
        $endpos = strpos($link, ">");
        $link = substr($link, 0, $endpos - 2);
    }
    return $link;
}
?>
Here is the HTML test file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>Test Page</title>
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812">
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head>
<body bgcolor="#ffffff">
<p> </p>
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p> </p>
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p> </p>
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font>
<a href="http://www.thedailyreview.com/news/">
http://www.thedailyreview.com/news/</a></p>
</body>
</html>
Here is the XML output:
<rss version="2.0">
<channel>
    <title>Wiggle 100 News</title>
    <link>http://www.wiggle100.com/news.php</link>
    <description>Wiggle 100 Daily News</description>
    <language>en-us</language>
    <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
    <managingEditor>[email protected]</managingEditor>
    <webMaster>[email protected]</webMaster>
    <item>
        <title>This is an article. Blah. Blah. Bla...</title>
        <link>http://www.wiggle100.com/news.php</link>
        <description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description>
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
    </item>
    <item>
        <title>This is another article. Blah. Blah...</title>
        <link>http://www.wiggle100.com/news.php</link>
        <description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description>
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
    </item>
    <item>
        <title>This is the 3rd article. Blah. Blah...</title>
        <link>http://www.wiggle100.com/news.php</link>
        <description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description>
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
    </item>
    <item>
        <title><font size="6">This is the news for...</title>
        <link>http://www.wiggle100.com/news.php</link>
        <description><font size="6">This is the news for today. Blah Blah Blah!</font>
        </description>
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
    </item>
</channel>
</rss>
The font tags disappear when I uncomment strip_tags().
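One point about ordering is worth spelling out: strip_tags() also removes the <a> tag, so any href has to be read before the markup is stripped, and the remaining text should be escaped before it goes into the feed. The sketch below illustrates that order only. buildItemFields() is a hypothetical helper that is not part of the script above, and the regular expression assumes a quoted href attribute.

```php
<?php
// Hypothetical helper (not in the original script): builds the fields for
// one <item> from the raw HTML of a single article.
function buildItemFields(string $article, int $titleLength = 35): array
{
    // 1. Read the href while the <a> tag is still present.
    //    Assumes the attribute value is quoted.
    $link = null;
    if (preg_match('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $article, $m)) {
        $link = $m[1];
    }

    // 2. Only now remove the markup and collapse runs of whitespace.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($article)));

    // 3. Truncate the plain text, so the title cannot end inside a tag.
    $title = substr($text, 0, $titleLength) . '...';

    // 4. Escape everything that will be written between XML tags.
    $escape = function (string $s): string {
        return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    };

    return [
        'link'        => $link === null ? null : $escape($link),
        'title'       => $escape($title),
        'description' => $escape($text),
    ];
}

$fields = buildItemFields(
    "<font size=\"6\">This is the news for today. Blah Blah Blah!</font>\n"
    . "<a href=\"http://www.thedailyreview.com/news/\">\n"
    . "http://www.thedailyreview.com/news/</a>"
);
echo $fields['link'], "\n";
```

A null link means no anchor was found, so the caller can fall back to the default news page URL.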
Instead of parsing the HTML as a string, you can use an HTML parser in PHP. http://www.onderstekop.nl/articles/114/ – Xinus 2009-10-24 04:53:47
Why the downvote? – 2009-10-24 22:58:16
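To make the parser suggestion concrete, here is a minimal sketch using PHP's built-in DOMDocument class rather than the library linked in the comment. It assumes the dom extension is available. extractArticles() and its array keys are hypothetical names chosen for illustration, and the 10-character cutoff simply mirrors the check in the question's code.

```php
<?php
// Hypothetical helper: returns one entry per non-trivial <p> element,
// with its plain text and the first link found inside it.
function extractArticles(string $html, string $defaultLink): array
{
    $doc = new DOMDocument();

    // Old-style markup such as <font> can trigger parser warnings;
    // collect them internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $articles = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent is the paragraph text with all tags removed.
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if (strlen($text) <= 10) {
            continue; // skip filler such as <p> </p>
        }

        // Use the first <a href="..."> inside the paragraph, if any.
        $link = $defaultLink;
        $anchors = $p->getElementsByTagName('a');
        if ($anchors->length > 0 && $anchors->item(0)->getAttribute('href') !== '') {
            $link = $anchors->item(0)->getAttribute('href');
        }

        $articles[] = ['text' => $text, 'link' => $link];
    }
    return $articles;
}

$html = file_exists('../newsTEST.htm') ? file_get_contents('../newsTEST.htm') : '';
if ($html !== '') {
    print_r(extractArticles($html, 'http://www.wiggle100.com/news.php'));
}
```

Because the parser works on the whole document, a paragraph that spans several lines is handled as one element, and the text and link values would still need XML escaping before being written into the feed.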