2009-10-24 76 views
-1

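Link problem when trying to convert HTML to XML

I want to convert an HTML file to XML. It mostly works. The problem I am having is with links: right now it seems to completely ignore the link in my test file.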

Here is the conversion code:

<?php 
ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL); 

function convertToXML() 
{ 

    $titleLength = 35; 
    $output = ""; 
    $date = date("D, j M Y G:i:s T"); 
    $fi = fopen("../newsTEST.htm", "r"); 
    $fo = fopen("../newsfeed.xml", "w"); 

    //This is the first parts of the XML 
    $output .= "<?xml version=\"1.0\"?>\n"; 
    $output .= "<rss version=\"2.0\">\n"; 
    $output .= "<channel>\n"; 
    $output .= "\t<title>Wiggle 100 News</title>\n"; 
    $output .= "\t<link>http://www.wiggle100.com/news.php</link>\n"; 
    $output .= "\t<description>Wiggle 100 Daily News</description>\n"; 
    $output .= "\t<language>en-us</language>\n"; 
    $output .= "\t<pubDate>". $date ."</pubDate>\n"; 
    $output .= "\t<managingEditor>[email protected]</managingEditor>\n"; 
    $output .= "\t<webMaster>[email protected]</webMaster>\n"; 

    $article = ""; 
    $skip = true; //if false will continue to put lines into output until </p> 
    $newArticle = false; 

    while(!feof($fi)) 
    { 
     $line = fgets($fi); 
     $link = ""; 

     if(strpos($line, "<p") !== false) 
     { 
      $pos = strpos($line, "<p"); 
      $line = substr($line, $pos); 

      $pos = strpos($line, ">"); 
      $line = substr($line, $pos + 1); 

      $skip = false;   
     } 

     if(strpos($line, "</p>") !== false) 
     { 
      $pos = strpos($line, "</p>"); 
      $line = substr($line, 0, $pos - 1); 

      $newArticle = true; 
     } 

     //This adds the line to the article 
     if(!$skip) 
     { 
      $article .= $line; 
     } 

     //This mixes the article, title, link, and date with 
     // XML and puts it into the output 
     if($newArticle) 
     { 
      //This if is to get rid of stuff like <p>&nbsp;</p> 
      if((strlen($article) > 10)) 
      { 
       $link = findLink($article); 
       //$article = strip_tags($article); 
       $title = substr($article, 0, $titleLength) . "..."; 

       $output .= "\t<item>\n"; 
       $output .= "\t\t<title>". $title ."</title>\n"; 
       $output .= "\t\t<link>". $link ."</link>\n"; 
       $output .= "\t\t<description>". $article . "</description>\n"; 
       $output .= "\t\t<pubDate>". $date . "</pubDate>\n"; 
       $output .= "\t</item>\n\n"; 
      } 

      $article = ""; 
      $line = ""; 
      $skip = true; 
     } 
    } 

    $output .= "</channel>\n"; 
    $output .= "</rss>\n"; 

    fwrite($fo, $output); 

    fclose($fi); 
    fclose($fo); 

    echo "<br /><br /> News converted to XML"; 
} 

    //***************************************************************************** 
    //***************************************************************************** 

    //Find and return a link in the input. 
    //Otherwise fall back to the default link. 
    function findLink($input) 
    { 
     $link = "http://www.wiggle100.com/news.php"; 

     if(strpos($input, "<a") !== false) 
     { 
      $startpos = strpos($input, "href"); 
      $link = substr($input, $startpos + 5); 
      $endpos = strpos($link, ">"); 
      $link = substr($link, 0, $endpos - 2); 
     } 
     return $link; 
    } 


?> 

Here is the HTML test file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> 
<a href="http://www.thedailyreview.com/news/"> 
http://www.thedailyreview.com/news/</a></p> 
</body> 
</html> 

Here is the XML output:

<rss version="2.0"> 
<channel> 
    <title>Wiggle 100 News</title> 
    <link>http://www.wiggle100.com/news.php</link> 
    <description>Wiggle 100 Daily News</description> 
    <language>en-us</language> 
    <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    <managingEditor>[email protected]</managingEditor> 
    <webMaster>[email protected]</webMaster> 
    <item> 
     <title>This is an article. Blah. Blah. Bla...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title>This is another article. Blah. Blah...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title>This is the 3rd article. Blah. Blah...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title><font size="6">This is the news for...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description><font size="6">This is the news for today. Blah Blah Blah!</font> 
</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

</channel> 
</rss> 

The font tags go away when I uncomment the strip_tags() call.

+3

Instead of parsing the HTML as a string, you can use an HTML parser in PHP. http://www.onderstekop.nl/articles/114/ – Xinus 2009-10-24 04:53:47
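The parser-based approach suggested in this comment could look roughly like the sketch below. It uses PHP's bundled DOM extension (DOMDocument) rather than the library from the linked article, and extractArticles() is a hypothetical helper name that does not appear in the original script. It assumes each <p> element is one article and that the first <a> inside it supplies the item link.

```php
<?php
// Sketch only: collect each paragraph's text and its first link using DOMDocument.
// extractArticles() is a hypothetical helper, not part of the original script.
function extractArticles(string $html, string $defaultLink): array
{
    $doc = new DOMDocument();

    // Legacy markup such as <font> triggers libxml warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $articles = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // &nbsp; is decoded to U+00A0, so turn it into a plain space before trimming.
        $text = trim(str_replace("\u{00A0}", ' ', $p->textContent));

        // Same threshold as the original script: skip fillers like <p>&nbsp;</p>.
        if (strlen($text) <= 10) {
            continue;
        }

        // Use the first anchor's href if there is one, else the default link.
        $link = $defaultLink;
        $anchors = $p->getElementsByTagName('a');
        if ($anchors->length > 0 && $anchors->item(0)->getAttribute('href') !== '') {
            $link = $anchors->item(0)->getAttribute('href');
        }

        $articles[] = ['text' => $text, 'link' => $link];
    }

    return $articles;
}
```

Because the parser works on elements rather than lines, a paragraph that spans several lines in the file is handled the same way as a single-line one. The text and link would still need escaping (for example with htmlspecialchars()) before being written into the RSS output.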

+0

Why the downvote? – 2009-10-24 22:58:16

Answers

0

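The problem turned out to be that I never reset $newArticle to false after writing an item to the XML output. Once $newArticle had been set to true (when the first </p> was found), it stayed true, so at most one line was read before the article was output. A paragraph spanning several lines was therefore cut off after its first line, and the line containing the link never reached findLink(). Setting $newArticle to false after writing the output makes the program keep adding lines to the article until it encounters </p>.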
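A minimal sketch of the corrected loop follows. It is not the full script: collectParagraphs() is a hypothetical name, it takes the lines as an array instead of reading the file, and it returns the raw article strings instead of building the RSS output. It also cuts at the position of </p> itself rather than one character before it, which is a deliberate difference from the original code (the original drops the last character of each article).

```php
<?php
// Sketch only: the question's line-by-line accumulation with the missing reset added.
// collectParagraphs() is a hypothetical helper, not part of the original script.
function collectParagraphs(array $lines): array
{
    $articles = [];
    $article = '';
    $skip = true;        // true while we are outside a <p>...</p> block
    $newArticle = false; // true only on the line where </p> was seen

    foreach ($lines as $line) {
        // Opening tag: drop everything up to and including the end of the <p ...> tag.
        $open = strpos($line, '<p');
        if ($open !== false) {
            $close = strpos($line, '>', $open);
            $line = substr($line, $close + 1);
            $skip = false;
        }

        // Closing tag: keep the text before </p> and mark the article as complete.
        $end = strpos($line, '</p>');
        if ($end !== false) {
            $line = substr($line, 0, $end);
            $newArticle = true;
        }

        if (!$skip) {
            $article .= $line;
        }

        if ($newArticle) {
            // Ignore fillers such as <p>&nbsp;</p>.
            if (strlen($article) > 10) {
                $articles[] = $article;
            }
            $article = '';
            $skip = true;
            $newArticle = false; // the reset that was missing in the original
        }
    }

    return $articles;
}
```

With the reset in place, the three-line paragraph from the test file is accumulated as one article, so the <a href="..."> line is part of the string that is later passed to findLink().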

1

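I did some testing and found that it works correctly when every paragraph in the input file is on a single line, as in the example below. (Except that it reads the opening quote as part of the URL, but that is easy to fix.)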

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> <a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a></p> 
</body> 
</html> 
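One possible way to fix the stray opening quote is to match the quoted href value with a regular expression instead of using fixed offsets. This is only a sketch: findQuotedLink() is a hypothetical replacement for findLink(), the $default parameter is an addition for illustration, and it assumes the href value is enclosed in single or double quotes.

```php
<?php
// Sketch only: return the href value of the first <a> tag without its quotes.
// findQuotedLink() is a hypothetical name; the original function is findLink().
function findQuotedLink(string $input, string $default = 'http://www.wiggle100.com/news.php'): string
{
    // Group 1 captures the quote character, group 2 the URL up to the matching quote.
    if (preg_match('/<a\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/is', $input, $m)) {
        return $m[2];
    }

    // No quoted href found: fall back to the default link.
    return $default;
}
```

For the last paragraph of the test file this returns http://www.thedailyreview.com/news/ with neither the leading quote nor a truncated trailing slash.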
+0

Thanks. This helped me find the problem. – 2009-10-24 23:00:36