2010-04-29

DOMDocument class unable to access DOMNode

I can't parse this URL: http://foldmunka.net

$ch = curl_init("http://foldmunka.net"); 

//curl_setopt($ch, CURLOPT_NOBODY, true); 
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); 
//curl_setopt($ch, CURLOPT_HEADER, true); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //not necessary unless the file redirects (like the PHP example we're using here) 
$data = curl_exec($ch); 
$info = curl_getinfo($ch); 
curl_close($ch); 
clearstatcache(); 
if ($data === false) { 
    echo 'cURL failed'; 
    exit; 
} 
$dom = new DOMDocument(); 
$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8"); 
$data = preg_replace('/<\!\-\-\[if(.*)\]>/', '', $data); 
$data = str_replace('<![endif]-->', '', $data); 
$data = str_replace('<!--', '', $data); 
$data = str_replace('-->', '', $data); 
$data = preg_replace('@<script[^>]*?>.*?</script>@si', '', $data); 
$data = preg_replace('@<style[^>]*?>.*?</style>@si', '', $data); 

$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8"); 
@$dom->loadHTML($data); 

$els = $dom->getElementsByTagName('*'); 
foreach($els as $el){ 
    print $el->nodeName." | ".$el->getAttribute('content')."<hr />"; 
    if($el->getAttribute('title'))$el->nodeValue = $el->getAttribute('title')." ".$el->nodeValue; 
    if($el->getAttribute('alt'))$el->nodeValue = $el->getAttribute('alt')." ".$el->nodeValue; 
    print $el->nodeName." | ".$el->nodeValue."<hr />"; 
} 

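I need the alt and title attributes and the plain text, in document order, but on this page I cannot access the nodes inside the body tag.

Answers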

1

Here is a solution using DOMDocument and DOMXPath. It is much shorter than the other solution, which uses Simple HTML DOM Parser, and it runs faster (~100ms versus ~2300ms).

<?php 

function makePlainText($source)
{
    $dom = new DOMDocument();
    $dom->loadHTMLFile($source);

    // use this instead of loadHTMLFile() to load from a string:
    //$dom->loadHTML('<html><title>Hello</title><body>Hello this site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click</a> Some text.</body></html>');

    $xpath = new DOMXPath($dom);

    $plain = '';

    // select text nodes, anchors and images in a single query
    foreach ($xpath->query('//text()|//a|//img') as $node)
    {
        // skip CDATA sections: text() matches them and they are DOMText instances too
        if ($node->nodeName == '#cdata-section')
        {
            continue;
        }

        // for <a> and <img>, emit the alt attribute first, then the title
        if ($node instanceof DOMElement)
        {
            if ($node->hasAttribute('alt'))
            {
                $plain .= $node->getAttribute('alt') . ' ';
            }
            if ($node->hasAttribute('title'))
            {
                $plain .= $node->getAttribute('title') . ' ';
            }
        }

        // plain text nodes are appended as they are
        if ($node instanceof DOMText)
        {
            $plain .= $node->textContent . ' ';
        }
    }

    return $plain;
}

echo makePlainText('http://foldmunka.net'); 
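
As a hedged variation on the function above, the sketch below parses an HTML string rather than a URL, and collects libxml's parse warnings with libxml_use_internal_errors() instead of silencing them with @ as the question's code does. The name plainTextFromHtml and the whitespace normalization are assumptions added for illustration; they are not part of the original answer.

```php
<?php

// Hypothetical helper: same XPath approach as makePlainText(), but for a string.
function plainTextFromHtml(string $html): string
{
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $parts = [];

    foreach ($xpath->query('//text()|//a|//img') as $node) {
        if ($node instanceof DOMElement) {
            // alt first, then title, as requested in the question
            foreach (['alt', 'title'] as $attribute) {
                if ($node->hasAttribute($attribute)) {
                    $parts[] = $node->getAttribute($attribute);
                }
            }
        } elseif ($node instanceof DOMText && $node->nodeName !== '#cdata-section') {
            $parts[] = $node->textContent;
        }
    }

    // Assumption: collapse whitespace runs so the result is a single line.
    return trim(preg_replace('/\s+/', ' ', implode(' ', $parts)));
}

$sample = '<html><title>Hello</title><body>Hello this site'
    . '<img src="asdasd.jpg" alt="alt attr" title="title attr">'
    . '<a href="open.php" alt="alt attr" title="title attr">click</a>'
    . ' Some text.</body></html>';

echo plainTextFromHtml($sample), PHP_EOL;
```

To use it with the question's code, the string returned by curl_exec() could be passed in directly, which should make most of the manual str_replace() and preg_replace() clean-up unnecessary.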

If anyone knows how to filter out the cdata-section nodes with the XPath query itself, please leave a comment. – 2010-11-20 00:18:46

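@styu I would look into your request, but I don't understand the OP's question. You could try passing the 'LIBXML_NOCDATA' option to the 'load' call. Since the fetched page is valid XHTML, you might also want to use the XML parser instead of the HTML parser. – Gordon 2010-11-20 09:53:03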

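@Gordon: turbod clarified in Pekka's answer (http://stackoverflow.com/questions/2735291/domdocument-class-unable-access-domnode/2735318#2735318) that he wants to make a plain-text version of the site, including the 'title' and 'alt' attributes of the 'a' and 'img' tags. As far as I can see, it doesn't work as expected with 'load()', but I don't know why (in that case it doesn't extract the attributes). – 2010-11-20 14:08:59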
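
On the question of filtering with XPath alone: in the XPath 1.0 data model a CDATA section is just text, so text() cannot tell the two apart. One possible workaround, assuming the unwanted nodes are the contents of script and style elements, is to exclude text nodes by their ancestors. The function name visibleText and the sample markup are made up for this sketch.

```php
<?php

// Hypothetical helper: returns the text outside of <script> and <style>.
function visibleText(string $html): string
{
    // Keep libxml parse warnings out of the output.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Text nodes that have no <script> or <style> ancestor.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    return implode(' ', $parts);
}

$html = '<html><head><title>Hello</title><style>p { color: red; }</style></head>'
    . '<body><p>Hello this site</p><script>var x = 1;</script></body></html>';

echo visibleText($html), PHP_EOL;
```

With a predicate like this, the nodeName check for '#cdata-section' inside the loop may no longer be needed, although that has not been verified against the page from the question.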

1

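I'm not sure I get what this script does. The replace operations look like an attempt at sanitation, but I don't know what for if you are just extracting some parts of the code. Have you tried Simple HTML DOM Browser? It may handle the parsing part more easily. Take a look at the examples.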

I need the plain text and the alt and title attributes. For example: Hello Hello this site alt attrclick Some text. I need this output: Hello Hello this site alt attr title attr alt attr title attr click Some text. – turbod 2010-04-29 07:15:18

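@turbod Simple HTML DOM Browser can do both. The plain text should be something like '$html->find("body", 0)->plaintext'; look at the examples on the site to see how to walk through the list of all tags to get their 'alt' and 'title' attributes. – 2010-04-29 07:16:42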

I have read the examples now, but I can't find out how to do it. I need the plain text and the alt and title attributes at the same time. – turbod 2010-04-29 07:28:21

1

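Here is a Simple Html DOM Parser solution, for comparison only. Its output is similar to that of the DomDocument solution, but this one is more complicated and runs much slower (~2300ms versus ~100ms for DomDocument), so I don't recommend using it:

Updated to work with <img> elements inside <a> elements.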

<?php 
require_once('simple_html_dom.php'); 
// we need this global variable because Simple Html DOM Parser's callback
// handler doesn't handle arguments
$processed_plain_text = '';

define('LOAD_FROM_URL', 'loadfromurl'); 
define('LOAD_FROM_STRING', 'loadfromstring'); 

function callback_cleanNestedAnchorContent($element) 
{ 
    if ($element->tag == 'a') 
     $element->innertext = makePlainText($element->innertext, LOAD_FROM_STRING); 
} 

function callback_buildPlainText($element) 
{ 
    global $processed_plain_text; 

    $excluded_tags = array('script', 'style'); 

    switch ($element->tag) 
    { 
     case 'text': 
      // skip 'text' nodes that are children of 'a', because the anchor
      // tags are processed separately, together with their required
      // attributes, in the 'a' case below;
      // also skip text inside other unnecessary tags
      if (($element->parent->tag != 'a') && !in_array($element->parent->tag, $excluded_tags)) 
       $processed_plain_text .= $element->innertext . ' '; 
      break; 
     case 'img': 
      $processed_plain_text .= $element->alt . ' '; 
      $processed_plain_text .= $element->title . ' '; 
      break; 
     case 'a': 
      $processed_plain_text .= $element->alt . ' '; 
      $processed_plain_text .= $element->title . ' '; 
      $processed_plain_text .= $element->innertext . ' '; 
      break; 
    } 
} 

function makePlainText($source, $mode = LOAD_FROM_URL) 
{ 
    global $processed_plain_text; 

    if ($mode == LOAD_FROM_URL) 
     $html = file_get_html($source); 
    elseif ($mode == LOAD_FROM_STRING) 
     $html = str_get_html($source); 
    else 
     return 'Wrong mode defined in makePlainText: ' . $mode; 

    $html->set_callback('callback_cleanNestedAnchorContent'); 

    // processing with the first callback to clean up the anchor tags 
    $html = str_get_html($html->save()); 
    $html->set_callback('callback_buildPlainText'); 

    // processing with the second callback to build the full plain text with 
    // the required attributes of the 'img' and 'a' tags, and excluding the 
    // unnecessary ones like script and style tags 
    $html->save(); 

    $return = $processed_plain_text; 

    // cleaning the global variable 
    $processed_plain_text = ''; 

    return $return; 
} 

//$html = '<html><title>Hello</title><body>Hello <span>this</span> site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click <span><strong>HERE</strong></span><img src="image.jpg" title="IMAGE TITLE INSIDE ANCHOR" alt="ALTINACNHOR"></a> Some text.</body></html>'; 

echo makePlainText('http://foldmunka.net'); 
//echo makePlainText($html, LOAD_FROM_STRING);