Extracting certain parts of HTML from PHP

OK, so I'm writing an application in PHP that checks whether all the links on my website are valid, so that I can update them when necessary.

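I've run into a problem. I tried using SimpleXml and DOMDocument objects to extract the tags, but when I run the application against a sample site, I usually get a large number of errors if I use the SimpleXml object type.

So is there a way to scan an HTML document for href attributes that is about as simple as using SimpleXml?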

<?php 
    // what I want to do is get a similar effect to the code described below: 

    foreach ($html->html->body->a as $link) 
    { 
        // store the $link into a file 
        // (no semicolon after the foreach header, otherwise the block below runs only once) 
        foreach ($link->attributes() as $attribute => $value) 
        { 
            // procedure to place the href value into a file 
        } 
    } 
?>

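So basically I'm looking for a way to perform the operation above. The thing is, I'm getting increasingly confused about how I should treat the string I get that contains the HTML code...

Just to be clear, I'm fetching the HTML file in the following primitive way: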

<?php 
$target  = "http://www.targeturl.com"; 

$file_handle = fopen($target, "r"); 

$a = ""; 

while (!feof($file_handle)) $a .= fgets($file_handle, 4096); 

fclose($file_handle); 
?>

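Any information would be useful, as would alternatives in other languages where the above problem is solved more elegantly (Python, C or C++).

Source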

2012-03-16 Toni Kostelac

You can use DOMDocument::loadHTML
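For the narrow goal in the question (collecting href values), a minimal sketch along these lines may be enough. It assumes the page has already been fetched into a string; the sample markup and the variable names ($html, $hrefs) are made up for illustration. libxml_use_internal_errors() is used here to keep warnings about sloppy markup quiet.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><p>Unclosed paragraph
<a href="http://example.com/a">A</a>
<a name="anchor-only">no href</a>
<a href="/relative/b">B</a></body></html>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // Skip anchors that carry no href attribute.
    if ($anchor->hasAttribute('href')) {
        $hrefs[] = $anchor->getAttribute('href');
    }
}

print_r($hrefs);
```

Relative URLs come back exactly as written in the page, so a link checker would still need to resolve them against the site's base URL before requesting them.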

Here's a bunch of code we use in an HTML parsing tool we wrote.

$target = "http://www.targeturl.com"; 
$result = file_get_contents($target); 
$dom = new DOMDocument; 
$dom->preserveWhiteSpace = false; 
@$dom->loadHTML($result); 

$links = extractLink(getTags($dom, 'a')); 

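function extractLink($html, $argument = 1) { 
    // getTags() returns an array of HTML fragments by default; join them so 
    // that preg_match_all() receives a single string 
    if (is_array($html)) { 
        $html = implode("\n", $html); 
    } 

    $href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si'; 

    preg_match_all($href_regex_pattern, $html, $matches); 

    if (count($matches)) { 
        // first captured value of the requested group (1 = href, 2 = link text) 
        if (is_array($matches[$argument]) && count($matches[$argument])) { 
            return $matches[$argument][0]; 
        } 

        return $matches[1]; 
    } else { 
        // the original snippet breaks off here; returning false is an assumption 
        return false; 
    } 
} 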

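function getTags($dom, $tagName, $element = false, $children = false) { 
    // must start as an array: appending with [] to a string fails on current PHP versions 
    $html = []; 
    $domxpath = new DOMXPath($dom); 

    // optional child path appended to the XPath expression, e.g. "//a/img" 
    $children = ($children) ? "/" . $children : ''; 
    $filtered = $domxpath->query("//$tagName" . $children); 

    $i = 0; 
    while ($myItem = $filtered->item($i++)) { 
        // copy each matching node into its own document so it can be serialized alone 
        $newDom = new DOMDocument; 
        $newDom->formatOutput = true; 

        $node = $newDom->importNode($myItem, true); 

        $newDom->appendChild($node); 
        $html[] = $newDom->saveHTML(); 
    } 

    // return a single fragment when an index is requested, otherwise all of them 
    if ($element !== false && isset($html[$element])) { 
        return $html[$element]; 
    } else { 
        return $html; 
    } 
}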

Source

2012-03-16 22:43:34

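Pretty neat stuff. In my opinion the solution sonassi provided above is a more elegant way to solve the problem. Thanks, I'll definitely give this a shot. I'll need to look a few things up, but I don't think that will be a problem now that I see what I need to look for. – 2012-03-16 23:40:36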

DOMDocument and DOMXPath are brilliant, and very forgiving even with bad/broken HTML. There's loads you can do with them :) – 2012-03-16 23:43:31

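Yes, I've just started experimenting with DOMXPath and it seems interesting. However, I need more in-depth documentation than what php.net provides; the examples there aren't as rich as I'd hoped. – 2012-03-17 00:06:39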
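As a small illustration of the kind of DOMXPath query discussed in these comments, here is a hedged sketch. The markup, the element ids ("nav", "content") and the variable names are invented for the example; only DOMDocument, DOMXPath and their standard methods are assumed.

```php
<?php
// Hypothetical fragment; loadHTML() wraps it in <html><body> on its own.
$html = '<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="content"><a href="http://example.com/x">X</a><a>no href</a></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Every href attribute on any <a> element, anywhere in the document.
$allHrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $allHrefs[] = $attr->nodeValue;
}

// Only links inside the element whose id is "content".
$contentHrefs = [];
foreach ($xpath->query('//div[@id="content"]//a[@href]') as $anchor) {
    $contentHrefs[] = $anchor->getAttribute('href');
}

print_r($allHrefs);
print_r($contentHrefs);
```

The first query selects attribute nodes directly, so anchors without an href drop out automatically; the second selects the elements and filters with the [@href] predicate, which is handy when the link text is needed as well.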

You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php

Source
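A rough sketch of what that string-scanning approach could look like follows. The function name findHrefs() is hypothetical, stripos() is swapped in for strpos() so that HREF= in upper case is also found, and only quoted attribute values are handled. Unlike a DOM parser, this will also pick up href= inside comments, scripts or <link> tags.

```php
<?php
// Hypothetical helper: scan a string for href="..." or href='...' values.
function findHrefs(string $html): array
{
    $hrefs  = [];
    $needle = 'href=';
    $offset = 0;

    // stripos() is the case-insensitive variant of strpos().
    while (($pos = stripos($html, $needle, $offset)) !== false) {
        $start  = $pos + strlen($needle); // index of the opening quote, if any
        $offset = $start;                 // always move forward to avoid looping forever

        if ($start >= strlen($html)) {
            break;                        // "href=" was the very end of the input
        }

        $quote = $html[$start];
        if ($quote !== '"' && $quote !== "'") {
            continue;                     // unquoted value: skipped in this sketch
        }

        $end = strpos($html, $quote, $start + 1);
        if ($end === false) {
            break;                        // no closing quote, stop scanning
        }

        $hrefs[] = substr($html, $start + 1, $end - $start - 1);
        $offset  = $end + 1;
    }

    return $hrefs;
}

$found = findHrefs('<a href="a.php">x</a> <A HREF=\'b.html\'>y</A> <a href=c>z</a>');
print_r($found);
```

The unquoted href=c in the sample input is deliberately ignored, which shows one of the gaps that makes the DOM-based approach more dependable for real pages.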

2012-03-16 22:37:26 PhpXp

I'll have to work on that :) – 2012-03-16 23:41:53
