Simple HTML DOM: parse only the titles and links of results that have a PDF link

I am trying to parse some HTML pages like this:

<div class="gs_r"><h3 class="gs_rt"><span class="gs_ctc">[BOOK]</span> <a href="http://exampleA.com" onmousedown="return scife_clk(this.href,'','res','1')">titleA</a></h3><div class="gs_ggs gs_fl"><a href="http://exampleApdf.pdf" onmousedown="return scife_clk(this.href,'gga','gga','1')"> 
<div class="gs_r"><h3 class="gs_rt"><span class="gs_ctc">[BOOK]</span> <a href="http://exampleB.com" onmousedown="return scife_clk(this.href,'','res','1')">titleB</a></h3><div class="gs_ggs gs_fl"><a href="http://exampleB.doc" onmousedown="return scife_clk(this.href,'gga','gga','1')">

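From this HTML page we can get the following information: page links (http://exampleA.com, http://exampleB.com), titles (titleA, titleB), and document links (http://exampleApdf.pdf, http://exampleB.doc). However, I only want the results that have a PDF link. So from this example I only want to get: http://exampleA.com, titleA, http://exampleApdf.pdf. I tried, but it gives me a blank result. How can I get them? Thanks! :) Here is the code: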

<?php 

include 'simple_html_dom.php'; 
$url = 'http://scholar.google.com/scholar?hl=en&q=data+mining&btnG=&as_sdt=1%2C5&as_sdtp='; 
$html = file_get_html($url); 
foreach ($html->find('div[class=gs_ggs gs_fl]') as $pdfLink) {
    if (preg_match('/\.pdf$/i', $pdfLink)) {
        $html2->find('span[class=gs_ctc]');
        echo $html2.$pdfLink;
    }
}

?>

Source

2012-07-18 bruine

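You cannot tell from a URL what kind of resource it will return.

Not everyone serves PDF files with a .pdf extension, and not every web service exposes the name of the file on disk. Only the Content-Type HTTP response header should be used to determine the type of a resource.

For each URL you find, you can check this efficiently by doing a HEAD request.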
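
As a rough illustration of the HEAD-request idea, here is a minimal sketch using PHP's cURL extension. The function names `fetchContentType` and `isPdfContentType` and the 10-second default timeout are assumptions made for this example, not part of the answer. It also assumes the server answers HEAD requests and labels PDFs as `application/pdf`; some servers do neither, so treat a negative result with care.

```php
<?php
// Hypothetical helper: returns the Content-Type reported for $url,
// or null if the request fails or no Content-Type header is sent.
function fetchContentType(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,            // send HEAD instead of GET
        CURLOPT_RETURNTRANSFER => true,            // do not echo the response
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects to the final resource
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up on slow connections
        CURLOPT_TIMEOUT        => $timeoutSeconds, // overall limit for the request
    ]);
    $ok   = curl_exec($ch);
    $type = ($ok !== false) ? curl_getinfo($ch, CURLINFO_CONTENT_TYPE) : null;
    return is_string($type) ? $type : null;
}

// Hypothetical helper: true when a Content-Type value denotes a PDF.
function isPdfContentType(?string $contentType): bool
{
    if ($contentType === null) {
        return false;
    }
    // Drop parameters such as "; charset=binary" before comparing.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return $mediaType === 'application/pdf';
}
```

A call such as `isPdfContentType(fetchContentType('http://exampleApdf.pdf'))` would then decide whether to keep a link, regardless of its file extension.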

Source

2012-07-18 01:19:04 Brad

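Oh yes, thank you, I will look into it. But is it OK to combine curl and simple_html_dom at the same time? I need to get the link and title information too.. – bruine 2012-07-18 01:27:49

@igos, yes, absolutely. Remember to set a timeout on your cURL requests. – Brad 2012-07-18 02:09:00

OK, thanks again.. I will try it! – bruine 2012-07-18 02:15:29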
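
To make the combination discussed in these comments concrete, here is a hedged sketch of the parsing side with Simple HTML DOM. It assumes, based on the HTML excerpt in the question, that each result is a `div.gs_r` containing the title link in `h3.gs_rt` and the document link in `div.gs_ggs`. The function name `collectPdfResults` and the `$isPdf` callback are hypothetical; the callback is where a cURL HEAD check with a timeout, as suggested in the answer, could be plugged in.

```php
<?php
// Requires the Simple HTML DOM library, which provides file_get_html():
// include 'simple_html_dom.php';

/**
 * Hypothetical helper: collect [pageUrl, title, docUrl] for every result
 * whose document link is accepted by $isPdf.
 * $html is a document parsed by Simple HTML DOM.
 */
function collectPdfResults($html, callable $isPdf): array
{
    $rows = [];
    foreach ($html->find('div.gs_r') as $result) {
        $titleLink = $result->find('h3.gs_rt a', 0);   // page link; its text is the title
        $docLink   = $result->find('div.gs_ggs a', 0); // document link, may be missing
        if (!$titleLink || !$docLink) {
            continue; // skip results without both links
        }
        if ($isPdf($docLink->href)) {
            $rows[] = [$titleLink->href, $titleLink->plaintext, $docLink->href];
        }
    }
    return $rows;
}
```

With the library included, `collectPdfResults(file_get_html($url), $isPdf)` would return one row per matching result. Iterating over whole results rather than over the document links alone keeps each title paired with its own document link, and testing `$docLink->href` instead of the element itself avoids matching against the element's full HTML.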
