Selective data extraction from an external website using a DOM PHP web crawler

I have this PHP DOM web crawler and it works fine. It extracts the specified tags, along with their links, from an (external) forum site into my page.

But recently I ran into a problem.

This is the HTML of the forum data:

<tbody> 
<tr> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837880.php" target="_top" class="Links2">Hispanic Study Partner</a> - dreamer1984</td> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="14%" height="25" class="FootNotes2" align="center">02/28/17 01:42</td> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="9%" height="25" align="Center" class="FootNotes2">200</td> 
</tr> 
<tr> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="64%" height="25" class="FootNotes2"><a href="/files/forum/2017/1/837879.php" target="_top" class="Links2">nbme</a> - monariyadh</td> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="14%" height="25" class="FootNotes2" align="center">02/27/17 23:12</td> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="8%" height="25" align="Center" class="FootNotes2">0</td> 
    <td width="1%" height="25">&nbsp;</td> 
    <td width="9%" height="25" align="Center" class="FootNotes2">108</td> 
</tr> 
</tbody>

Now, suppose the code above (the table data) is the only markup available on the site. If I try to extract it with a web crawler like this,

<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find('td.FootNotes2') as $element) {
    echo $element;
}
?>

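it extracts the data from every cell whose class name is "FootNotes2".

Now, what if I want to extract one specific piece of data, for example the names in the first cell of each row, such as "dreamer1984" and "monariyadh"?

And what if I want to extract data from the third cell only (skipping the rest), given that it has the same class name?

Note that I could use a regular expression like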

preg_match_all('/<td.+?FootNotes2.+?<a.+?<\/a> - (?P<name>.*?)<\/td>.+?<td.+?FootNotes2.+?(?P<date>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2})/siu', $subject, $matchs); 

foreach ($matchs['name'] as $k => $v){ 
    var_dump('name: '. $v, 'relative date: '. $matchs['date'][$k]); 
}

But I would prefer a DOM parser solution for this. Any help is appreciated.

Source

2017-03-01 harishk

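Possible duplicate of [Selective data extraction from forum site using DOM PHP web crawler](http://stackoverflow.com/questions/42511008/selective-data-extraction-from-forum-site-using-dom-php-web-crawler) –

Some text parsing will be necessary (for example, through a regular expression); I don't think you can avoid that. The best you can do is limit the regex part to the text content of the td element. – apokryfos

@harishk Check my answer. Is that what you want? –

As I said in my comment, some text processing is unavoidable, but you can get the text node associated with the td like this: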

require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.sitename.com/');
foreach ($html->find("tr") as $row) {
    // First td.FootNotes2 in this row; rows without one are skipped
    $element = $row->find('td.FootNotes2', 0);
    if ($element == null) { continue; }
    $textNode = array_filter($element->nodes, function ($n) {
        return $n->nodetype == 3; // Text node type, like in jQuery
    });

    if (!empty($textNode)) {
        $text = current($textNode);
        echo $text;
    }
}

This echoes:

- dreamer1984 
- monariyadh

That should do it.

Updated to find only the first td of each tr.
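For comparison, here is a minimal sketch of the same per-row idea using PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, so it runs without the external library. It is not the code from this answer. The function name `extractPosts`, the `$sample` variable, and the trimmed-down markup are assumptions made for illustration, and a real crawler would load the fetched page rather than a hard-coded string.

```php
<?php
// Sample markup adapted from the question (attributes trimmed for brevity).
$sample = <<<HTML
<table><tbody>
<tr>
    <td width="1%">&nbsp;</td>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837880.php" class="Links2">Hispanic Study Partner</a> - dreamer1984</td>
    <td width="1%">&nbsp;</td>
    <td class="FootNotes2" align="center">02/28/17 01:42</td>
    <td class="FootNotes2">0</td>
    <td class="FootNotes2">200</td>
</tr>
<tr>
    <td width="1%">&nbsp;</td>
    <td class="FootNotes2"><a href="/files/forum/2017/1/837879.php" class="Links2">nbme</a> - monariyadh</td>
    <td width="1%">&nbsp;</td>
    <td class="FootNotes2" align="center">02/27/17 23:12</td>
    <td class="FootNotes2">0</td>
    <td class="FootNotes2">108</td>
</tr>
</tbody></table>
HTML;

// Hypothetical helper: returns one ['name' => ..., 'date' => ...] entry per forum row.
function extractPosts(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $posts = [];
    foreach ($xpath->query('//tr') as $row) {
        $cells = $xpath->query('./td[@class="FootNotes2"]', $row);
        if ($cells->length < 2) {
            continue; // skip rows that do not look like a forum entry
        }
        // Collect only the text nodes that are direct children of the first
        // cell, which leaves out the thread title inside the <a> element.
        $name = '';
        foreach ($xpath->query('./text()', $cells->item(0)) as $textNode) {
            $name .= $textNode->nodeValue;
        }
        $posts[] = [
            'name' => trim(preg_replace('/^\s*-\s*/', '', $name)), // drop the leading " - "
            'date' => trim($cells->item(1)->textContent),
        ];
    }
    return $posts;
}

foreach (extractPosts($sample) as $post) {
    echo $post['name'], ' | ', $post['date'], "\n";
}
```

Returning an array keeps the name and the date separate, so each can be echoed wherever it is needed on the page. The length check plays the same role as the null check above: rows without the expected cells are skipped instead of raising warnings.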

Source

2017-03-01 09:03:18 apokryfos

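OK, but how do I avoid the last two items, the numbers such as "0, 200" and "0, 108"? And what if I want to echo the name (dreamer1984) and the date in different places? – harishk

@harishk Updated. Now it looks for rows and then the first 'td.FootNotes2' in each row. If you also want the third element, then also do 'find(..., 2)'. – apokryfos

Dude, it prints exactly that, but it repeatedly raises two errors, 'Warning: array_filter() expects parameter 1 to be array, null given' and 'Notice: Trying to get property of non-object', on the line '$textNode = array_filter($element->nodes, function ($n) {' ... – harishk

If you want to extract the plain text only (not the tags and what they contain):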

foreach ($html->find("td.FootNotes2") as $element) {
    $children = $element->children; // get an array of child elements
    foreach ($children as $child) {
        $child->outertext = ''; // blank out the child element so only the cell's own text is left
    }
    echo $element->innertext . "<br>";
}

Output:

- dreamer1984 
02/28/17 01:42 
0 
200 
- monariyadh 
02/27/17 23:12 
0 
108

Source

2017-03-01 10:02:22

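Yes, actually, but I only need the first two columns. – harishk

In fact, someone else's answer gave me the solution, except that it raises errors when I use it in a while loop. Please come over to the chat room and check it out: http://chat.stackoverflow.com/rooms/136942/discussion-between-harishk-and-apokryfos – harishk

Are you there, dude? – harishk

You have to use a regex either way, so there is no point in overcomplicating it: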

foreach($html->find('tr') as $tr) { 
    echo preg_replace('/.* - /', '', $tr->find('td',1)->text()) . "\n"; 
    echo $tr->find('td',3)->text() . "\n"; 
}

I really don't like apokryfos's approach; it adds a lot of confusion for no benefit.
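To see what the `preg_replace('/.* - /', '', ...)` step does on its own, here is a small sketch that applies it to plain strings. The helper name `posterName` and the second sample title are made up for illustration; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper: removes everything up to and including the last " - ".
function posterName(string $cellText): string
{
    return trim(preg_replace('/.* - /', '', $cellText));
}

// Cell text from the question's sample markup.
echo posterName('Hispanic Study Partner - dreamer1984'), "\n";

// Invented title that itself contains " - ": the greedy .* still stops
// at the last separator, so only the poster name remains.
echo posterName('Step 1 - study plan - monariyadh'), "\n";
```

Because `.*` is greedy, the pattern tolerates thread titles that contain " - ", but it would cut a user name that contains " - " itself.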

Source

2017-03-02 00:42:51 pguardiario
