2014-12-08 91 views
2

我有這個簡單的網絡爬蟲返回從谷歌搜索結果頁面中所有鏈接(標籤),但是,我的preg_match功能似乎並沒有被返回相關的鏈接我想這是在兩者之間2個字符串。我相信我的正則表達式是正確的,但我已經在其他幾個平臺上測試過了。正則表達式不與網絡爬蟲工作

foreach($html->find('a') as $element) { 

preg_match_all("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches); //attempt to retrieve the  actual link in between these strings 

echo $element->href.'<br/>'; //prints out each of the links 

} 

print_r($matches); 

這裏是我試圖檢索時像林尋找名爲John Smith

https://www.google.com/webhp?tab=ww 
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbm=isch&source=og&sa=N&tab=wi 
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wl 
https://play.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=w8 
https://www.youtube.com/results?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=w1 
https://news.google.com/nwshp?hl=en&tab=wn 
https://mail.google.com/mail/?tab=wm 
https://drive.google.com/?tab=wo 
http://www.google.com/intl/en/options/ 
https://www.google.com/calendar?tab=wc 
https://translate.google.com/?q=John+Smith&um=1&ie=UTF-8&hl=en&sa=N&tab=wT 
http://www.google.com/mobile/?hl=en&tab=wD 
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=bks&source=og&sa=N&tab=wp 
https://wallet.google.com/manage/?tab=wa 
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=shop&source=og&sa=N&tab=wf 
https://www.blogger.com/?tab=wj 
https://www.google.com/finance?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=we 
https://plus.google.com/photos?q=John+Smith&um=1&ie=UTF-8&sa=N&tab=wq 
https://www.google.com/search?q=John+Smith&um=1&ie=UTF-8&hl=en&tbo=u&tbm=vid&source=og&sa=N&tab=wv 
http://www.google.com/intl/en/options/ 
https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/search%3Fq%3DJohn%2BSmith 
http://www.google.com/preferences?hl=en 
/preferences?hl=en 
http://www.google.com/history/optout?hl=en 
/webhp?hl=en 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=isch&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAUQ_AU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=vid&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAYQ_AU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=nws&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAcQ_AU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=shop&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAgQ_AU 
https://maps.google.com/maps?q=John+Smith&um=1&ie=UTF-8&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAkQ_AU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnms&tbm=bks&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CAoQ_AU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:h&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:d&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:w&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:m&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=qdr:y&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&source=lnt&tbs=li:1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CA8QpwU 
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBQQFjAA&usg=AFQjCNFgBV3CPR5ydtty6z72kDKto_Ij7A 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:2n5isO4EbUAJ:http://en.wikipedia.org/wiki/John_Smith_(explorer)%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBcQIDAA&usg=AFQjCNGxUvb-aHUJmV-p4VbGXmUJE1nPBw 
/search?ie=UTF-8&q=related:en.wikipedia.org/wiki/John_Smith_(explorer)+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBgQHzAA 
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Early_adventures&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBoQ0gIoADAA&usg=AFQjCNFK7RzMUfQA5LZYUNaL2C_K0cEbjA 
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23In_Jamestown&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBsQ0gIoATAA&usg=AFQjCNF0pFVxwtdohofHr3bWQXJhk1XMcA 
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23New_England&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CBwQ0gIoAjAA&usg=AFQjCNE4VqtjkQwsNzO_haCNSUi-3bgTsw 
/url?q=http://en.wikipedia.org/wiki/John_Smith_(explorer)%23Death_and_burial&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB0Q0gIoAzAA&usg=AFQjCNFAr4O8yWEK93_GyyN6_srpqLaljQ 
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CB8QFjAB&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:iuJ7Uh7IOtgJ:http://www.apva.org/history/jsmith.html%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCIQIDAB&usg=AFQjCNG_keb3HZAHUteBGMb3k5GTIeVr5w 
/search?ie=UTF-8&q=related:www.apva.org/history/jsmith.html+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCMQHzAB 
/images?q=John+Smith&hl=en&sa=X&oi=image_result_group&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCUQsAQ 
/url?q=http://etc.usf.edu/clipart/200/269/smith_2.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCcQ9QEwAg&usg=AFQjCNF3B9TL94enKovOL1hlz-n0A4PXrA 
/url?q=http://www.apva.org/history/jsmith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCkQ9QEwAw&usg=AFQjCNEMx0-702N1edJVXxiS5ILRl651zw 
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CCsQ9QEwBA&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ 
/url?q=http://www.shmoop.com/jamestown/photo-john-smith.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC0Q9QEwBQ&usg=AFQjCNFvEq7Cq3P6WdNIIHpNVVuQLTMhdQ 
/url?q=http://www.wpclipart.com/American_History/settlement/John_Smith/Captain_John_Smith.png.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CC8Q9QEwBg&usg=AFQjCNGEWlYKoQUhODn-3jypeyaw4urAGw 
/url?q=http://www.web-books.com/Classics/ON/B1/B1583/07MB1583.html&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDEQ9QEwBw&usg=AFQjCNGSF2DNQHhwDTHz4ogVcLVhM5TiDQ 
/url?q=http://www.biography.com/people/john-smith-9486928&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDMQFjAI&usg=AFQjCNEdM50NAIJCmLRDMG_Ruyox4gshPQ 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:IJvKbJ_a540J:http://www.biography.com/people/john-smith-9486928%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDYQIDAI&usg=AFQjCNHnW1ezRcv8sn_Jk3GBvECp-QOCTg 
/search?ie=UTF-8&q=related:www.biography.com/people/john-smith-9486928+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDcQHzAI 
/url?q=http://johnsmithjohnsmith.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDkQFjAJ&usg=AFQjCNH9a_jF2woyDESMRrLneIIbbTeS4g 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:_KyTfWhQuFEJ:http://johnsmithjohnsmith.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CDwQIDAJ&usg=AFQjCNGX37w0NUcEFa0t04-28gLhlMVfdA 
/search?ie=UTF-8&q=related:johnsmithjohnsmith.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD0QHzAJ 
/url?q=http://www.johnsmith.co.uk/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CD8QFjAK&usg=AFQjCNHEhG7WRm1dP5c_0xqqH0P0U-9jUA 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:jPrP5TbGXhYJ:http://www.johnsmith.co.uk/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEIQIDAK&usg=AFQjCNFe-QSMSKMs8Z6mSu-oLraaeKYAug 
/search?ie=UTF-8&q=related:www.johnsmith.co.uk/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEMQHzAK 
/url?q=http://www.johnsmith.co.uk/uel&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEUQ0gIoADAK&usg=AFQjCNEk2GkTaQvtpqaaYdztlWV7iVs0Jg 
/url?q=http://www.johnsmith.co.uk/bedfordshire&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEYQ0gIoATAK&usg=AFQjCNFcOIItpAW46XRn1BwGvuG7mertRA 
/url?q=http://www.johnsmith.co.uk/aru&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEcQ0gIoAjAK&usg=AFQjCNFq68oEVG7KAAu-Mbd0ScBFOMF4MA 
/url?q=http://www.history.com/topics/john-smith&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEkQFjAL&usg=AFQjCNGytp4P2oI3szUVSzJbJ1YdOWDldw 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:5hQtC90uVmYJ:http://www.history.com/topics/john-smith%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CEwQIDAL&usg=AFQjCNERGtQrhvZLOovq8W-Mp8AXeT_W1g 
/search?ie=UTF-8&q=related:www.history.com/topics/john-smith+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE0QHzAL 
/url?q=http://johnsmithmusic.com/&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CE8QFjAM&usg=AFQjCNFlpAC8HDml6r5DpmAo4VviZ_GeMw 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:-T7dO31PjlkJ:http://johnsmithmusic.com/%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFIQIDAM&usg=AFQjCNFFeePBNGGMWPaVS9j4_niZpMVyxA 
/search?ie=UTF-8&q=related:johnsmithmusic.com/+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFMQHzAM 
/url?q=http://www.nps.gov/jame/historyculture/life-of-john-smith.htm&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFUQFjAN&usg=AFQjCNHPmqp05pAUp2yk1R9aKPqohTmWpQ 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:Q_nfCPRpnwQJ:http://www.nps.gov/jame/historyculture/life-of-john-smith.htm%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFgQIDAN&usg=AFQjCNHad3eFxSDuthM23n4FcusD5rY1uw 
/search?ie=UTF-8&q=related:www.nps.gov/jame/historyculture/life-of-john-smith.htm+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFkQHzAN 
/url?q=http://www.enchantedlearning.com/explorers/page/s/smith.shtml&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CFsQFjAO&usg=AFQjCNEWo4pji9pBq89XmlprWg2okGHl5g 
/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:zs0buZvw9N8J:http://www.enchantedlearning.com/explorers/page/s/smith.shtml%252BJohn%2BSmith%26hl%3Den%26%26ct%3Dclnk&sa=U&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF4QIDAO&usg=AFQjCNEu0cbayJymDVJ4IfbRc_NtrEtaPA 
/search?ie=UTF-8&q=related:www.enchantedlearning.com/explorers/page/s/smith.shtml+John+Smith&tbo=1&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CF8QHzAO 
/search?ie=UTF-8&q=john+smith+texture+pack&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGIQ1QIoAA 
/search?ie=UTF-8&q=john+smith+and+pocahontas&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGMQ1QIoAQ 
/search?ie=UTF-8&q=john+smith+actor&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGQQ1QIoAg 
/search?ie=UTF-8&q=john+smith+realty&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGUQ1QIoAw 
/search?ie=UTF-8&q=john+smith+doctor+who&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGYQ1QIoBA 
/search?ie=UTF-8&q=captain+john+smith&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGcQ1QIoBQ 
/search?ie=UTF-8&q=john+smith+wrestler&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGgQ1QIoBg 
/search?ie=UTF-8&q=john+smith+wrestling&revid=1367094011&sa=X&ei=9UuFVLvZJ5KLuASVi4KABQ&ved=0CGkQ1QIoBw 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=20&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=30&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=40&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=50&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=60&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=70&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=80&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=90&sa=N 
/search?q=John+Smith&ie=UTF-8&prmd=ivnsp&ei=9UuFVLvZJ5KLuASVi4KABQ&start=10&sa=N 
/advanced_search?q=John+Smith&ie=UTF-8&prmd=ivnsp 
/support/websearch/bin/answer.py?answer=134479&hl=en 
/tools/feedback/survey/html?productId=196&query=John+Smith&hl=en 
/
/intl/en/ads 
/services 
/intl/en/policies/ 
/intl/en/about.html 
array(0) { } 
+0

只是檢查你的'$ matches'內循環,並檢查它是否取得了比賽 – Ghost 2014-12-08 07:03:43

+0

'每次$ matches'被覆蓋。您應該創建一個數組,在其中**每次添加**每個項目的內容。否則,失敗總是返回一個空集,它很可能是最後一個url(即google的頁面按鈕),將會清空'$ matches'。 – 2014-12-08 07:05:30

+0

什麼是2個字符串,你想要字符串之間的文本。開始和結束? – 2014-12-08 07:09:22

回答

1

與您的代碼的問題有人相關鏈接的頁面,是什麼,每次時間試圖匹配一個元素,$matches是一個新的數組。

一個可能的解決方案:

$result = array(); 
foreach($html->find('a') as $element) { 
    preg_match_all("/url\?q=(.*?)&sa=U&ei=/", $element->href, $matches); //try to match 
    if(array_key_exists(1,$matches) && $matches[1] != "") { //if we found a match 
     $result[] = $matches[1]; //push it to $results 
    } 
} 
print_r($result);//print result 

另一種方式是當然的,試圖找到某種在生成的HTML頁面的標記。您可以通過將HTML文檔轉換爲XML然後進行分析來完成此操作。然而,這種方法的問題是,現在,然後,谷歌可以修改它的頁面佈局,因此,你需要重寫你的算法。

+0

我試着在每次迭代打印出陣列, preg_match(「/ url \?q =(。*?)&sa = U&ei = /「,$ element-> href,$ match); print_r($ match); echo $ element-> href。'
'; 但仍然返回一個空數組。我試着去找到URL的字符串之間的「/ URL] Q =」和「&SA = U&EI =」 – 2014-12-08 07:23:57

+0

好吧,我複製你的榜樣的URL列表,並將其返回的所有「/ URL] Q =數組.. 」網址... – 2014-12-08 07:31:59

+0

這就是爲什麼我這麼問是因爲通過測試正則表達式本身,它返回我想要的結果,它只是不履帶代碼中在這種情況下這樣做:( – 2014-12-08 14:35:22