Why does this recursion stop?

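I'm new to Python, and my code is below. I have a crawler that recurses on newly discovered links. After recursing on the root link, the program seems to stop after printing a few links. It should keep going for a while, but it doesn't. I'm catching and printing exceptions, but the program terminates successfully, so I don't know why it stops.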

from urllib import urlopen 
from bs4 import BeautifulSoup 

def crawl(url, seen):
    try:
        if any(url in s for s in seen):
            return 0
        html = urlopen(url).read()

        soup = BeautifulSoup(html)
        for tag in soup.findAll('a', href=True):
            str = tag['href']
            if 'http' in str:
                print tag['href']
                seen.append(str)
                print "--------------"
                crawl(str, seen)
    except Exception, e:
        print e
        return 0

def main(): 
    print "$ = " , crawl("http://news.google.ca", []) 


if __name__ == "__main__": 
    main()

Source

2012-07-28 Mike G

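Try logging the HTML you receive for each request. Maybe some sites return blank results because of a missing user agent or other missing HTTP headers? Also, an href may not contain the protocol (http or https), which means you will skip it. – Steve 2012-07-28 09:21:03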
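To illustrate the point about hrefs without a protocol, here is a minimal sketch of resolving them against the page URL instead of skipping them. It is written for Python 3 (the code in the question is Python 2), the helper name `absolute_links` and the example URLs are made up for illustration, and it is not taken from the original post.

```python
from urllib.parse import urljoin, urlparse

def absolute_links(base_url, hrefs):
    """Resolve each href against base_url and keep only http(s) results."""
    resolved = []
    for href in hrefs:
        # Relative hrefs inherit the scheme and host of the page they came from.
        full = urljoin(base_url, href)
        if urlparse(full).scheme in ("http", "https"):
            resolved.append(full)
    return resolved

# Hypothetical hrefs as they might appear on a page.
links = absolute_links(
    "http://example.com/news/index.html",
    ["/sports", "story.html", "https://other.example/a", "mailto:someone@example.com"],
)
print(links)
```

With a check like `'http' in href`, the first two hrefs would be dropped even though they point to real pages.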

try:
    if any(url in s for s in seen):
        return 0

and then

seen.append(str) 
print "--------------" 
crawl(str, seen)

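You append str to seen, and then call crawl with str and seen as arguments. Obviously your code will exit: the check at the top of crawl finds str already in seen and returns at once. You designed it that way.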
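The effect can be reproduced without any network access. The sketch below is an illustration in Python 3, not the original code: `PAGES` is a made-up link graph standing in for real pages, `fetched` is a hypothetical list recording which pages actually get "downloaded", and the guard is simplified to `url in seen`.

```python
# Hypothetical link graph standing in for real pages (no network access).
PAGES = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://d.example"],
    "http://c.example": [],
    "http://d.example": [],
}

def crawl(url, seen, fetched):
    if url in seen:            # simplified version of the guard in the question
        return
    fetched.append(url)        # stands in for urlopen(url).read()
    for link in PAGES.get(url, []):
        seen.append(link)      # the link is marked as seen before it is crawled...
        crawl(link, seen, fetched)  # ...so this call returns at the guard

fetched = []
crawl("http://a.example", [], fetched)
print(fetched)
```

Only the root page ends up in `fetched`: every recursive call is made for a link that was added to `seen` one line earlier.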

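A better approach is to crawl one page, add all the links found on it to a list of links to crawl, and then keep crawling the links in that list.

In short, you should do a breadth-first crawl rather than a depth-first one.

Something like this should work.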

from urllib import urlopen 
from bs4 import BeautifulSoup 

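def crawl(url, seen, to_crawl):
    html = urlopen(url).read()
    soup = BeautifulSoup(html)
    seen.append(url)  # mark the page as seen once it has actually been fetched
    for tag in soup.findAll('a', href=True):
        str = tag['href']
        if 'http' in str:
            # check the discovered link (str), not the current page (url)
            if str not in seen and str not in to_crawl:
                to_crawl.append(str)
                print tag['href']
                print "--------------"
    if to_crawl:  # stop once there is nothing left to crawl
        # pop(0) takes the oldest link first, which makes the crawl breadth-first
        crawl(to_crawl.pop(0), seen, to_crawl)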

def main(): 
    print "$ = " , crawl("http://news.google.ca", [], []) 


if __name__ == "__main__": 
    main()

You may want to limit the maximum depth or the maximum number of URLs it will crawl, though.
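One way such a limit might look is sketched below, in Python 3 and with a loop instead of recursion so that a long crawl cannot hit the recursion limit. The names `crawl_bfs`, `get_links` and `max_pages` are assumptions introduced here, and `GRAPH` is a made-up link graph used in place of real HTTP requests.

```python
from collections import deque

def crawl_bfs(start, get_links, max_pages=100):
    """Visit pages breadth-first, stopping after max_pages pages."""
    seen = {start}            # everything already queued or visited
    queue = deque([start])    # links waiting to be crawled, oldest first
    order = []                # pages in the order they were crawled
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical link graph; a real get_links would download and parse the page.
GRAPH = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://d.example", "http://a.example"],
    "http://c.example": ["http://d.example"],
    "http://d.example": [],
}

order = crawl_bfs("http://a.example", lambda u: GRAPH.get(u, []), max_pages=3)
print(order)
```

Passing `get_links` in as a parameter keeps the traversal separate from the fetching, so the limit can be tried out on a small graph before pointing it at a real site.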

Source

2012-07-28 09:30:32 elssar

for tag in soup.findAll('a', href=True):
    str = tag['href']
    if 'http' in str:
        print tag['href']
        seen.append(str)  # you add the newly found url to *seen*
        print "--------------"
        crawl(str, seen)  # then you try to crawl it

But at the beginning of crawl:

if any(url in s for s in seen): # you don't crawl url in *seen* 
    return 0

You should append the url when you actually crawl it, not when you discover it.
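A minimal sketch of that change, in Python 3 and with a made-up link graph (`LINKS`) instead of real requests, might look like this. The guard is simplified to `url in seen`; none of the names below come from the original code.

```python
# Hypothetical link graph standing in for real pages (no network access).
LINKS = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://d.example", "http://a.example"],
    "http://c.example": [],
    "http://d.example": [],
}

def crawl(url, seen):
    if url in seen:       # already crawled, nothing to do
        return
    seen.append(url)      # mark the page as seen at the moment it is crawled
    for link in LINKS.get(url, []):
        crawl(link, seen) # newly discovered links are not yet in seen

seen = []
crawl("http://a.example", seen)
print(seen)
```

Here every page in the graph is reached once, and the link from b back to a is skipped because a was recorded when it was crawled.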

Source

2012-07-28 09:31:44 xiaowl
