Python simple web crawler bug (crawls in an infinite loop)

2017-09-24 · 217 views

I wrote a simple crawler in Python. It seems to run fine and finds new links, but it keeps re-crawling the same links instead of downloading the new pages it finds, and it appears to crawl indefinitely even after the configured crawl depth limit is reached. I get no errors; it just runs forever. The code and a sample run are below. I am using Python 2.7 on Windows 7 64-bit.

import sys 
import time 
from bs4 import * 
import urllib2 
import re 
from urlparse import urljoin 

def crawl(url):
    url = url.strip()
    page_file_name = str(hash(url))
    page_file_name = page_file_name + ".html"
    fh_page = open(page_file_name, "w")
    fh_urls = open("urls.txt", "a")
    fh_urls.write(url + "\n")
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    html_text = str(soup)
    fh_page.write(url + "\n")
    fh_page.write(page_file_name + "\n")
    fh_page.write(html_text)
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    rs = []
    for link in links:
        try:
            #r = urllib2.urlparse.urljoin(url, link)
            r = urllib2.urlopen(link)
            r_str = str(r.geturl())
            fh_urls.write(r_str + "\n")
            #a = urllib2.urlopen(r)
            if r.headers['content-type'] == "html" and r.getcode() == 200:
                rs.append(r)
                print "Extracted link:"
                print link
                print "Extracted link final URL:"
                print r
        except urllib2.HTTPError as e:
            print "There is an error crawling links in this page:"
            print "Error Code:"
            print e.code
    return rs
    fh_page.close()
    fh_urls.close()

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Usage: python crawl.py <seed_url> <crawling_depth>"
        print "e.g: python crawl.py https://www.yahoo.com/ 5"
        exit()
    url = sys.argv[1]
    depth = sys.argv[2]
    print "Entered URL:"
    print url
    html_page = urllib2.urlopen(url)
    print "Final URL:"
    print html_page.geturl()
    print "*******************"
    url_list = [url, ]
    current_depth = 0
    while current_depth < depth:
        for link in url_list:
            new_links = crawl(link)
            for new_link in new_links:
                if new_link not in url_list:
                    url_list.append(new_link)
            time.sleep(5)
            current_depth += 1
            print current_depth

Here is what I get when I run it:

C:\Users\Hussam-Den\Desktop>python test.py https://www.yahoo.com/ 4 
Entered URL: 
https://www.yahoo.com/ 
Final URL: 
https://www.yahoo.com/ 
******************* 
1 

And here is the output file that stores the crawled URLs:

https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://www.yahoo.com/ 
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html 
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm 
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm 
https://www.oath.com/careers/work-at-oath/ 
https://help.yahoo.com/kb/account 

Any idea what is wrong?


This code is not indented correctly. – jq170727


So many indentation errors. Can you fix them and upload the code for troubleshooting? –

Answer

  1. You have a bug here: depth = sys.argv[2]. sys.argv returns a str, not an int. You should write depth = int(sys.argv[2]).
  2. Because of point 1, the condition while current_depth < depth: always evaluates to True (in Python 2, any int compares as less than any str).

Converting argv[2] to int fixes it. I think the bug is there.
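To illustrate the fix, here is a minimal, network-free sketch of the corrected loop. The crawl_to_depth helper and the toy link graph are hypothetical, standing in for the real crawl function, so the logic can be run without urllib2:

```python
def crawl_to_depth(seed_url, depth, crawl):
    # sys.argv values are strings; convert once, up front
    depth = int(depth)
    url_list = [seed_url]
    current_depth = 0
    while current_depth < depth:      # int < int, so the loop terminates
        for link in list(url_list):   # iterate over a copy while appending
            for new_link in crawl(link):
                if new_link not in url_list:
                    url_list.append(new_link)
        current_depth += 1            # increment once per full pass, not per link
    return url_list

# Toy link graph standing in for real pages (hypothetical data)
graph = {"a": ["b"], "b": ["c"], "c": []}
print(crawl_to_depth("a", "2", lambda u: graph.get(u, [])))  # prints ['a', 'b', 'c']
```

Note the increment also moved out of the inner for loop: with it inside, the depth counter advances once per crawled link rather than once per crawl pass.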


@hussam-hallak The answer above is correct. I'd also suggest looking at Python's argparse module, which gives you similar functionality: you can declare max_depth as an int and it does exactly what you need. A very useful module. –
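For reference, a small sketch of what that would look like with argparse. The argument names mirror the script's usage string but are otherwise illustrative, and parse_args is given an explicit list here instead of reading sys.argv:

```python
import argparse

parser = argparse.ArgumentParser(description="Simple web crawler")
parser.add_argument("seed_url", help="URL to start crawling from")
# type=int makes argparse do the conversion (and validation) for you
parser.add_argument("crawling_depth", type=int, help="maximum crawl depth")

# Normally parse_args() reads sys.argv; pass a list here for illustration.
args = parser.parse_args(["https://www.yahoo.com/", "5"])
print(args.crawling_depth)  # 5, already an int
```

A bad value like "abc" would make parse_args exit with a usage error instead of silently storing a string.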


More importantly: switch to Python 3. Among other things, it flags 'int-str' comparisons as errors, so _this_ problem would have been obvious. And tomorrow you'll be trying to crawl a site with a different encoding, and fighting your way through Python 2's approach to encodings. Switch today! – alexis
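For what it's worth, this is the difference the commenter is pointing at: Python 2 silently evaluates 0 < "4" as True, while Python 3 refuses the comparison outright. A tiny Python 3 sketch:

```python
# In Python 3, comparing an int with a str raises TypeError instead of
# silently evaluating to True as it does in Python 2.
try:
    result = 0 < "4"
except TypeError as exc:
    result = "TypeError: " + str(exc)
print(result)
```

Under Python 3 this prints the TypeError message, so the str-typed depth would have crashed on the first loop iteration rather than looping forever.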


@alexis Yes, Python 3 is a good choice. I don't understand people who start new projects on Py2 :) – AndMar