
How can I prevent the twisted.internet.error.ConnectionLost error when using Scrapy?

I am scraping some web pages with Scrapy and get the following error:

twisted.internet.error.ConnectionLost

My command-line output:

2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened 
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:32+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy3.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy3.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy8.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy8.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy9.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy9.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy10.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy10.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy7.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy7.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy5.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy5.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy6.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy6.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy4.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy4.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] INFO: Closing spider (finished) 
2015-05-04 18:40:35+0800 [cnproxy] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 36, 
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 36, 
'downloader/request_bytes': 8121, 
'downloader/request_count': 36, 
'downloader/request_method_count/GET': 36, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2015, 5, 4, 10, 40, 35, 608377), 
'log_count/DEBUG': 38, 
'log_count/ERROR': 12, 
'log_count/INFO': 7, 
'scheduler/dequeued': 36, 
'scheduler/dequeued/memory': 36, 
'scheduler/enqueued': 36, 
'scheduler/enqueued/memory': 36, 
'start_time': datetime.datetime(2015, 5, 4, 10, 40, 32, 624695)} 
2015-05-04 18:40:35+0800 [cnproxy] INFO: Spider closed (finished) 

settings.py

SPIDER_MODULES = ['proxy.spiders']
NEWSPIDER_MODULE = 'proxy.spiders'

DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30

ITEM_PIPELINES = {
    'proxy.pipelines.ProxyPipeline': 100,
}

CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 64
#CONCURRENT_SPIDERS = 128

LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/hadoop/modules/scrapy/myapp/proxy/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False

My spider, proxy_spider.py:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.selector import HtmlXPathSelector 
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor 
from proxy.items import ProxyItem 
import re 

class ProxycrawlerSpider(CrawlSpider): 
    name = 'cnproxy' 
    allowed_domains = ['www.cnproxy.com'] 
    start_urls = ['http://www.cnproxy.com/proxy%s.html' % i for i in range(1, 11)] 
    start_urls.append('http://www.cnproxy.com/proxyedu1.html') 
    start_urls.append('http://www.cnproxy.com/proxyedu2.html') 

    def parse_ip(self, response): 
        sel = HtmlXPathSelector(response) 
        addresses = sel.select('//tr[position()>1]/td[position()=1]').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}') 
        protocols = sel.select('//tr[position()>1]/td[position()=2]').re(r'<td>(.*)</td>') 
        locations = sel.select('//tr[position()>1]/td[position()=4]').re(r'<td>(.*)</td>') 
        # Ports are obfuscated in inline JavaScript; decode them with a
        # character-to-digit substitution map ('+' is just a separator).
        ports_re = re.compile(r'write\(":"(.*)\)') 
        raw_ports = ports_re.findall(response.body) 
        port_map = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0', 
                    'b': '5', 'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''} 
        ports = [] 
        for port in raw_ports: 
            for key, value in port_map.items(): 
                port = port.replace(key, value) 
            ports.append(port) 
        items = [] 
        for i in range(len(addresses)): 
            item = ProxyItem() 
            item['address'] = addresses[i] 
            item['protocol'] = protocols[i] 
            item['location'] = locations[i] 
            item['port'] = ports[i] 
            items.append(item) 
        return items 
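For what it's worth, the port-decoding step in the spider can be exercised on its own. The snippet below is a minimal standalone sketch of that substitution; the obfuscated string `'r+d+r+d'` is a hypothetical example of what the `write(":"…)` fragments on the page contain:

```python
# Same map the spider uses: obfuscation characters -> digits,
# with '+' acting as a separator that is simply dropped.
PORT_MAP = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0',
            'b': '5', 'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}

def decode_port(raw):
    """Decode an obfuscated port string character by character."""
    return ''.join(PORT_MAP.get(ch, ch) for ch in raw)

print(decode_port('r+d+r+d'))  # -> 8080
```

Decoding character by character avoids the repeated `str.replace` passes in the spider and is easier to reason about, but both approaches give the same result here because no digit appears as a key in the map.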

Is there anything wrong with my pipeline or settings? If not, how can I prevent the twisted.internet.error.ConnectionLost error?

I tried scrapy shell:

$scrapy shell http://www.cnproxy.com/proxy1.html 

and got the same error as in the title, yet I can open that page in my browser. I also tried other pages, such as

$scrapy shell http://stackoverflow.com 

and they all work fine.


This looks more related to Twisted than to Scrapy. – eLRuLL


Thanks, so what could be wrong with Twisted? I'm completely new to Twisted and don't know what to do. Any help would be appreciated! – April

Answer


You need to set a user-agent string. It seems some websites don't like the default one and block you when your user agent doesn't look like a browser. You can find examples of user agent strings online.

This article identifies best practices for keeping your spider from being blocked.

Open settings.py and add the following user agent:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'

You can also try a user-agent randomiser.
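A randomiser can be sketched as a small downloader middleware. The class and the agent list below are illustrative, not from the original thread; note the middleware itself needs nothing from Scrapy beyond the `process_request` hook signature:

```python
import random

# Illustrative pool of browser user-agent strings; extend as needed.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0',
]

class RandomUserAgentMiddleware(object):
    """Downloader middleware that assigns a random User-Agent per request."""

    def process_request(self, request, spider):
        # Scrapy calls this for every outgoing request; returning None
        # lets the request continue through the download chain.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```

Enable it in settings.py with something like `DOWNLOADER_MIDDLEWARES = {'proxy.middlewares.RandomUserAgentMiddleware': 400}` (the module path is an assumption about your project layout).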
