Scrapy takes 30+ minutes to execute any spider or scrapy bench

I have a project using Scrapy 1.0.3. Everything has been running fine and nothing much has changed, but the spiders now take at least 30 minutes just to execute. Here are some logs from the production environment:

0: 2015-11-13 12:00:50 INFO Log opened. 
1: 2015-11-13 12:00:50 INFO [scrapy.log] Scrapy 1.0.3.post6+g2d688cd started 
2: 2015-11-13 12:39:26 INFO [scrapy.utils.log] Scrapy 1.0.3.post6+g2d688cd started (bot: fancy) 
3: 2015-11-13 12:39:26 INFO [scrapy.utils.log] Optional features available: ssl, http11, boto 

You can see from the logs that it took ~40 minutes to even start.

From my console, if I run scrapy bench, scrapy list or scrapy check I get the same problem.

Does anyone have any ideas?

I have checked our dev and production environments and both have the same problem. I thought it might be code-related, but if it is affecting just the basic scrapy commands, I'm a bit confused as to what it could be.

Normal Python scripts execute without any problems.

Here is the traceback when I cancel the run:

^CTraceback (most recent call last): 
    File "/home/nitrous/code/trendomine/bin/scrapy", line 11, in <module> 
    sys.exit(execute()) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 142, in execute 
    cmd.crawler_process = CrawlerProcess(settings) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 209, in __init__ 
    super(CrawlerProcess, self).__init__(settings) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 115, in __init__ 
    self.spider_loader = _get_spider_loader(settings) 
     File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/crawler.py", line 296, in _get_spider_loader 
    return loader_cls.from_settings(settings.frozencopy()) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 30, in from_settings 
    return cls(settings) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 21, in __init__ 
    for module in walk_modules(name): 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules 
    submod = import_module(fullpath) 
    File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module 
    __import__(name) 
    File "/home/nitrous/code/trendomine/fancy/fancy/spiders/fancy_update_spider.py", line 11, in <module> 
    class FancyUpdateSpider(scrapy.Spider): 
    File "/home/nitrous/code/trendomine/fancy/fancy/spiders/fancy_update_spider.py", line 28, in FancyUpdateSpider 
    pg_r = requests.get(url, headers=headers) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/api.py", line 69, in get 
    return request('get', url, params=params, **kwargs) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/api.py", line 50, in request 
    response = session.request(method=method, url=url, **kwargs) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request 
    resp = self.send(prep, **send_kwargs) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send 
    r = adapter.send(request, **kwargs) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/adapters.py", line 370, in send 
    timeout=timeout 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 559, in urlopen 
    body=body, headers=headers) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 376, in _make_request 
    httplib_response = conn.getresponse(buffering=True) 
    File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse 
    response.begin() 
    File "/usr/lib/python2.7/httplib.py", line 415, in begin 
    version, status, reason = self._read_status() 
    File "/usr/lib/python2.7/httplib.py", line 371, in _read_status 
    line = self.fp.readline(_MAXLINE + 1) 
    File "/usr/lib/python2.7/socket.py", line 476, in readline 
    data = self._sock.recv(self._rbufsize) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 179, in recv 
    data = self.connection.recv(*args, **kwargs) 
    File "/home/nitrous/code/trendomine/local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1319, in recv 
    result = _lib.SSL_read(self._ssl, buf, bufsiz) 
KeyboardInterrupt 

Thanks

Could you post your 'settings.py', please? Have you tried running the scrapy commands in another folder, in particular one that is not part of a scrapy project? If you haven't, try that and see whether the problem still occurs. – Steve

Answer

The problem was that the spider was executing one huge GET request at startup to build a JSON file that was then used as start_urls. Because that request sat at class-body level, it ran as soon as the spider module was imported, which is why even the basic scrapy commands stalled. To fix it, I wrapped the logic in def start_requests(self): instead of building one huge JSON for all the requests up front, the spider now yields its requests after each page-sized JSON fetch.
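A rough sketch of the old, problematic pattern, reconstructed from the traceback above (the URL and header value are placeholders, not the original code):

import json
import requests
import scrapy

class FancyUpdateSpider(scrapy.Spider):

    name = 'fancy_update'

    # This blocking GET sits in the class body, so it runs as soon as the
    # spider module is imported by Scrapy's spider loader, before
    # scrapy crawl, list, check or bench does any real work.
    pg_r = requests.get('https://www.foo.com/api/v1/product_urls',
                        headers={'X-Api-Key': 'PLACEHOLDER'})
    start_urls = [x['link'] for x in json.loads(pg_r.content)]

    def parse(self, response):
        pass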

New code:

import scrapy
from urlparse import urljoin
import re
import json
import requests
import math
from scrapy.conf import settings

from fancy.items import FancyItem


def roundup(x):
    return int(math.ceil(x / 10.0)) * 10


class FancyUpdateSpider(scrapy.Spider):

    name = 'fancy_update'
    allowed_domains = ['foo.com']

    def start_requests(self):
        # Get URLs
        url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=foo'
        headers = {'X-Api-Key': settings['API_KEY'], 'Content-Type': 'application/json'}
        r = requests.get(url, headers=headers)
        # Get initial data
        start_urls_data = json.loads(r.content)
        # Grab the total number of products and round up to the nearest 10
        count = roundup(int(r.headers['count']))
        pages = (count / 10) + 1
        for x in start_urls_data:
            yield scrapy.Request(x["link"], dont_filter=True)

        for i in range(2, pages):
            pg_url = 'https://www.foo.com/api/v1/product_urls?q%5Bcompany_in%5D%5B%5D=Fancy&page={0}'.format(i)
            print pg_url
            pg_r = requests.get(pg_url, headers=headers)
            # Add remaining data to the JSON
            additional_start_urls_data = json.loads(pg_r.content)
            for x in additional_start_urls_data:
                yield scrapy.Request(x["link"], dont_filter=True)

    def parse(self, response):
        item = FancyItem()
        item['link'] = response.url
        item['interest'] = response.xpath("//div[@class='frm']/div[@class='figure-button']/a[contains(@class, 'fancyd_list')]/text()").extract_first()
        return item
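
With the request generation moved into start_requests, the spider module imports instantly again, so scrapy list, scrapy check and scrapy bench return right away and scrapy crawl fancy_update starts without the long delay (assuming nothing else in the project does blocking work at import time).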