
I am currently building a web application intended to display data collected by a Scrapy spider. The user makes a request, the spider crawls a website, and the data is returned to the application to be displayed. I would like to retrieve the data directly from the scraper, without relying on an intermediary .csv or .json file. In short: how do I save the data from a Scrapy crawler into a variable? Something like:

from scrapy.crawler import CrawlerProcess 
from scraper.spiders import MySpider 

url = 'www.example.com' 
spider = MySpider() 
crawler = CrawlerProcess() 
crawler.crawl(spider, start_urls=[url]) 
crawler.start() 
data = crawler.data # this bit 

Answer


It is not that easy, because Scrapy is non-blocking and works in an event loop; it uses the Twisted event loop, and the Twisted event loop is not restartable, so you cannot write crawler.start(); data = crawler.data — once crawler.start() is running, the process keeps calling the registered callbacks until the crawl finishes or is killed.
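If you only need the items after a single, blocking crawl (a plain script rather than a web server), a simpler variant of the signal-based idea below also works. A minimal sketch, assuming the MySpider and URL placeholders from the question:

from scrapy import signals 
from scrapy.crawler import CrawlerProcess 

from scraper.spiders import MySpider  # placeholder from the question 

items = [] 

def collect_item(item): 
    items.append(item) 

process = CrawlerProcess() 
crawler = process.create_crawler(MySpider) 
crawler.signals.connect(collect_item, signals.item_scraped) 
process.crawl(crawler, start_urls=['http://www.example.com']) 
process.start()  # blocks until the crawl finishes 
# `items` now holds the scraped data; note that the Twisted reactor 
# cannot be restarted, so this works only once per process. 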

These answers may be relevant:

If you use an event loop in your application (e.g. you have a Twisted or Tornado web server), then it is possible to get the data from a crawl without storing it to disk. The idea is to listen to the item_scraped signal. I am using the following helper to make it nicer:

import collections 

from twisted.internet.defer import Deferred 
from scrapy import signals 


def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs): 
    """ 
    Start a crawl and return an object (ItemCursor instance) 
    which allows to retrieve scraped items and wait for items 
    to become available. 

    Example: 

    .. code-block:: python 

        @inlineCallbacks 
        def f(): 
            runner = CrawlerRunner() 
            async_items = scrape_items(runner, my_spider) 
            while (yield async_items.fetch_next): 
                item = async_items.next_item() 
                # ... 
            # ... 

    This convoluted way to write a loop should become unnecessary 
    in Python 3.5 because of ``async for``. 
    """ 
    crawler = crawler_runner.create_crawler(crawler_or_spidercls) 
    d = crawler_runner.crawl(crawler, *args, **kwargs) 
    return ItemCursor(d, crawler) 


class ItemCursor(object): 
    def __init__(self, crawl_d, crawler): 
        self.crawl_d = crawl_d 
        self.crawler = crawler 

        # fire _on_item_scraped for every item the spider yields 
        crawler.signals.connect(self._on_item_scraped, signals.item_scraped) 

        crawl_d.addCallback(self._on_finished) 
        crawl_d.addErrback(self._on_error) 

        self.closed = False 
        self._items_available = Deferred() 
        self._items = collections.deque() 

    def _on_item_scraped(self, item): 
        self._items.append(item) 
        # wake up a waiting consumer, then arm a fresh Deferred 
        self._items_available.callback(True) 
        self._items_available = Deferred() 

    def _on_finished(self, result): 
        self.closed = True 
        self._items_available.callback(False) 

    def _on_error(self, failure): 
        self.closed = True 
        self._items_available.errback(failure) 

    @property 
    def fetch_next(self): 
        """ 
        A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to 
        asynchronously retrieve the next item, waiting for an item to be 
        crawled if necessary. Resolves to ``False`` if the crawl is finished, 
        otherwise :meth:`next_item` is guaranteed to return an item 
        (a dict or a scrapy.Item instance). 
        """ 
        if self._items: 
            # result is ready; drain queued items even if the crawl 
            # has already finished 
            d = Deferred() 
            d.callback(True) 
            return d 

        if self.closed: 
            # crawl is finished and the queue is empty 
            d = Deferred() 
            d.callback(False) 
            return d 

        # We're active, but an item is not ready yet. Return a Deferred 
        # which resolves to True if an item is scraped or to False if the 
        # crawl is stopped. 
        return self._items_available 

    def next_item(self): 
        """Get a document from the most recently fetched batch, or ``None``. 
        See :attr:`fetch_next`. 
        """ 
        if not self._items: 
            return None 
        return self._items.popleft() 

The API is inspired by motor, a MongoDB driver for asynchronous frameworks. Using scrape_items, you can get items from Twisted or Tornado callbacks as soon as they are scraped, in a way similar to how you fetch items from a MongoDB query.
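A hypothetical end-to-end sketch of driving the helper above from a plain script, using Twisted's task.react to start and stop the reactor (MySpider and the URL are placeholders from the question):

from twisted.internet import task 
from twisted.internet.defer import inlineCallbacks 
from scrapy.crawler import CrawlerRunner 
from scrapy.utils.log import configure_logging 

from scraper.spiders import MySpider  # placeholder from the question 

@inlineCallbacks 
def main(reactor): 
    configure_logging() 
    runner = CrawlerRunner() 
    # scrape_items is the helper defined above 
    async_items = scrape_items(runner, MySpider, 
                               start_urls=['http://www.example.com']) 
    while (yield async_items.fetch_next): 
        item = async_items.next_item() 
        print(item)  # the data is available here, no .csv/.json needed 

task.react(main) 

task.react starts the reactor, runs main, and shuts the reactor down when the returned Deferred fires, which sidesteps the restartability problem described above.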