
Scrapy: run multiple spiders from a main spider?

I have two spiders that need a main spider to feed them URLs and data. My approach is to use CrawlerProcess inside the main spider and pass the data on to the two worker spiders. Here is my approach:

import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):

    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None):
        # novels is the list of URLs handed over by the main spider
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            request = scrapy.Request(novel, callback=self.parseNovel)
            yield request

    def parseNovel(self, response):
        # stuff here
        pass


class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here

class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            initCrawler.fromScraper.append(novel)

        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        # starts a second crawl while "scrapy crawl main" is already running one
        process = CrawlerProcess()
        process.crawl(LightnovelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")

I run "scrapy crawl main" and get a beautiful error traceback.

The main error I can see is "twisted.internet.error.ReactorAlreadyRunning". I have no idea how to deal with it. Is there a better way to run multiple spiders from another spider, and/or how can I get rid of this error?

Answers


After some research I was able to solve this by exposing the main spider's data through the "@property" decorator, like this:

class initCrawler(scrapy.Spider):

    # stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter

and then driving the spiders sequentially with CrawlerRunner, which chains the crawls on the one reactor that is already running instead of trying to start a second one:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from spiders.lightnovel import chapterSpider, LightnovelSpider, initCrawler

configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # run the main spider first, then hand its class-level results
    # over to the two worker spiders
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(LightnovelSpider, novels=toNovel)
    reactor.stop()

crawl()
reactor.run()

Wow, I didn't know something like that could work, but I have never tried it.

What I do instead, when multiple scraping stages have to work together, is one of these two options:

Option 1 - use a database

When the scraper is supposed to run in continuous mode, re-scanning the site and so on, I simply let the scraper push its results into a database (via an item pipeline).

The follow-up spiders then pull the data they need (in your case, for example, the novel URLs) out of that same database.

Then a scheduler or cron keeps everything running, and the spiders work hand in hand.
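
As a minimal sketch of that idea, assuming MongoDB through pymongo (the database name, collection name, and connection URI below are made up for illustration): the first spider writes through an item pipeline, and the follow-up spider reads its start URLs back out of the same collection.

import pymongo
import scrapy


class MongoWriterPipeline(object):
    """Push every scraped item into a MongoDB collection."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["lightnovel"]["novels"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert on the novel URL so re-scans do not create duplicates
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item


class NovelDetailFromDbSpider(scrapy.Spider):
    """Follow-up spider that reads its start URLs from the same collection."""

    name = "novelDetailFromDb"

    def start_requests(self):
        client = pymongo.MongoClient("mongodb://localhost:27017")
        for doc in client["lightnovel"]["novels"].find({}, {"url": 1}):
            yield scrapy.Request(doc["url"], callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        pass

The pipeline still has to be enabled via ITEM_PIPELINES in settings.py, and the writing spider has to yield items that contain a "url" field for the upsert key to work.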

Option 2 - merge everything into one spider

That is what I choose when everything needs to run as a single script: I create one spider that chains the several steps together with linked requests.

import scrapy


class LightnovelSpider(scrapy.Spider):

    name = "novels"
    allowed_domains = ["readlightnovel.com"]

    # was initCrawler.start_requests
    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_novel_list)

    # a mix of initCrawler.parse and parts of LightnovelScraper.start_requests
    def parse_novel_list(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            yield scrapy.Request(novel, callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        # ... and create requests with callback=self.parse_chapters
        pass

    def parse_chapters(self, response):
        # do stuff
        pass

(The code is not tested, it just shows the basic concept.)

If things get too complicated, I pull some elements out and move them into mixin classes, as sketched below.
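
For example, a minimal sketch of that mixin idea (the class names and the selector here are invented for illustration): the chapter callbacks live in a mixin that any spider can pull in.

import scrapy


class ChapterParsingMixin(object):
    """Reusable chapter callbacks shared by several spiders."""

    def parse_chapters(self, response):
        # hypothetical selector; adjust to the real page structure
        for href in response.xpath('//a[@class="chapter"]/@href').extract():
            yield scrapy.Request(href, callback=self.parse_chapter)

    def parse_chapter(self, response):
        # do stuff
        pass


class NovelSpider(ChapterParsingMixin, scrapy.Spider):
    name = "novelsWithMixin"
    # start_requests / parse_novel_list as in Option 2; the chapter
    # logic comes from the mixin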

In your case I would most likely lean towards Option 2.