
Scrapy: run multiple spiders from a main spider?

I have two spiders that need a main spider to feed them URLs and data. My approach is to use CrawlerProcess inside the main spider and pass the data on to the two worker spiders. Here is my approach:

import scrapy
from scrapy.crawler import CrawlerProcess


class LightnovelSpider(scrapy.Spider):

    name = "novelDetail"
    allowed_domains = ["readlightnovel.com"]

    def __init__(self, novels=None):
        # novels is the list of URLs handed over by the main spider
        self.novels = novels or []

    def start_requests(self):
        for novel in self.novels:
            self.logger.info(novel)
            request = scrapy.Request(novel, callback=self.parseNovel)
            yield request

    def parseNovel(self, response):
        # stuff here
        pass


class chapterSpider(scrapy.Spider):
    name = "chapters"
    # not done here

class initCrawler(scrapy.Spider):
    name = "main"
    fromMongo = {}
    toChapter = {}
    toNovel = []
    fromScraper = []

    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            initCrawler.fromScraper.append(novel)

        self.checkchanged()

    def checkchanged(self):
        # some scraped data processing here
        self.dispatchSpiders()

    def dispatchSpiders(self):
        # starts a second crawl while "scrapy crawl main" is already running one
        process = CrawlerProcess()
        process.crawl(LightnovelSpider, novels=initCrawler.toNovel)
        process.start()
        self.logger.info("Main Spider Finished")

I run "scrapy crawl main" and get a beautiful error traceback.

The main error I can see is "twisted.internet.error.ReactorAlreadyRunning". I have no idea how to deal with it. Is there a better way to run multiple spiders from another spider, and/or how can I get rid of this error?

Answers


After some research I was able to solve this by exposing the main spider's data through the "@property" decorator, like this:

class initCrawler(scrapy.Spider):

    # stuff here from question

    @property
    def getNovel(self):
        return self.toNovel

    @property
    def getChapter(self):
        return self.toChapter

and then driving the spiders sequentially with CrawlerRunner, which chains the crawls on the one reactor that is already running instead of trying to start a second one:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from spiders.lightnovel import chapterSpider, LightnovelSpider, initCrawler

configure_logging()

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # run the main spider first, then hand its class-level results
    # over to the two worker spiders
    yield runner.crawl(initCrawler)
    toNovel = initCrawler.toNovel
    toChapter = initCrawler.toChapter
    yield runner.crawl(chapterSpider, chapters=toChapter)
    yield runner.crawl(LightnovelSpider, novels=toNovel)
    reactor.stop()

crawl()
reactor.run()

Wow, I didn't know something like that could work, but I have never tried it.

What I do instead, when multiple scraping stages have to work together, is one of these two options:

Option 1 - use a database

When the scraper is supposed to run in continuous mode, re-scanning the site and so on, I simply let the scraper push its results into a database (via an item pipeline).

The follow-up spiders then pull the data they need (in your case, for example, the novel URLs) out of that same database.

Then a scheduler or cron keeps everything running, and the spiders work hand in hand.
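
As a minimal sketch of that idea, assuming MongoDB through pymongo (the database name, collection name, and connection URI below are made up for illustration): the first spider writes through an item pipeline, and the follow-up spider reads its start URLs back out of the same collection.

import pymongo
import scrapy


class MongoWriterPipeline(object):
    """Push every scraped item into a MongoDB collection."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["lightnovel"]["novels"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert on the novel URL so re-scans do not create duplicates
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item


class NovelDetailFromDbSpider(scrapy.Spider):
    """Follow-up spider that reads its start URLs from the same collection."""

    name = "novelDetailFromDb"

    def start_requests(self):
        client = pymongo.MongoClient("mongodb://localhost:27017")
        for doc in client["lightnovel"]["novels"].find({}, {"url": 1}):
            yield scrapy.Request(doc["url"], callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        pass

The pipeline still has to be enabled via ITEM_PIPELINES in settings.py, and the writing spider has to yield items that contain a "url" field for the upsert key to work.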

Option 2 - merge everything into one spider

That is what I choose when everything needs to run as a single script: I create one spider that chains the several steps together with linked requests.

import scrapy


class LightnovelSpider(scrapy.Spider):

    name = "novels"
    allowed_domains = ["readlightnovel.com"]

    # was initCrawler.start_requests
    def start_requests(self):
        urls = ['http://www.readlightnovel.com/novel-list']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_novel_list)

    # a mix of initCrawler.parse and parts of LightnovelScraper.start_requests
    def parse_novel_list(self, response):
        for novel in response.xpath('//div[@class="list-by-word-body"]/ul/li/a/@href[not(@href="#")]').extract():
            yield scrapy.Request(novel, callback=self.parse_novel)

    def parse_novel(self, response):
        # stuff here
        # ... and create requests with callback=self.parse_chapters
        pass

    def parse_chapters(self, response):
        # do stuff
        pass

(The code is not tested, it just shows the basic concept.)

If things get too complicated, I pull some elements out and move them into mixin classes, as sketched below.
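
For example, a minimal sketch of that mixin idea (the class names and the selector here are invented for illustration): the chapter callbacks live in a mixin that any spider can pull in.

import scrapy


class ChapterParsingMixin(object):
    """Reusable chapter callbacks shared by several spiders."""

    def parse_chapters(self, response):
        # hypothetical selector; adjust to the real page structure
        for href in response.xpath('//a[@class="chapter"]/@href').extract():
            yield scrapy.Request(href, callback=self.parse_chapter)

    def parse_chapter(self, response):
        # do stuff
        pass


class NovelSpider(ChapterParsingMixin, scrapy.Spider):
    name = "novelsWithMixin"
    # start_requests / parse_novel_list as in Option 2; the chapter
    # logic comes from the mixin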

In your case I would most likely lean towards Option 2.