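Scrapy: wait for some URLs to be parsed, then do something

I have a spider that needs to find product prices. The products are grouped into batches (coming from a database), and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes. So I have something like this: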

import scrapy
from datetime import datetime

# Batches is assumed to be the asker's own model class (its import is not shown).


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          # execute before the last product
                          # url is scraped, right?

    def parse(self, response):
        # ...

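The problem here is that, because of Scrapy's asynchronous nature, the second status update on the batch object is going to run too soon, right? Is there a way to group these requests together somehow and have the batch object updated when the last one has been parsed?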
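To make the timing concrete, here is a toy model in plain Python. It is not Scrapy itself: `run_engine`, the `events` log and the URLs are made up for illustration, and the model assumes an engine that first pulls requests from the generator and only handles responses afterwards. Under that simplification, the code after the inner loop runs as soon as the generator is exhausted, before any response has been parsed.

```python
from collections import deque

events = []  # chronological log of what happened


def start_requests():
    """Mimics the spider above: marks the batch DONE right after the loop."""
    events.append('batch RUNNING')
    for url in ['url-1', 'url-2', 'url-3']:
        yield url  # hand the request to the engine; nothing is downloaded yet
    events.append('batch DONE')  # runs as soon as the generator is exhausted


def parse(url):
    events.append('parsed ' + url)


def run_engine(requests):
    """Hypothetical stand-in for the engine: schedule first, parse later."""
    pending = deque(requests)  # consuming the generator runs its tail code
    while pending:
        parse(pending.popleft())


run_engine(start_requests())
print(events)
```

Real Scrapy consumes start_requests lazily and interleaves downloads, so the exact ordering differs, but the status update is still not tied to the last callback.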

Source

2017-02-14 Tony Lâmpada

I made a few adaptations to @Umair's suggestion and came up with a solution that works great for my case:

import scrapy
from datetime import datetime


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            # the counter dictionary for this batch
            counter = {'curr': 0, 'total': len(products)}
            for prod in products:
                # trick = add the counter in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        # increment counter only after the work is done
        self.increment_counter(batch, counter)

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...

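This works fine as long as all the requests yielded by start_requests have different URLs.

If there are any duplicates, Scrapy will filter them out and never call your parse method for them, so you end up with counter['curr'] < counter['total'] and the batch status stays RUNNING forever.

As it turns out, you can override Scrapy's behavior for duplicates.

First, we need to change settings.py to specify an alternative "duplicates filter" class: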
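A small plain-Python sketch of that failure mode. The `seen` set is only a hypothetical stand-in for Scrapy's duplicate filter and the URLs are made up; the point is just that a dropped request never reaches the code that increments the counter.

```python
def count_parsed(urls):
    """Return the counter after 'crawling' urls with duplicates dropped."""
    counter = {'curr': 0, 'total': len(urls)}
    seen = set()  # stand-in for the duplicate filter
    for url in urls:
        if url in seen:
            continue  # filtered: parse() is never called for this request
        seen.add(url)
        counter['curr'] += 1  # what parse() would do via increment_counter
    return counter


unique = count_parsed(['/a', '/b', '/c'])
with_dupe = count_parsed(['/a', '/b', '/a'])

# With unique URLs the counter reaches the total; with a duplicate it does not.
print(unique['curr'] == unique['total'])
print(with_dupe['curr'] == with_dupe['total'])
```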

DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'

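Then we create the MyDupeFilter class, which lets the spider know when there is a duplicate:

from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)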

Then we modify our spider so that it increments the counter when a duplicate is found:

class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)

And we are good to go.

Source

2017-02-21 20:13:30

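For this kind of thing you can use the closed signal (spider_closed in Scrapy), to which you can bind a function that runs when the spider is done crawling.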

Source

2017-02-14 13:40:57

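Interesting, I can see how these signals could be useful. In this case, though, 'closed' may not be the right one (because the spider will process multiple batches, and ideally I want to know when each batch finishes).

Here is the trick:

With each request, send batch_id, total_products_in_this_batch and processed_this_batch, and check them anywhere, in any function.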

for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check total number of products in this batch and assign to `total_products_in_this_batch`

    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(),
                             meta={'prod': prod,
                                   'batch_id': batch_id,
                                   'total_products_in_this_batch': total_products_in_this_batch,
                                   'processed_this_batch': processed_this_batch})

And anywhere in the code, for any particular batch, check if processed_this_batch == total_products_in_this_batch, and then save the batch.
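One way to sketch this check in plain Python, assuming the count is incremented when a response is handled rather than when the request is created. `BatchProgress`, `register` and `mark_processed` are hypothetical names used for illustration, not part of Scrapy.

```python
class BatchProgress:
    """Tracks how many products of each batch have been processed."""

    def __init__(self):
        self._totals = {}     # batch_id -> number of products in the batch
        self._processed = {}  # batch_id -> number of products handled so far

    def register(self, batch_id, total_products):
        self._totals[batch_id] = total_products
        self._processed[batch_id] = 0

    def mark_processed(self, batch_id):
        """Count one handled product; return True when the batch is complete."""
        self._processed[batch_id] += 1
        return self._processed[batch_id] == self._totals[batch_id]


progress = BatchProgress()
progress.register(batch_id=1, total_products=2)
progress.register(batch_id=2, total_products=1)

results = [
    progress.mark_processed(1),  # first product of batch 1
    progress.mark_processed(2),  # only product of batch 2
    progress.mark_processed(1),  # last product of batch 1
]
print(results)
```

In a spider, `mark_processed` would presumably be called from the callback with the batch_id carried in `response.meta`, and the batch saved when it returns True.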

Source

2017-02-14 19:09:39 Umair

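This does look like a good idea. Will test, thanks!

It did not work exactly as you suggested (I had to increment the counter in the 'parse' method; if I incremented it earlier, when creating the request, the batch would end up marked as done before it had actually finished). But your suggestion DID point in the right direction, so thanks a lot!

By the way, I ended up posting my complete solution as an answer to this question.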
