Retrieve the HTTP return code from Scrapy's ImagesPipeline (or MediaPipeline)

I have a working spider that scrapes image URLs and places them in the image_urls field of a scrapy.Item. I have a custom pipeline that inherits from ImagesPipeline. Some URLs return a non-200 HTTP response code (say, a 401 error). For example, in the log file I find:

WARNING:scrapy.pipelines.files:File (code: 404): Error downloading file from <GET http://a.espncdn.com/combiner/i%3Fimg%3D/i/headshots/tennis/players/full/425.png> referred in <None> 
WARNING:scrapy.pipelines.files:File (code: 307): Error downloading file from <GET http://www.fansshare.com/photos/rogerfederer/federer-roger-federer-406468306.jpg> referred in <None>

However, I cannot capture these error codes (404, 307, etc.) in the item_completed() function of my custom image pipeline:

def item_completed(self, results, item, info):

    image_paths = []
    for download_status, x in results:
        if download_status:
            image_paths.append(x['path'])
            item['images'] = image_paths  # update item image path
            item['result_download_status'] = 1
        else:
            item['result_download_status'] = 0
            # x.printDetailedTraceback()
            logging.info(repr(x))  # x is a twisted failure object

    return item

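Digging through the Scrapy source code, inside the media_downloaded() function in files.py, I found that for non-200 response codes a warning is logged (which explains the warning lines above) and then a FileException is raised.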

if response.status != 200:
    logger.warning(
        'File (code: %(status)s): Error downloading file from '
        '%(request)s referred in <%(referer)s>',
        {'status': response.status,
         'request': request, 'referer': referer},
        extra={'spider': info.spider}
    )

    raise FileException('download-error')
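
To make the problem concrete, here is a minimal sketch that runs without Scrapy. The FileException class, the simplified media_downloaded() function, and the example URL and path are stand-ins I am assuming for illustration, not Scrapy's real objects. It suggests why the status code is gone by the time the results are built: the exception carries only the string 'download-error'.

```python
class FileException(Exception):
    """Stand-in for scrapy.pipelines.files.FileException (assumed shape)."""


def media_downloaded(status):
    """Hypothetical simplification of the status check quoted above."""
    if status != 200:
        # Only a fixed string is attached; the status code is not.
        raise FileException('download-error')
    return {'url': 'http://example.com/a.png', 'path': 'full/a.png'}


def simulate_results(statuses):
    """Build (success, value) pairs shaped like item_completed()'s results."""
    results = []
    for status in statuses:
        try:
            results.append((True, media_downloaded(status)))
        except FileException as exc:
            # Scrapy would wrap this in a twisted Failure; the bare
            # exception is used here to keep the sketch dependency-free.
            results.append((False, exc))
    return results


results = simulate_results([200, 404, 307])
for ok, value in results:
    print(ok, repr(value))
```

Both failed entries look identical, so nothing in them distinguishes a 404 from a 307.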

How can I access this response code so that I can handle it in my pipeline's item_completed() function?

Source

2017-01-19 hAcKnRoCk

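If you are not familiar with asynchronous programming and Twisted callbacks and errbacks, it is easy to get confused by all the chained methods in Scrapy's media pipelines. The basic idea in your case is to override media_downloaded so that it handles non-200 responses, like this (just a quick-and-dirty PoC):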

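from scrapy.pipelines.images import ImagesPipeline


class MyPipeline(ImagesPipeline):

    def media_downloaded(self, response, request, info):
        if response.status != 200:
            # Return early, before the parent class raises FileException
            return {'url': request.url, 'status': response.status}
        return super(MyPipeline, self).media_downloaded(response, request, info)

    def item_completed(self, results, item, info):
        image_paths = []
        for download_status, x in results:
            if download_status:
                if not x.get('status', False):
                    # Successful download
                    image_paths.append(x['path'])
                else:
                    # x['status'] contains the non-200 response code
                    item['result_download_status'] = x['status']
        item['images'] = image_paths
        return item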

Source

2017-01-20 10:34:47 mizhgun

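Thanks for your answer. But in media_downloaded the status code is always 200, because it is only called when the download succeeds (I think). In fact, I tried a similar approach. I overrode file_downloaded() instead of media_downloaded(), since ImagesPipeline inherits from FilesPipeline, which defines that method. See my approach at http://pastebin.com/bpLKyWYx. However, I do not see any non-200 status codes in item_completed(). I think this is because, as I mentioned in the question, a FileException is raised when a non-200 status code occurs. – hAcKnRoCk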

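Actually, media_downloaded receives every response, not only 200 responses. In the code above we override the default media_downloaded, check whether the response is non-200 and, if so, return a dict with the response status; otherwise we call the parent method from ImagesPipeline. So the code above runs for every response, before the exception is raised. – mizhgun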

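Thanks for the guidance. I figured out that the best approach is to catch the exception and handle it, rather than calling super only for 200 responses. Your guidance was crucial for getting to the answer, but I will post my approach as a separate answer. – hAcKnRoCk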

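The right way to capture non-200 response codes seems to be to override media_downloaded, call the parent class's implementation, and catch the exception. Here is the working code: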

# Method of MyPipeline; needs `import logging` and
# `from scrapy.pipelines.files import FileException` at module level.
def media_downloaded(self, response, request, info):
    try:
        resultdict = super(MyPipeline, self).media_downloaded(response, request, info)
        resultdict['status'] = response.status
        logging.warning('No Exception : {}'.format(response.status))
        return resultdict
    except FileException as exc:
        logging.warning('Caused Exception : {} {}'.format(response.status, str(exc)))
        return {'url': request.url, 'status': response.status}
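
The pattern can be tried without Scrapy installed. The sketch below is only an illustration under assumptions: BasePipeline, FakeResponse, FakeRequest, and the local FileException are hypothetical stand-ins for Scrapy's classes, and the example path is made up. It shows the override turning a raised exception into an ordinary result dict that keeps the status code.

```python
class FileException(Exception):
    """Stand-in for scrapy.pipelines.files.FileException."""


class FakeResponse:
    """Hypothetical response object exposing only .status."""
    def __init__(self, status):
        self.status = status


class FakeRequest:
    """Hypothetical request object exposing only .url."""
    def __init__(self, url):
        self.url = url


class BasePipeline:
    """Stand-in for ImagesPipeline: raises on any non-200 response."""
    def media_downloaded(self, response, request, info):
        if response.status != 200:
            raise FileException('download-error')
        return {'url': request.url, 'path': 'full/example.jpg'}


class StatusPipeline(BasePipeline):
    def media_downloaded(self, response, request, info):
        try:
            result = super().media_downloaded(response, request, info)
        except FileException:
            # Convert the failure into a normal result that keeps the code
            return {'url': request.url, 'status': response.status}
        result['status'] = response.status
        return result


pipeline = StatusPipeline()
request = FakeRequest('http://example.com/a.jpg')
ok_result = pipeline.media_downloaded(FakeResponse(200), request, None)
bad_result = pipeline.media_downloaded(FakeResponse(404), request, None)
print(ok_result)
print(bad_result)
```

Because the exception is swallowed, the non-200 case reaches item_completed() as a successful entry whose dict has a 'status' key but no 'path' key.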

The response code can then be handled in item_completed():

def item_completed(self, results, item, info):
    image_paths = []
    for download_status, x in results:
        if not download_status:
            continue  # x is a twisted Failure here, not a dict
        status = x.get('status')
        if status is not None:
            item['result_download_status'] = status  # may be a non-200 response code
            if status == 200:
                image_paths.append(x['path'])
                item['images'] = image_paths  # update item image path
    return item
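
If one item carries several image URLs, a single result_download_status field only keeps the last code seen. As a rough sketch of one alternative, the helper below (summarize_results is a name I am making up, not part of Scrapy) collects a status per URL. It assumes the result dicts have the 'url', 'path', and 'status' keys produced by the media_downloaded() override above.

```python
def summarize_results(results):
    """Split item_completed()-style results into paths and per-URL status codes."""
    image_paths = []
    statuses = {}
    for ok, value in results:
        if not ok or not isinstance(value, dict):
            # Genuine failures (e.g. a twisted Failure) carry no dict
            continue
        status = value.get('status')
        if status is not None:
            statuses[value.get('url')] = status
        if status == 200 and 'path' in value:
            image_paths.append(value['path'])
    return image_paths, statuses


# Hypothetical sample shaped like the `results` argument
sample = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg', 'status': 200}),
    (True, {'url': 'http://example.com/b.jpg', 'status': 404}),
    (False, RuntimeError('connection lost')),
]
paths, codes = summarize_results(sample)
print(paths)
print(codes)
```

Inside item_completed() the two return values could be stored in whatever item fields suit the project.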

Source

2017-01-25 17:04:25 hAcKnRoCk
