2017-01-19 122 views
0

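Retrieve HTTP return code from Scrapy ImagesPipeline (or MediaPipeline)

I have a working spider that scrapes image URLs and places them in the image_urls field of a scrapy.Item. I have a custom pipeline that inherits from ImagesPipeline. Some URLs return a non-200 HTTP response code (a 401 error, say). For example, in the log file I find: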

WARNING:scrapy.pipelines.files:File (code: 404): Error downloading file from <GET http://a.espncdn.com/combiner/i%3Fimg%3D/i/headshots/tennis/players/full/425.png> referred in <None> 
WARNING:scrapy.pipelines.files:File (code: 307): Error downloading file from <GET http://www.fansshare.com/photos/rogerfederer/federer-roger-federer-406468306.jpg> referred in <None> 

However, I am unable to capture the error code in the item_completed() function of my custom image pipeline:

def item_completed(self, results, item, info):
    image_paths = []
    for download_status, x in results:
        if download_status:
            image_paths.append(x['path'])
            item['images'] = image_paths  # update item image path
            item['result_download_status'] = 1
        else:
            item['result_download_status'] = 0
            # x.printDetailedTraceback()
            logging.info(repr(x))  # x is a twisted failure object

    return item
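
To make the problem concrete, here is a minimal sketch of what item_completed() gets to work with. It runs without Scrapy or Twisted: FileException and FakeFailure are hypothetical stand-ins for scrapy.pipelines.files.FileException and twisted.python.failure.Failure, and split_results is a made-up helper name. The sketch assumes the failed entry wraps FileException('download-error'), as the source excerpt further down suggests.

```python
# Stand-ins only: Scrapy and Twisted are NOT imported here.
class FileException(Exception):
    """Hypothetical stand-in for scrapy.pipelines.files.FileException."""


class FakeFailure:
    """Hypothetical stand-in for twisted.python.failure.Failure.

    Like the real Failure, it exposes the wrapped exception as `.value`.
    """

    def __init__(self, exc):
        self.value = exc


def split_results(results):
    """Split item_completed()-style results into paths and error messages."""
    downloaded, errors = [], []
    for ok, value in results:
        if ok:
            # Successful entries are dicts describing the stored file.
            downloaded.append(value['path'])
        else:
            # Only the exception is available here; the HTTP status is not.
            errors.append(str(value.value))
    return downloaded, errors


results = [
    (True, {'url': 'http://example.com/a.png', 'path': 'full/a.png'}),
    (False, FakeFailure(FileException('download-error'))),
]
downloaded, errors = split_results(results)
print(downloaded)
print(errors)
```

The failed entry carries only the 'download-error' message, which is why the status code cannot be recovered at this point.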

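Searching the Scrapy source code in files.py, I found that for non-200 response codes a warning is logged (which explains the warning lines above) and then a FileException is raised: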

if response.status != 200:
    logger.warning(
        'File (code: %(status)s): Error downloading file from '
        '%(request)s referred in <%(referer)s>',
        {'status': response.status,
         'request': request, 'referer': referer},
        extra={'spider': info.spider}
    )

    raise FileException('download-error')

How can I access this response code so that I can handle it in my pipeline's item_completed() function?

Answers

1

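If you are not familiar with asynchronous programming and Twisted callbacks and errbacks, it is easy to get lost among all the chained methods in Scrapy's media pipelines. In your case, the basic idea is to override media_downloaded so that it handles non-200 responses, like this (just a quick and dirty PoC):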

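from scrapy.pipelines.images import ImagesPipeline


class MyPipeline(ImagesPipeline):

    def media_downloaded(self, response, request, info):
        if response.status != 200:
            # Return early, before the parent class raises FileException
            return {'url': request.url, 'status': response.status}
        return super(MyPipeline, self).media_downloaded(response, request, info)

    def item_completed(self, results, item, info):
        image_paths = []
        for download_status, x in results:
            if download_status:
                if not x.get('status', False):
                    # Successful download
                    image_paths.append(x['path'])
                else:
                    # x['status'] contains non-200 response code
                    item['result_download_status'] = x['status']
        item['images'] = image_paths
        return item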
+0

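Thanks for your answer. But in media_downloaded the status code is always 200, because it is only called when the download succeeds (I think). In fact, I tried a similar approach. I overrode file_downloaded() instead of media_downloaded(), since ImagesPipeline inherits from FilesPipeline, which defines this method. See my approach at http://pastebin.com/bpLKyWYx. However, I do not see non-200 status codes in item_completed(). I think this is because, as I mentioned in the question, a FileException is raised when a non-200 status code occurs. – hAcKnRoCk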

+0

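Actually, `media_downloaded` receives every response, not only 200 ones. In the code above we override the default `media_downloaded`, check whether the response is non-200 and, if so, return a dict with the response status; otherwise we call the parent method from ImagesPipeline. So the code above runs for every response, **before** the exception is raised. – mizhgun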

+0

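Thanks for the guidance. I figured out that the best way is to catch the exception and handle it, instead of skipping the call to super for non-200 responses. Your guidance was essential in getting there, but I will post my approach as a separate answer. – hAcKnRoCk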

0

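The right way to capture non-200 response codes seems to be to override media_downloaded, call the parent class's implementation, and catch the exception. Here is the working code: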

import logging
from scrapy.pipelines.files import FileException

# Method of MyPipeline (a subclass of ImagesPipeline)
def media_downloaded(self, response, request, info):
    try:
        resultdict = super(MyPipeline, self).media_downloaded(response, request, info)
        resultdict['status'] = response.status
        logging.warning('No Exception : {}'.format(response.status))
        return resultdict
    except FileException as exc:
        # The parent raised for a non-200 response; keep the status code
        logging.warning('Caused Exception : {} {}'.format(response.status, str(exc)))
        return {'url': request.url, 'status': response.status}
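
For readers who want to try the pattern without a Scrapy project, here is a minimal, self-contained sketch of the same try/except-around-super idea. Everything in it is a stand-in and not Scrapy's API: BasePipeline only mimics the "raise on non-200" behaviour quoted in the question, and FakeResponse, FakeRequest, StatusPipeline and the 'full/fake.jpg' path are hypothetical names for illustration.

```python
class FileException(Exception):
    """Hypothetical stand-in for scrapy.pipelines.files.FileException."""


class FakeResponse:
    """Hypothetical stand-in for a response; only `status` is modelled."""

    def __init__(self, status):
        self.status = status


class FakeRequest:
    """Hypothetical stand-in for a request; only `url` is modelled."""

    def __init__(self, url):
        self.url = url


class BasePipeline:
    """Stand-in parent pipeline: raises on any non-200 response."""

    def media_downloaded(self, response, request, info):
        if response.status != 200:
            raise FileException('download-error')
        return {'url': request.url, 'path': 'full/fake.jpg'}


class StatusPipeline(BasePipeline):
    """Calls the parent, then attaches the status code in both outcomes."""

    def media_downloaded(self, response, request, info):
        try:
            result = super().media_downloaded(response, request, info)
        except FileException:
            # Parent rejected the response; return the status instead of failing.
            return {'url': request.url, 'status': response.status}
        result['status'] = response.status
        return result


pipeline = StatusPipeline()
ok = pipeline.media_downloaded(
    FakeResponse(200), FakeRequest('http://example.com/a.jpg'), None)
bad = pipeline.media_downloaded(
    FakeResponse(404), FakeRequest('http://example.com/b.jpg'), None)
print(ok)
print(bad)
```

Because the override returns a dict instead of letting the exception propagate, the 404 entry reaches the caller as a normal result with no 'path' key.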

The response code can then be handled in item_completed():

def item_completed(self, results, item, info):
    image_paths = []
    for download_status, x in results:
        # Other failures still arrive as Twisted Failure objects, not dicts,
        # so check download_status before calling x.get()
        if download_status and x.get('status'):
            item['result_download_status'] = x['status']  # may be a non-200 response code
            if x['status'] == 200:
                image_paths.append(x['path'])
                item['images'] = image_paths  # update item image path
    return item
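
If you want more than a single status field on the item, one possible follow-up is to group the URLs by status code. The sketch below is an assumption-laden illustration, not part of Scrapy: group_urls_by_status is a hypothetical helper, the 'status' key is the one added by the media_downloaded override above, and a plain RuntimeError stands in for the Twisted Failure that a real failed entry would hold.

```python
from collections import defaultdict


def group_urls_by_status(results):
    """Group URLs by HTTP status; entries without a result dict go under None."""
    grouped = defaultdict(list)
    for ok, value in results:
        if ok and isinstance(value, dict):
            grouped[value.get('status')].append(value['url'])
        else:
            # Not a result dict (e.g. a network-level failure): keep its repr.
            grouped[None].append(repr(value))
    return dict(grouped)


results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg', 'status': 200}),
    (True, {'url': 'http://example.com/b.jpg', 'status': 404}),
    (False, RuntimeError('connection lost')),  # stand-in for a Twisted Failure
]
summary = group_urls_by_status(results)
print(summary)
```

Inside item_completed() the returned dict could be stored on the item in a field of your choosing.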