I have asked a similar question before (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definitive answer, I will ask it again: how can I avoid re-downloading media to S3 in Scrapy?

I have been using Scrapy's Files Pipeline to download a large number of files to an AWS S3 bucket. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recently" is, or how to set this parameter.

Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it appears that this value is obtained from the FILES_EXPIRES setting, which defaults to 90 days:

class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading
    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.
    `new` files are those that pipeline never processed and needs to be
     downloaded from supplier site the first time.
    `uptodate` files are the ones that the pipeline processed and are still
     valid files.
    `expired` files are those that pipeline already processed but the last
     modification was made long time ago, so a reprocessing is recommended to
     refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
Am I understanding this correctly? Also, I do not see a similar boolean check on age_days in the S3FilesStore class; is the age check also performed for files on S3? (I also could not find any tests exercising this age-check feature for S3.)

Answers

FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file can be before it is downloaded (again).
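
For example, the window can be widened in the project's settings.py (a minimal sketch; the number of days is just illustrative):

# settings.py
# Treat files younger than a year as "uptodate", instead of the
# default 90 days, so they are not downloaded again.
FILES_EXPIRES = 365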

The key part of the code is in media_to_download: the pipeline's _onsuccess callback checks the result of the self.store.stat_file call, and for your question in particular, it looks for the "last_modified" information. If the last-modified time is older than "expires days", the download is triggered.

You can check how the S3store gets the "last modified" information. It depends on whether botocore is available or not.
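
For reference, here is a minimal standalone sketch (not Scrapy's exact code) of how a last-modified timestamp can be read from S3 with botocore; the bucket and key names are hypothetical:

import time
import botocore.session

session = botocore.session.get_session()
client = session.create_client('s3')

# HEAD the object; 'LastModified' comes back as a datetime object.
response = client.head_object(Bucket='my-bucket', Key='files/example.pdf')
modified_stamp = time.mktime(response['LastModified'].timetuple())

# The same age arithmetic as in media_to_download above.
age_days = (time.time() - modified_stamp) / 60 / 60 / 24
print('age in days:', age_days)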

Thanks, I have verified that this also works for files in S3 storage. Perhaps it would be good to document the FILES_EXPIRES setting (e.g. at https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images)? In my case, for example, I would want it to be not 90 days but an effectively infinite value.
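
Until it is documented, one pragmatic way to approximate "never expire" is simply a very large value, since the setting is read with settings.getint and compared in days (a sketch; the exact number is arbitrary):

# settings.py
FILES_EXPIRES = 36500  # roughly a century in days: files are effectively never re-downloaded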

One answer to this is: class FilesPipeline(MediaPipeline) is the only class responsible for managing, validating, and downloading files in your local paths. class S3FilesStore(object) just gets the files from your local paths and uploads them to S3.

class FSFilesStore is the one that manages all the local paths, and FilesPipeline uses it to store your files locally.
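
In other words, which store class is used is decided by the scheme of the FILES_STORE URI, via the STORE_SCHEMES mapping and _get_store method quoted in the question; a sketch with hypothetical paths:

# settings.py
FILES_STORE = '/data/scrapy/files'       # absolute local path -> FSFilesStore (local disk)
# FILES_STORE = 's3://my-bucket/files/'  # s3 scheme -> S3FilesStore (upload to S3)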

Links:

https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L264
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L397
https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L299