I asked a similar question before (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I never received a definitive answer, I'll ask again: how do I avoid re-downloading media to S3 in Scrapy?
I have already downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "redownloading media that was downloaded recently", but it does not say how long ago "recently" is, or how to set this parameter.
Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it appears that this value is obtained from the FILES_EXPIRES setting, which defaults to 90 days:
class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading

    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.

    `new` files are those that pipeline never processed and needs to be
        downloaded from supplier site the first time.

    `uptodate` files are the ones that the pipeline processed and are still
        valid files.

    `expired` files are those that pipeline already processed but the last
        modification was made long time ago, so a reprocessing is recommended to
        refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
Am I understanding this correctly? Also, I don't see a similar age_days comparison anywhere in the S3FilesStore class; is the age check also performed for files stored on S3? (I also couldn't find any tests exercising this age-check feature against S3.)
Thanks, I have since verified that this also works for files in S3 storage. Perhaps it would be worth documenting the FILES_EXPIRES setting (e.g. at https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images)? For example, in my case I want it to be not 90 days but effectively an infinite value. –
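In case it helps others, here is a minimal settings.py sketch for that "effectively infinite" case (the bucket path and credentials are placeholders, not values from this question):

    # settings.py -- sketch; bucket path and credentials are placeholders
    ITEM_PIPELINES = {
        'scrapy.pipelines.files.FilesPipeline': 1,
    }

    FILES_STORE = 's3://my-example-bucket/files/'
    AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY'        # placeholder
    AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_KEY'    # placeholder

    # Read via settings.getint(resolve('FILES_EXPIRES'), self.EXPIRES) in
    # FilesPipeline.__init__, so any integer number of days works; a huge
    # value means files are never considered expired and re-downloaded.
    FILES_EXPIRES = 36500  # roughly 100 years

Since FILES_EXPIRES is read with settings.getint(), there is no special "never expire" sentinel; a sufficiently large integer achieves the same effect.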