Python + Scrapy: renaming downloaded images
Important note: all of the answers available on Stack Overflow at the moment target earlier versions of Scrapy and do not work with the latest version, 1.4.
Completely new to Scrapy and Python, I am trying to scrape some pages and download the images. The images do download, but they keep their original SHA-1 hashes as file names. I don't know how to rename the files; they all end up with SHA-1 file names.
I tried renaming them to "test", and when I run scrapy crawl rambopics
the output shows "test" along with the URL data, but the files are not renamed in the target folder. Here is a sample of the output:
> 2017-06-11 00:27:06 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.theurl.com/> {'image_urls':
> ['https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg'],
> 'image_name': ['test'], 'title': ['test'], 'filename': ['test'],
> 'images': [{'url':
> 'https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg',
> 'path': 'full/fcbec9bf940b48c248213abe5cd2fa1c690cb879.jpg',
> 'checksum': '7be30d939a7250cc318e6ef18a6b0981'}]}
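The full/fcbec9bf… path in that output comes from Scrapy's default naming scheme: ImagesPipeline names each file after the SHA-1 hash of its URL. A minimal sketch of that default behaviour (the same idea as the Scrapy source, not a verbatim copy; the function name is my own):

```python
import hashlib

def default_image_path(url):
    # Scrapy's ImagesPipeline names downloads after the SHA-1 hash of
    # the image URL by default, which is why the files keep hash names.
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid

print(default_image_path("https://www.example.com/some/image.jpg"))
```

So as long as the default file_path logic is in effect, any names put on the item never reach the file system.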
So far I have tried many different solutions posted on Stack Overflow; there is simply no clear answer to this question for the latest (2017) version of Scrapy, and it looks like nearly all of the suggestions are outdated. I am using Scrapy 1.4 with Python 3.6.
scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html
[settings]
default = rambopics.settings
[deploy]
#url = http://localhost:6800/
project = rambopics
items.py
import scrapy
class RambopicsItem(scrapy.Item):
    # defining items:
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()
    title = scrapy.Field()
    # pass -- I don't really understand what pass is for
settings.py
BOT_NAME = 'rambopics'
SPIDER_MODULES = ['rambopics.spiders']
NEWSPIDER_MODULE = 'rambopics.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "W:/scraped/"
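One thing worth double-checking in settings.py: a custom pipeline class only runs if ITEM_PIPELINES references it rather than the stock ImagesPipeline. A hedged example, assuming the RambopicsPipeline shown below lives in rambopics/pipelines.py:

```python
# Register the custom subclass instead of the built-in pipeline;
# otherwise get_media_requests in RambopicsPipeline is never called.
ITEM_PIPELINES = {
    'rambopics.pipelines.RambopicsPipeline': 1,
}
```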
pipelines.py
import scrapy
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class RambopicsPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # the item stores its URLs in the 'image_urls' list field
        for img_url in item['image_urls']:
            meta = {
                'filename': item['title'],
                'title': item['image_name']
            }
            yield Request(url=img_url, meta=meta)
(spider) rambopics.py
from rambopics.items import RambopicsItem
from scrapy.selector import Selector
import scrapy
class RambopicsSpider(scrapy.Spider):
    name = 'rambopics'
    allowed_domains = ['theurl.com']
    start_urls = ['http://www.theurl.com/']

    def parse(self, response):
        for sel in response.xpath('/html'):
            #img_name = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_name = 'test'
            #img_title = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_title = 'test'
            for elem in response.xpath("//div[contains(@class, 'entry-content')]"):
                img_url = elem.xpath("a/@href").extract_first()
                yield {
                    'image_urls': [img_url],
                    'image_name': [img_name],
                    'title': [img_title],
                    'filename': [img_name]
                }
Note that I don't know which meta name is the correct one to use for the final downloaded file name (I'm not sure whether it is filename, image_name, or title).
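For what it's worth, the usual approach (as I understand Scrapy 1.4's media-pipeline API) is to pass the desired name through Request.meta in get_media_requests and then override file_path, the method that produces the SHA-1 name, to read it back. The sketch below uses minimal stand-ins for the Scrapy classes so the renaming logic runs on its own; in a real project the class would subclass scrapy.pipelines.images.ImagesPipeline and Request would be scrapy.Request:

```python
class Request:  # stand-in for scrapy.Request
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class RenamingImagesPipeline:  # hypothetical name; would extend ImagesPipeline
    def get_media_requests(self, item, info=None):
        for url in item.get('image_urls', []):
            # carry the spider-supplied name along with the download request
            yield Request(url, meta={'filename': item['image_name'][0]})

    def file_path(self, request, response=None, info=None):
        # use the name from meta instead of the SHA-1 default
        return 'full/%s.jpg' % request.meta['filename']

pipeline = RenamingImagesPipeline()
item = {'image_urls': ['https://www.example.com/a.jpg'], 'image_name': ['test']}
req = next(pipeline.get_media_requests(item))
print(pipeline.file_path(req))  # → full/test.jpg
```

With this in place the meta key is whatever file_path chooses to read ('filename' here), so the ambiguity between filename, image_name, and title disappears.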
Answers to this question have already been posted on the site, here: https://stackoverflow.com/a/30002870/1675954 and here: https://stackoverflow.com/a/6196180/1675954; both are fairly comprehensive. Check that your basic settings are configured correctly; they need to be set before crawling. See https://doc.scrapy.org/en/latest/topics/api.html#scrapy.settings.BaseSettings.set –
Possible duplicate of [Renaming downloaded images in Scrapy 0.24 with content from an item field while avoiding filename conflicts?](https://stackoverflow.com/questions/29946989/renaming-downloaded-images-in-scrapy-0-24-with-content-from-an-item-field-while) –
Please explain in detail what isn't working with the other solutions. Is there a specific error? –