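Scrapy Tumblr scraper not saving images
I am using Scrapy to scrape images from Tumblr. The scraper seems able to extract the image URLs, but it does not download them.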
settings.py
BOT_NAME = 'tumblr'
SPIDER_MODULES = ['tumblr.spiders']
NEWSPIDER_MODULE = 'tumblr.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'C:\Users\123\Desktop'
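One thing that may be worth checking, independent of the spider itself: in a normal Python string literal, `\123` is an octal escape for the single character `S`, so the path above may not point at the intended folder (and Python 3 rejects `\U` in a normal string outright). A minimal sketch, assuming the intended folder really is the Desktop of a user named `123`, is to write the path as a raw string:

```python
# In a normal string literal, '\123' is an octal escape: one character, 'S'.
escaped = '\123'

# A raw string keeps every backslash literally, so the Windows path survives intact.
IMAGES_STORE = r'C:\Users\123\Desktop'

print(repr(escaped))
print(IMAGES_STORE)
```

Forward slashes or doubled backslashes would also avoid the escape problem.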
items.py
import scrapy
class TumblrItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
tumblr_spider
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from tumblr.items import TumblrItem
class TumblrSpider(CrawlSpider):
    name = 'tumblr'
    allowed_domains = ['tumblr.com']
    start_urls = ['http://free-indie-games.tumblr.com/archive']
    rules = [Rule(LinkExtractor(allow=['/post']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = TumblrItem()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image
Log (it is very long, so I will only post part of it here)
2015-10-17 17:43:59 [scrapy] DEBUG: Scraped from <200 http://free-indie-games.tumblr.com/post/63142153501>
{'image_urls': [u'http:http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png'],
'images': []}
2015-10-17 17:44:00 [scrapy] INFO: Closing spider (finished)
2015-10-17 17:44:00 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.ConnectError': 3,
'downloader/request_bytes': 8356,
'downloader/request_count': 29,
'downloader/request_method_count/GET': 29,
'downloader/response_bytes': 295766,
'downloader/response_count': 26,
'downloader/response_status_count/200': 26,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 17, 16, 44, 0, 951000),
'item_scraped_count': 25,
'log_count/DEBUG': 55,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 26,
'scheduler/dequeued': 26,
'scheduler/dequeued/memory': 26,
'scheduler/enqueued': 26,
'scheduler/enqueued/memory': 26,
'start_time': datetime.datetime(2015, 10, 17, 16, 43, 58, 83000)}
2015-10-17 17:44:00 [scrapy] INFO: Spider closed (finished)
To me it looks like it scrapes the URLs but does not download the images. At least nothing shows up on my computer.
Any ideas?
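The log shows the item carrying `http:http://38.media.tumblr.com/...`, which suggests the `src` values are already absolute and the hard-coded `'http:'` prefix produces an invalid URL that the images pipeline cannot fetch. A minimal sketch of a more forgiving approach, assuming Python 3 (on Python 2 the same function lives in the `urlparse` module), is to let `urljoin` resolve each `src` against the page URL. The helper name `absolute_image_url` is hypothetical:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an <img> src (absolute, protocol-relative, or relative) against the page URL."""
    return urljoin(page_url, src)


page = 'http://free-indie-games.tumblr.com/post/63142153501'

# Already absolute: returned unchanged, so no doubled scheme.
print(absolute_image_url(page, 'http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png'))

# Protocol-relative: the scheme is taken from the page URL.
print(absolute_image_url(page, '//38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png'))
```

Inside the spider, the equivalent line would be `image['image_urls'] = [urljoin(response.url, rel[0])]`, where `response.url` is the URL of the page being parsed.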
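The doubled 'http' does indeed look wrong. However, urljoin gave me an error: "NameError: global name 'urljoin' is not defined". I thought it was predefined? – Jomasdf
Never mind, I forgot the import. Now it sort of works. It just isn't scraping the right things. – Jomasdf
I think my XPath is wrong: I'm scraping things like avatars and some icons, but not the images in the posts. – Jomasdf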
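Because `rel[0]` keeps only the first `<img>` on the page, it tends to pick up the blog avatar. A rough sketch of one way around that, assuming avatar URLs can be recognised by the `avatar_` fragment seen in the log (an assumption, not something Tumblr guarantees), is to resolve every `src` and drop the ones that look like avatars. The helper `post_image_urls` and the second sample URL are hypothetical:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def post_image_urls(page_url, srcs, skip_marker='avatar_'):
    """Return absolute URLs for all srcs, skipping any that contain skip_marker."""
    return [urljoin(page_url, s) for s in srcs if skip_marker not in s]


page = 'http://free-indie-games.tumblr.com/post/63142153501'
srcs = [
    'http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png',  # taken from the log
    '//example.invalid/post_image.png',                        # hypothetical post image
]
print(post_image_urls(page, srcs))
```

A narrower XPath scoped to the post container would likely be cleaner, but the right selector depends on the blog theme's markup.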