
I want to use Scrapy to scrape some dynamic content, and I have successfully set up Splash to work with it. However, the selectors in the spider below yield empty results: the Scrapy selectors do not seem to work on the Splash response.

# -*- coding: utf-8 -*-

import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {
            'title': response.xpath('//title').extract(),
            'link': response.url,
            'productID': Selector(text=response.body).xpath('//span[@itemprop="productID"]/text()').extract(),
            'model': Selector(text=response.body).xpath('//span[@itemprop="model"]/text()').extract(),
            'price': Selector(text=response.body).css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract(),
        }

The selectors work just fine in the Scrapy shell, so I am quite confused about what is not working.

The only difference I can find between the two situations is how the encoding of the string response.body is handled: if I try to print/decode it inside the parse function, it is just gibberish.

Any hints or references would be greatly appreciated.


There is no need to create 'Selector(text=response.body)'; 'response' already works as a selector – eLRuLL
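For illustration, this is what the comment means inside the parse method above (a minimal sketch; the XPath is the one from the question):

    # The Splash response supports the selector shortcuts directly:
    product_ids = response.xpath('//span[@itemprop="productID"]/text()').extract()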


Also, why are you returning dictionaries instead of Scrapy Item instances? – alecxe
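For reference, a minimal sketch of the Item-based approach alecxe suggests (the class name is illustrative, not from the question; one field per key the spider currently yields):

import scrapy

class WatchItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    productID = scrapy.Field()
    model = scrapy.Field()
    price = scrapy.Field()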


@eLRuLL I thought it was necessary because the page rendered by 'splash' is contained in 'response.body', and that is a 'str'. But I think you are right. –

Answers


Your spider works fine for me, with Scrapy 1.1 and Splash 2.1, without any modification to the code from your question, just using the settings mentioned at https://github.com/scrapy-plugins/scrapy-splash
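For completeness, these are roughly the settings.py additions the scrapy-splash README describes (a sketch; check the README for the exact, current values):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'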

As the other comments suggested, your parse function can be simplified by using response.css() and response.xpath() directly, without needing to rebuild a Selector from the response body.

I tried with:

import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {
            'title': response.xpath('//title/text()').extract_first(),
            'link': response.url,
            'productID': response.xpath('//span[@itemprop="productID"]/text()').extract_first(),
            'model': response.xpath('//span[@itemprop="model"]/text()').extract_first(),
            'price': response.css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract_first(),
        }

and got this:

$ scrapy crawl cartier 
2016-06-08 17:16:08 [scrapy] INFO: Scrapy 1.1.0 started (bot: stack37701774) 
2016-06-08 17:16:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack37701774.spiders', 'SPIDER_MODULES': ['stack37701774.spiders'], 'BOT_NAME': 'stack37701774'} 
(...) 
2016-06-08 17:16:08 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy_splash.SplashCookiesMiddleware', 
'scrapy_splash.SplashMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-06-08 17:16:08 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-06-08 17:16:08 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-06-08 17:16:08 [scrapy] INFO: Spider opened 
2016-06-08 17:16:08 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-06-08 17:16:08 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-06-08 17:16:11 [scrapy] DEBUG: Crawled (200) <GET http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html via http://localhost:8050/render.html> (referer: None) 
2016-06-08 17:16:11 [scrapy] DEBUG: Scraped from <200 http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html> 
{'model': u'Ballon Bleu de Cartier watch', 'productID': u'W69017Z4', 'link': 'http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html', 'price': None, 'title': u'CRW69017Z4 - Ballon Bleu de Cartier watch - 36 mm, steel, leather - Cartier'} 
2016-06-08 17:16:11 [scrapy] INFO: Closing spider (finished) 
2016-06-08 17:16:11 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 618, 
'downloader/request_count': 1, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 213006, 
'downloader/response_count': 1, 
'downloader/response_status_count/200': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 6, 8, 15, 16, 11, 201281), 
'item_scraped_count': 1, 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'response_received_count': 1, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'splash/render.html/request_count': 1, 
'splash/render.html/response_count/200': 1, 
'start_time': datetime.datetime(2016, 6, 8, 15, 16, 8, 545105)} 
2016-06-08 17:16:11 [scrapy] INFO: Spider closed (finished) 

Thank you! Apparently it was a misconfiguration issue, and the code, ugly as it is, works fine; going through the suggested 'scrapy-splash' configuration procedure again solved the problem. –


@PaoloBrasolin what 'misconfiguration' issue was at play here? I am curious, so I can tell whether my problem is the same. –


@RafaelAlmeida In fact, I could not trace back the precise error in my previous setup: I just initialized a new scrapy project, configured it following the scrapy-splash guide, and restarted the splash server. After the fresh start everything worked. –


I tried SplashRequest and ran into the same problem. After messing around with it, I tried executing a Lua script instead.

script = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""

Then issue the request with the script as an argument. You can tinker with the script and test it in the Splash shell at localhost:9200, or whichever other port you chose.

yield SplashRequest(
    url,
    self.parse,
    args={'lua_source': self.script},
    endpoint='execute')
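With the 'execute' endpoint, the table returned by the Lua script comes back as JSON, and scrapy-splash exposes it on the response as response.data (a dict). A sketch of a matching parse method, assuming the script above and the Selector import from the question:

def parse(self, response):
    # response.data is the dict returned by the Lua script's main():
    # here it carries 'html', 'png' and 'har' keys.
    sel = Selector(text=response.data['html'])
    yield {'title': sel.xpath('//title/text()').extract_first()}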

Oh, and by the way, the way you yield the info is weird; use Items instead.


Thanks! I will wait a bit to see whether anyone explains what was not working before accepting your answer. –


I don't have enough reputation to add a comment, so I have to make this an answer.

I was facing a similar problem with Splash 2.1: it returned "malformed" (actually: just not decompressed) HTML content if I set 'Accept-Encoding': 'gzip' for the SplashRequest.

Finally, I found the solution here: https://github.com/scrapinghub/splash/pull/102. Change 'Accept-Encoding': 'gzip' to 'Accept-Encoding': 'deflate'.
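For illustration, the header change would be applied on the request like this (a sketch; the wait argument mirrors the question's code):

yield SplashRequest(
    url, self.parse,
    args={'wait': 0.5},
    # 'deflate' instead of 'gzip' avoids the issue fixed in the linked PR
    headers={'Accept-Encoding': 'deflate'})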

I don't know why, but it works.


Thanks for the input! –