
I want to use Scrapy to scrape some dynamic content, and I have successfully set up Splash to work with it. However, the selectors in the spider below yield empty results: the Scrapy selectors do not seem to work on the Splash response.

# -*- coding: utf-8 -*-

import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {
            'title': response.xpath('//title').extract(),
            'link': response.url,
            'productID': Selector(text=response.body).xpath('//span[@itemprop="productID"]/text()').extract(),
            'model': Selector(text=response.body).xpath('//span[@itemprop="model"]/text()').extract(),
            'price': Selector(text=response.body).css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract(),
        }

The selectors work just fine in the Scrapy shell, so I am quite confused about what is not working.

The only difference I can find between the two situations is how the encoding of the string response.body is handled: if I try to print/decode it inside the parse function, it is just gibberish.

Any hints or references would be greatly appreciated.


There is no need to create 'Selector(text=response.body)'; 'response' already works as a selector – eLRuLL
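For illustration, this is what the comment means inside the parse method above (a minimal sketch; the XPath is the one from the question):

    # The Splash response supports the selector shortcuts directly:
    product_ids = response.xpath('//span[@itemprop="productID"]/text()').extract()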


Also, why are you returning dictionaries instead of Scrapy Item instances? – alecxe
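For reference, a minimal sketch of the Item-based approach alecxe suggests (the class name is illustrative, not from the question; one field per key the spider currently yields):

import scrapy

class WatchItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    productID = scrapy.Field()
    model = scrapy.Field()
    price = scrapy.Field()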


@eLRuLL I thought it was necessary because the page rendered by 'splash' is contained in 'response.body', and that is a 'str'. But I think you are right. –

Answers


Your spider works fine for me, with Scrapy 1.1 and Splash 2.1, without any modification to the code from your question, just using the settings mentioned at https://github.com/scrapy-plugins/scrapy-splash
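For completeness, these are roughly the settings.py additions the scrapy-splash README describes (a sketch; check the README for the exact, current values):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'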

As the other comments suggested, your parse function can be simplified by using response.css() and response.xpath() directly, without needing to rebuild a Selector from the response body.

I tried with:

import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class CartierSpider(scrapy.Spider):
    name = 'cartier'
    start_urls = ['http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {
            'title': response.xpath('//title/text()').extract_first(),
            'link': response.url,
            'productID': response.xpath('//span[@itemprop="productID"]/text()').extract_first(),
            'model': response.xpath('//span[@itemprop="model"]/text()').extract_first(),
            'price': response.css('div.price-wrapper').xpath('.//span[@itemprop="price"]/text()').extract_first(),
        }

and got this:

$ scrapy crawl cartier 
2016-06-08 17:16:08 [scrapy] INFO: Scrapy 1.1.0 started (bot: stack37701774) 
2016-06-08 17:16:08 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack37701774.spiders', 'SPIDER_MODULES': ['stack37701774.spiders'], 'BOT_NAME': 'stack37701774'} 
(...) 
2016-06-08 17:16:08 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy_splash.SplashCookiesMiddleware', 
'scrapy_splash.SplashMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-06-08 17:16:08 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-06-08 17:16:08 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-06-08 17:16:08 [scrapy] INFO: Spider opened 
2016-06-08 17:16:08 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-06-08 17:16:08 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-06-08 17:16:11 [scrapy] DEBUG: Crawled (200) <GET http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html via http://localhost:8050/render.html> (referer: None) 
2016-06-08 17:16:11 [scrapy] DEBUG: Scraped from <200 http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html> 
{'model': u'Ballon Bleu de Cartier watch', 'productID': u'W69017Z4', 'link': 'http://www.cartier.co.uk/en-gb/collections/watches/mens-watches/ballon-bleu-de-cartier/w69017z4-ballon-bleu-de-cartier-watch.html', 'price': None, 'title': u'CRW69017Z4 - Ballon Bleu de Cartier watch - 36 mm, steel, leather - Cartier'} 
2016-06-08 17:16:11 [scrapy] INFO: Closing spider (finished) 
2016-06-08 17:16:11 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 618, 
'downloader/request_count': 1, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 213006, 
'downloader/response_count': 1, 
'downloader/response_status_count/200': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 6, 8, 15, 16, 11, 201281), 
'item_scraped_count': 1, 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'response_received_count': 1, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'splash/render.html/request_count': 1, 
'splash/render.html/response_count/200': 1, 
'start_time': datetime.datetime(2016, 6, 8, 15, 16, 8, 545105)} 
2016-06-08 17:16:11 [scrapy] INFO: Spider closed (finished) 

Thank you! Apparently it was a misconfiguration issue, and the code, ugly as it is, works fine; going through the suggested 'scrapy-splash' configuration procedure again solved the problem. –


@PaoloBrasolin what 'misconfiguration' issue was at play here? I am curious, so I can tell whether my problem is the same. –


@RafaelAlmeida In fact, I could not trace back the precise error in my previous setup: I just initialized a new scrapy project, configured it following the scrapy-splash guide, and restarted the splash server. After the fresh start everything worked. –


I tried SplashRequest and ran into the same problem. After messing around with it, I tried executing a Lua script instead.

script = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""

Then issue the request with the script as an argument. You can tinker with the script and test it in the Splash shell at localhost:9200, or whichever other port you chose.

yield SplashRequest(
    url,
    self.parse,
    args={'lua_source': self.script},
    endpoint='execute')
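With the 'execute' endpoint, the table returned by the Lua script comes back as JSON, and scrapy-splash exposes it on the response as response.data (a dict). A sketch of a matching parse method, assuming the script above and the Selector import from the question:

def parse(self, response):
    # response.data is the dict returned by the Lua script's main():
    # here it carries 'html', 'png' and 'har' keys.
    sel = Selector(text=response.data['html'])
    yield {'title': sel.xpath('//title/text()').extract_first()}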

Oh, and by the way, the way you yield the info is weird; use Items instead.


Thanks! I will wait a bit to see whether anyone explains what was not working before accepting your answer. –


I don't have enough reputation to add a comment, so I have to make this an answer.

I was facing a similar problem with Splash 2.1: it returned "malformed" (actually: just not decompressed) HTML content if I set 'Accept-Encoding': 'gzip' for the SplashRequest.

Finally, I found the solution here: https://github.com/scrapinghub/splash/pull/102. Change 'Accept-Encoding': 'gzip' to 'Accept-Encoding': 'deflate'.
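For illustration, the header change would be applied on the request like this (a sketch; the wait argument mirrors the question's code):

yield SplashRequest(
    url, self.parse,
    args={'wait': 0.5},
    # 'deflate' instead of 'gzip' avoids the issue fixed in the linked PR
    headers={'Accept-Encoding': 'deflate'})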

I don't know why, but it works.


Thanks for the input! –