Scrapy (Python): following pagination
I have a pagination problem with the following website: http://gamesurf.tiscali.it/ps4/recensioni.html
The relevant part of my spider code:
for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next=='»':
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
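If I run my spider from a Windows terminal, it works for every page of the site, but when I deploy it to Scrapinghub and run it from the button in the dashboard, the spider scrapes only the first page. Among the log messages there is a warning: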
[py.warnings] /app/__main__.egg/reccy/spiders/reccygsall.py:21:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to
Unicode - interpreting them as being unequal.
Line 21 is this one:
if next=='»':
I have already checked that the problem is not caused by robots.txt. How can I fix this? Thanks.
Here is the whole spider:
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'reccygsall'
    allowed_domains = ['gamesurf.tiscali.it']
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title': quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text': quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item
        for pag in response.css('li.square-nav'):
            next = pag.css('li.square-nav > a > span::text').extract_first()
            if next=='»':
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    next_page_url = response.urljoin(next_page_url)
                    yield scrapy.Request(url=next_page_url, callback=self.parse)
You could try locating the element with XPath instead: '//li[@class="square-nav"]/a[span]/@href' – vold
Try adding '# -*- coding: utf-8 -*-' at the top of the spider module's source file, and use "if next == u'»':"
next == u'»': ^ SyntaxError: invalid syntax
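The UnicodeWarning in the question is the one Python 2 emits when a byte string containing non-ASCII bytes is compared with a unicode string: the comparison is treated as unequal, so the "next page" branch would never run. A minimal sketch of a comparison that avoids both the non-ASCII literal and the u'' prefix is shown below. `NEXT_LABEL` and `is_next_link` are hypothetical names introduced for illustration, not part of Scrapy, and the sketch assumes the link text is exactly the character U+00BB, possibly surrounded by whitespace.

```python
# Sketch only: NEXT_LABEL and is_next_link are hypothetical helper names.

# Build the "next page" arrow (U+00BB, right-pointing double angle quotation
# mark) from its UTF-8 bytes. The source file then stays pure ASCII and needs
# no u'' prefix, so it parses the same way on Python 2 and Python 3.
NEXT_LABEL = b'\xc2\xbb'.decode('utf-8')


def is_next_link(label):
    """Return True if the extracted link text is the "next page" arrow."""
    if label is None:
        # extract_first() returns None when the selector matches nothing.
        return False
    if isinstance(label, bytes):
        # A byte string (plain str on Python 2) is decoded before comparing,
        # so two text values are compared rather than bytes against text.
        label = label.decode('utf-8', 'replace')
    return label.strip() == NEXT_LABEL
```

In the spider, the check on line 21 would then become `if is_next_link(next):`, with the rest of the loop unchanged. Whether this fixes the run on Scrapinghub depends on the Python version used there, which the question does not state.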