Scrapy (Python): following pagination

I have a pagination problem with the following website: http://gamesurf.tiscali.it/ps4/recensioni.html

The relevant part of my spider code:

for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next=='»':
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

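If I run my spider from the Windows terminal, it works for every page of the site, but when I deploy it to Scrapinghub and run it from the button in the dashboard, the spider scrapes only the first page. Among the log messages there is a warning: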

[py.warnings] /app/__main__.egg/reccy/spiders/reccygsall.py:21: 
UnicodeWarning: Unicode equal comparison failed to convert both arguments to 
Unicode - interpreting them as being unequal.

Line 21 is this:

if next=='»':

I have already checked that the problem is not caused by robots.txt. How can I solve this? Thanks.

Here is the whole spider:

# -*- coding: utf-8 -*- 
import scrapy 


class QuotesSpider(scrapy.Spider): 
    name = 'reccygsall' 
    allowed_domains = ['gamesurf.tiscali.it'] 
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html'] 

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title': quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text': quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item


        for pag in response.css('li.square-nav'):
            next = pag.css('li.square-nav > a > span::text').extract_first()
            if next=='»':
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    next_page_url = response.urljoin(next_page_url)
                    yield scrapy.Request(url=next_page_url, callback=self.parse)

Source

2017-05-30 L. Serafino

You could try using XPath to locate the element: '//li[@class="square-nav"]/a[span]/@href' – vold
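To illustrate the XPath suggested above, here is a minimal sketch run against a made-up pagination fragment. The markup and the href values are assumptions (the page's real HTML is not shown in this thread), and it uses the standard library's ElementTree, which supports only a subset of XPath, rather than Scrapy's selectors. It also checks the span text, because a bare a[span] would presumably match the « link as well on pages after the first.

```python
import xml.etree.ElementTree as ET

RAQUO = u'\u00bb'  # the » character, written as an escape

# Hypothetical pagination fragment; the real page markup may differ.
SAMPLE = u"""
<ul>
  <li class="square-nav"><a href="recensioni-1.html"><span>\u00ab</span></a></li>
  <li><a href="recensioni-2.html">2</a></li>
  <li class="square-nav"><a href="recensioni-3.html"><span>\u00bb</span></a></li>
</ul>
"""


def next_page_href(fragment):
    """Return the href of the » link in the fragment, or None if absent."""
    root = ET.fromstring(fragment)
    # Anchors inside li.square-nav that contain a <span> child.
    for a in root.findall(".//li[@class='square-nav']/a[span]"):
        if a.find('span').text == RAQUO:
            return a.get('href')
    return None
```

Inside a spider, the same expression would normally be passed to response.xpath() and the resulting href joined with response.urljoin().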

Try adding '# -*- coding: utf-8 -*-' at the beginning of the spider module's source file, and use 'if next == u'»':' –

next == u'»': ^ SyntaxError: invalid syntax –
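One way to avoid depending on how a non-ASCII literal in the source file gets decoded is to write the character as an escape sequence and normalise whatever the selector returns before comparing. The sketch below is an assumption about what would help here, not something tested on Scrapinghub; is_next_label and RAQUO are hypothetical names.

```python
# -*- coding: utf-8 -*-

# U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, written as an escape
# so the comparison does not depend on the source file's encoding.
RAQUO = u'\u00bb'


def is_next_label(label):
    """Return True if label (text, UTF-8 bytes, or None) is the » marker."""
    if label is None:
        return False
    if isinstance(label, bytes):
        # Byte strings are decoded explicitly instead of relying on the
        # implicit ASCII conversion that triggers the UnicodeWarning.
        label = label.decode('utf-8')
    return label.strip() == RAQUO
```

In the spider, the check would then read `if is_next_label(next):` in place of the comparison against the literal.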

I found a solution:

# -*- coding: utf-8 -*- 

import scrapy 


class QuotesSpider(scrapy.Spider): 
    name = 'reccygsall' 
    allowed_domains = ['gamesurf.tiscali.it'] 
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html'] 

    contatore = 0 

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title': quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text': quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item

        # Count the pages parsed so far; the first page is handled differently.
        self.contatore = self.contatore + 1
        a = 0
        for pag in response.css('li.square-nav'):
            next = pag.css('a::text').extract_first()
            # The arrow links keep their label inside a <span>,
            # so the <a> has no direct text and this returns None.
            if next is None:
                a = a + 1
                # On the first page, follow the first arrow link;
                # on later pages, follow the second one.
                if (self.contatore < 2) or (a > 1):
                    next_page_url = pag.css('a::attr(href)').extract_first()

                    if next_page_url:
                        next_page_url = response.urljoin(next_page_url)
                        yield scrapy.Request(url=next_page_url, callback=self.parse)

Source

2017-05-30 23:57:50
