2017-05-30 107 views

Scrapy (Python): following pagination

I have a problem following pagination on this website: http://gamesurf.tiscali.it/ps4/recensioni.html

The relevant part of my spider:

for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next=='»':
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

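When I run my spider from the Windows terminal, it works for every page of the site. When I deploy it to Scrapinghub and run it from the button in the dashboard, the spider scrapes only the first page. Among the log messages there is a warning: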

[py.warnings] /app/__main__.egg/reccy/spiders/reccygsall.py:21: 
UnicodeWarning: Unicode equal comparison failed to convert both arguments to 
Unicode - interpreting them as being unequal. 

Line 21 is this:

if next=='»': 

I have checked that the problem is not caused by robots.txt. How can I fix this? Thanks.

Here is the whole spider:

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'reccygsall'
    allowed_domains = ['gamesurf.tiscali.it']
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title': quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text': quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item


        for pag in response.css('li.square-nav'):
            next = pag.css('li.square-nav > a > span::text').extract_first()
            if next=='»':
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    next_page_url = response.urljoin(next_page_url)
                    yield scrapy.Request(url=next_page_url, callback=self.parse)

You can try using XPath to locate the element: '//li[@class="square-nav"]/a[span]/@href' – vold
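To see what that expression selects, here is a minimal sketch that uses only the standard library rather than Scrapy. The markup in `SAMPLE` and the `?pag=` URLs are hypothetical, inferred from the selectors in the question, and `arrow_links` is an illustrative helper name. ElementTree has no `/@href` step, so the attribute is read with `.get("href")`; in a spider the equivalent call would be `response.xpath('//li[@class="square-nav"]/a[span]/@href').extract()`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://gamesurf.tiscali.it/ps4/recensioni.html"

# Hypothetical pagination markup; the real page may differ.
SAMPLE = u"""
<ul>
  <li class="square-nav"><a href="recensioni.html?pag=1"><span>\u00ab</span></a></li>
  <li><a href="recensioni.html?pag=2">2</a></li>
  <li class="square-nav"><a href="recensioni.html?pag=3"><span>\u00bb</span></a></li>
</ul>
""".strip()


def arrow_links(markup, base_url):
    """Return absolute URLs of <a> elements that wrap a <span> inside li.square-nav."""
    root = ET.fromstring(markup)
    anchors = root.findall(".//li[@class='square-nav']/a[span]")
    return [urljoin(base_url, a.get("href")) for a in anchors if a.get("href")]


print(arrow_links(SAMPLE, PAGE_URL))
```

Note that the expression matches both arrow links, previous and next. That is usually harmless in Scrapy, because its default duplicate filter drops requests for URLs that were already scheduled.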


Try adding '# -*- coding: utf-8 -*-' at the beginning of the spider module source file, and use 'if next == u'»':'


next == u'»': ^ SyntaxError: invalid syntax
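Some background on the warning may help here. `UnicodeWarning: Unicode equal comparison failed` is emitted only by Python 2, where a plain `'»'` literal in a UTF-8 source file is a two-byte `str` and `extract_first()` returns a `unicode` object. Python 2 tries to decode the bytes as ASCII, fails, and treats the two values as unequal, so the next-page request is never yielded. This suggests the deployed job ran under Python 2 while the local run used Python 3, although the thread does not confirm it. The sketch below shows one way to make the comparison independent of interpreter and file encoding; `NEXT_LABEL` and `is_next_label` are hypothetical names, not part of Scrapy.

```python
# U+00BB is the "»" character. Writing it as an escape in a u"" literal
# keeps the value identical on Python 2 and Python 3.3+, whatever the
# encoding of the source file.
NEXT_LABEL = u"\u00bb"


def is_next_label(label):
    """Return True if `label` is the text of the "next page" arrow."""
    if label is None:
        return False
    # Selectors return text, but decode defensively if bytes arrive.
    if isinstance(label, bytes):
        label = label.decode("utf-8")
    return label.strip() == NEXT_LABEL


# The mismatch behind the warning: two UTF-8 bytes versus one text character.
as_bytes = NEXT_LABEL.encode("utf-8")
print(repr(as_bytes), len(as_bytes), len(NEXT_LABEL))
```

In the spider, the check would then read `if is_next_label(next):` instead of `if next=='»':`.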

Answer


I found a solution:

# -*- coding: utf-8 -*- 

import scrapy 


class QuotesSpider(scrapy.Spider): 
    name = 'reccygsall' 
    allowed_domains = ['gamesurf.tiscali.it'] 
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html'] 

    contatore = 0 

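    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title': quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text': quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item

        # Number of pages parsed so far, including this one.
        self.contatore = self.contatore + 1
        # Number of arrow links seen on this page.
        a = 0
        for pag in response.css('li.square-nav'):
            # Arrow links keep their label inside a <span>, so 'a::text' is None.
            link_text = pag.css('a::text').extract_first()
            if link_text is None:
                a = a + 1
                # Follow the first arrow on the first page and the second
                # arrow on later pages, where the first one presumably
                # points back to the previous page.
                if (self.contatore < 2) or (a > 1):
                    next_page_url = pag.css('a::attr(href)').extract_first()

                    if next_page_url:
                        next_page_url = response.urljoin(next_page_url)
                        yield scrapy.Request(url=next_page_url, callback=self.parse)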