Scrapy is not following the next page URL, why?

I am scraping this site: https://www.olx.com.ar/celulares-telefonos-cat-831 with Scrapy 1.4.0. When I run the spider, everything goes well until it reaches the "next page" part. Here is the code:

# -*- coding: utf-8 -*-
import scrapy
#import time

class OlxarSpider(scrapy.Spider):
    name = "olxar"
    allowed_domains = ["olx.com.ar"]
    start_urls = ['https://www.olx.com.ar/celulares-telefonos-cat-831']

    def parse(self, response):
        #time.sleep(10)
        response = response.replace(body=response.body.replace('<br>', ''))
        SET_SELECTOR = '.item'
        for item in response.css(SET_SELECTOR):
            PRODUCTO_SELECTOR = '.items-info h3 ::text'
            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first().replace(',',' '),
            }

        NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first().replace('//','https://')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse
                                 )

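In other questions I have seen people add the dont_filter=True attribute to the Request, but that does not work for me. It just makes the spider loop over the first two pages. I added the replace('//','https://') part to fix the original href, which comes without https: and which Scrapy could not follow. Also, when I run the spider, it scrapes the first page and then returns [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.olx.com.ar/celulares-telefonos-cat-831-p-2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates). Why is it filtering the second page as a duplicate when it clearly is not one?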

I applied the solution Tarun Lalwani gave in the comments. I badly missed that detail! It works fine with the correction, thank you!


2017-09-22 A.Lorefice

This is strange code. If you use response.urljoin, why do you need .replace('//', 'https://')? Please provide the full code of the spider. – Verz1Lka

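I have already tried it without the https part and got the same result, so that is not the problem. There is something I am missing about the web page or my code. –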

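Your problem is the CSS selector. On page 1 it matches only the next-page link. On page 2 it matches both the previous-page and the next-page links. Since you pick the first match with extract_first(), you just keep rotating between the first and second pages.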
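To see how that rotation leads to the log line in the question, here is a toy model in plain Python (no Scrapy involved). The four-page layout and the name crawl_first_link are made up for illustration. The model also assumes that the start request is not recorded by the duplicate filter, which is how Scrapy's default start requests behave because they are sent with dont_filter=True.

```python
# Toy model only: page numbers stand in for URLs, no Scrapy involved.
# Each page lists its pagination targets in document order, so on every
# page after the first the "previous" link comes before the "next" link.
PAGINATION = {
    1: [2],
    2: [1, 3],
    3: [2, 4],
    4: [3],
}


def crawl_first_link(dont_filter=False, max_requests=10):
    """Always follow the first pagination link, as extract_first() does.

    Returns (pages visited in order, page whose request was filtered or None).
    """
    seen = set()   # stands in for the duplicate filter
    visited = []
    page = 1       # assumption: the start request is not recorded by the filter
    while len(visited) < max_requests:
        visited.append(page)
        target = PAGINATION[page][0]
        if target in seen and not dont_filter:
            return visited, target   # would be logged as a filtered duplicate
        seen.add(target)
        page = target
    return visited, None


filtered_run = crawl_first_link()
unfiltered_run = crawl_first_link(dont_filter=True)
print(filtered_run)    # ([1, 2, 1], 2)
print(unfiltered_run)  # ([1, 2, 1, 2, 1, 2, 1, 2, 1, 2], None)
```

In the filtered run the crawl stops when page 1 asks for page 2 a second time, which matches the "Filtered duplicate request" message for ...-p-2. With the filter disabled it simply alternates between pages 1 and 2, which is the loop described in the question.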

The fix is simple: you need to change the CSS selector from

NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'

to

NEXT_PAGE_SELECTOR = '.items-paginations-buttons a.next::attr(href)'

This will match only the next-page URL.
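The difference between the two selectors can be sketched with the standard library alone. The HTML fragment below is hypothetical (the real OLX markup may differ, and the prev class and link texts are assumptions), and AnchorCollector is an illustrative helper, not part of Scrapy. The sketch also shows that urljoin resolves a protocol-relative href on its own; Scrapy's response.urljoin is a wrapper around the same function, so the replace('//','https://') call should not be needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical pagination markup for page 2; the real OLX HTML may differ.
PAGE_2_URL = "https://www.olx.com.ar/celulares-telefonos-cat-831-p-2"
PAGE_2_HTML = """
<div class="items-paginations-buttons">
  <a class="prev" href="//www.olx.com.ar/celulares-telefonos-cat-831">Anterior</a>
  <a class="next" href="//www.olx.com.ar/celulares-telefonos-cat-831-p-3">Siguiente</a>
</div>
"""


class AnchorCollector(HTMLParser):
    """Collects (class list, href) for every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            classes = (attributes.get("class") or "").split()
            self.anchors.append((classes, attributes.get("href")))


collector = AnchorCollector()
collector.feed(PAGE_2_HTML)

# Roughly what 'a::attr(href)' + extract_first() picks here: the first anchor.
first_href = collector.anchors[0][1]
# Roughly what 'a.next::attr(href)' + extract_first() picks: only the "next" anchor.
next_href = next(
    (href for classes, href in collector.anchors if "next" in classes), None
)

# urljoin resolves the protocol-relative href against the page URL.
next_url = urljoin(PAGE_2_URL, next_href) if next_href else None
print(urljoin(PAGE_2_URL, first_href))  # the page 1 URL, i.e. going backwards
print(next_url)                         # the page 3 URL
```

Checking next_href before using it also matters on the last page: there extract_first() returns None, and calling .replace() on None raises an AttributeError before the if next_page: test is ever reached.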


2017-09-22 22:17:23
