Scrapy request/response (crawling to pages 2, 3, etc.)

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from asdf.items import AsdfItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.http.request import Request
import scrapy

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

class MySpider(BaseSpider):
    name = "asdf"

    search_text = "midi key synth"

    allowed_domains = ["http://www.amazon.com"]
    start_urls = ["http://www.amazon.com/s?ie=UTF8&page=1&rh=i%3Aaps%2Ck%3A" + search_text]

    def parse(self, response):
        #title
        view = '//a[contains(@class, "a-link-normal s-access-detail-page a-text-normal")]'
        nextPage = '//a[contains(@title, "Next Page")]'
        nextPageLink = 'http://www.amazon.com' + response.xpath(nextPage + '/@href').extract()[0]
        i = 0
        for sel in response.xpath(view):

            l = ItemLoader(item=AsdfItem(), selector=sel)
            l.add_xpath('title','.//@title')
            i+=1
            yield l.load_item()

        request = Request(nextPageLink, callback=self.parse_page2)
        request.meta['item'] = AsdfItem()
        yield request

    def parse_page2(self, reponse):
        #title
        view = '//a[contains(@class, "a-link-normal s-access-detail-page a-text-normal")]'
        nextPage = '//a[contains(@title, "Next Page")]'
        nextPageLink = 'http://www.amazon.com' + response.xpath(nextPage + '/@href').extract()[0]
        i = 0
        for sel in response.xpath(view):

            l = ItemLoader(item=AsdfItem(), selector=sel)
            l.add_xpath('title','.//@title')
            i+=1
            yield l.load_item()

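I have a Scrapy bot that crawls Amazon looking for titles. Why don't the request/response work for scraping the subsequent pages? I identify the next page by creating the nextPageLink variable and pushing it into a request. Why doesn't this work, and how can I fix it?

Ideally, I'd like to crawl all subsequent pages.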

2015-06-21 andartic

For starters, your allowed_domains should _not_ include the protocol. Try allowed_domains = ['www.amazon.com']. – tegancp
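To illustrate the comment above: allowed_domains takes bare host names, not full URLs. A minimal sketch using only the standard library (the variable names here are illustrative and not taken from the question) shows how to pull the host out of a start URL:

```python
from urllib.parse import urlparse

# Shortened version of the question's start URL, used only for illustration
start_url = "http://www.amazon.com/s?ie=UTF8&page=1"

# urlparse splits the URL; .hostname is the host without scheme, port, or path
host = urlparse(start_url).hostname

# This is the form allowed_domains expects (no "http://" prefix)
allowed_domains = [host]
print(allowed_domains)
```

A broader entry such as "amazon.com" should also work if subdomains need to be covered.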

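Some things you should consider:

Debugging: Scrapy has several ways to help you determine why your spider isn't behaving the way you want or expect. Check out Debugging Spiders in the Scrapy docs; it may be the most important page in the documentation.
Scrapy shell: In particular, the scrapy shell is invaluable for checking what is really happening with your spider (as opposed to what you want to have happen). For example, if you run scrapy shell with the URL you want to start from and then call view(response), you can verify whether the spider is landing on the page you expect.
Your code: Some specific comments from a quick look at your code:
- Remove the http:// from your allowed domains
- There are spaces in the URL you're giving the spider, which may not be what you intend
- If you want the spider to do basically the same thing on every page (i.e. collect information and follow the "next page" link), you may be better off organizing your code with a single callback method (i.e. why do you need parse_page2?)
- What is the variable i doing?
- Depending on what you are trying to accomplish, you may want to subclass CrawlSpider instead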

2015-06-22 20:03:28 tegancp
