I am trying to scrape a URL with Scrapy, but it redirects me to a page that does not exist. Scrapy: how to stop a redirect (302)?
Redirecting (302) to <GET http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197> from <GET http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx>
The problem is that http://www.shop.inonit.in/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/pid-1275197.aspx exists, but http://www.shop.inonit.in/mobile/Products/Inonit-Home-Decor--Knick-Knacks-Cushions/Shor-Sharaba/Andaz-Apna-Apna-Cushion-Cover/1275197 does not, so the crawler can't find it. I have crawled many other sites and haven't hit this problem anywhere else. Is there a way to stop this redirect?
Any help would be appreciated. Thanks.
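Scrapy's stock RedirectMiddleware can be told to leave a specific request alone through the request's meta dict. A minimal sketch of the two keys involved ('dont_redirect' and 'handle_httpstatus_list' are standard Scrapy meta keys; the variable name is illustrative):

```python
# Meta keys understood by Scrapy's built-in middlewares:
# 'dont_redirect' stops RedirectMiddleware from following the 302, and
# 'handle_httpstatus_list' lets the spider callback receive the 302 response
# instead of having it filtered out as an HTTP error.
NO_REDIRECT_META = {
    "dont_redirect": True,
    "handle_httpstatus_list": [302],
}

# In a spider this would be attached to each request, e.g.:
#   yield Request(url, meta=NO_REDIRECT_META, callback=self.parse)
```

With this meta attached, the 302 response itself reaches the callback, so the callback must be prepared to handle a response with an empty body.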
Update: here is my spider class
    # Scrapy 0.x-era API, as in the original post. The DealspiderItem import
    # path is assumed from the project name.
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    from dealspider.items import DealspiderItem  # assumed project module

    class Inon_Spider(BaseSpider):
        name = 'Inon'
        allowed_domains = ['www.shop.inonit.in']
        start_urls = ['http://www.shop.inonit.in/Products/Inonit-Gadget-Accessories-Mobile-Covers/-The-Red-Tag/Samsung-Note-2-Dead-Mau/pid-2656465.aspx']

        def parse(self, response):
            item = DealspiderItem()
            hxs = HtmlXPathSelector(response)
            title = hxs.select('//div[@class="aboutproduct"]/div[@class="container9"]/div[@class="ctl_aboutbrand"]/h1/text()').extract()
            price = hxs.select('//span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_spnWebPrice"]/span[@class="offer"]/span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_lblOfferPrice"]/text()').extract()
            prc = price[0].replace("Rs. ", "")
            description = []
            item['price'] = prc
            item['title'] = title
            item['description'] = description
            item['url'] = response.url
            return item
Thanks for your reply! But I'm a bit confused about where to put that line of code. I tried overriding start_requests, but it gave me the error "'Response' object has no attribute 'body_as_unicode'". Can we return both items and requests at the same time? – 2013-03-18 15:10:38
If the response is a redirect, you can't just call hxs = HtmlXPathSelector(response); you will have to test response.status == 302 and do a different kind of handling. hxs will fail in that case because response.body is empty for a 302 status – 2015-01-01 15:00:55
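Following the comment above, the parse callback has to branch on response.status before building a selector, since a 302 arrives with an empty body. A runnable sketch of that branching, using a stand-in response object (the real object is Scrapy's Response; FakeResponse and the returned marker dicts are illustrative, not Scrapy API):

```python
class FakeResponse:
    """Stand-in for a Scrapy Response; only the fields used below."""
    def __init__(self, status, url, body=b""):
        self.status = status
        self.url = url
        self.body = body

def parse(response):
    # A 302 arrives with an empty body, so building
    # HtmlXPathSelector(response) here would fail; skip it instead.
    if response.status == 302:
        return {"skipped_redirect": response.url}
    # Normal path: parse response.body (the original spider's XPath
    # extraction would go here); a marker dict keeps the sketch runnable.
    return {"parsed": response.url}
```

For example, parse(FakeResponse(302, some_url)) takes the skip branch, while a status-200 response with a real body takes the parsing branch.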
Has anyone tested this? It doesn't work with the current Scrapy version. I tested it with 'handle_httpstatus_list': [404, 301] and only the 404 works – 2015-07-02 20:07:26
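As an alternative to per-request meta, redirect handling can also be switched off project-wide in settings.py. A sketch (both names are standard Scrapy settings; note this is a blunt instrument that affects every request the crawler makes):

```python
# settings.py fragment: disable RedirectMiddleware for the whole project.
REDIRECT_ENABLED = False
# Let spider callbacks receive 302 responses instead of having
# HttpErrorMiddleware drop them silently.
HTTPERROR_ALLOWED_CODES = [302]
```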