2016-07-24
Scrapy: custom callbacks not working

I don't know why my spider isn't working! I am by no means a programmer, so please be kind! Haha.

Background: I am trying to scrape information about books found on Indigo, using Scrapy.

Problem: My code never enters any of my custom callbacks... only the callback named "parse" ever seems to run.

If I change the callback in the "rules" section of my code from "parse_books" to "parse", the method that lists all the links works fine and prints out every link I'm interested in. However, the callback inside that method (which points to "parse_books") is never called! Strangely, if I rename the "parse" method to something else (e.g. "testmethod") and then rename the "parse_books" method to "parse", the method that scrapes the items works fine!

What I'm trying to achieve: all I want to do is start on a page, say "bestsellers", navigate to the item-level page of every item, and scrape all the book-related information. I seem to have both parts working independently :/

The code:

import scrapy
import json
import urllib
from scrapy.http import Request
from urllib import urlencode
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import urlparse

from TEST20160709.items import IndigoItem
from TEST20160709.items import SecondaryItem


item = IndigoItem()
scrapedItem = SecondaryItem()


class IndigoSpider(CrawlSpider):

    protocol = 'https://'
    name = "site"
    allowed_domains = [
        "chapters.indigo.ca/en-ca/Books",
        "chapters.indigo.ca/en-ca/Store/Availability/"
    ]

    start_urls = [
        'https://www.chapters.indigo.ca/en-ca/books/bestsellers/',
    ]

    #extractor = SgmlLinkExtractor()s

    rules = (
        Rule(LinkExtractor(), follow=True),
        Rule(LinkExtractor(), callback="parse_books", follow=True),
    )

    def getInventory(self, bookID):
        params = {
            'pid': bookID,
            'catalog': 'books'
        }
        yield Request(
            url="https://www.chapters.indigo.ca/en-ca/Store/Availability/?" + urlencode(params),
            dont_filter=True,
            callback=self.parseInventory
        )

    def parseInventory(self, response):
        dataInventory = json.loads(response.body)

        for entry in dataInventory['Data']:
            scrapedItem['storeID'] = entry['ID']
            scrapedItem['storeType'] = entry['StoreType']
            scrapedItem['storeName'] = entry['Name']
            scrapedItem['storeAddress'] = entry['Address']
            scrapedItem['storeCity'] = entry['City']
            scrapedItem['storePostalCode'] = entry['PostalCode']
            scrapedItem['storeProvince'] = entry['Province']
            scrapedItem['storePhone'] = entry['Phone']
            scrapedItem['storeQuantity'] = entry['QTY']
            scrapedItem['storeQuantityMessage'] = entry['QTYMsg']
            scrapedItem['storeHours'] = entry['StoreHours']
            scrapedItem['storeStockAvailibility'] = entry['HasRetailStock']
            scrapedItem['storeExclusivity'] = entry['InStoreExlusive']

            yield scrapedItem

    def parse(self, response):
        # GET ALL PAGE LINKS
        all_page_links = response.xpath('//ul/li/a/@href').extract()
        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol + "www.chapters.indigo.ca", relative_link.strip())
            absolute_link = absolute_link.split("?ref=", 1)[0]
            request = scrapy.Request(absolute_link, callback=self.parse_books)
            print "FULL link: " + absolute_link

            yield Request(absolute_link, callback=self.parse_books)

    def parse_books(self, response):

        for sel in response.xpath('//form[@id="aspnetForm"]/main[@id="main"]'):
            # XML/HTTP/CSS ITEMS
            item['title'] = map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h1[@id="product-title"][@class][@data-auto-id]/text()').extract())
            item['authors'] = map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h2[@class="major-contributor"]/a[contains(@class, "byLink")][@href]/text()').extract())
            item['productSpecs'] = map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/p[@class="product-specs"]/text()').extract())
            item['instoreAvailability'] = map(unicode.strip, sel.xpath('//span[@class="stockAvailable-mesg negative"][@data-auto-id]/text()').extract())
            item['onlinePrice'] = map(unicode.strip, sel.xpath('//span[@id][@class="nonmemberprice__specialprice"]/text()').extract())
            item['listPrice'] = map(unicode.strip, sel.xpath('//del/text()').extract())

            aboutBookTemp = map(unicode.strip, sel.xpath('//div[@class="read-more"]/p/text()').extract())
            item['aboutBook'] = [aboutBookTemp]

            # Retrieve ISBN identifier and extract numeric data
            ISBN_parse = map(unicode.strip, sel.xpath('(//div[@class="isbn-info"]/p[2])[1]/text()').extract())
            item['ISBN13'] = [elem[11:] for elem in ISBN_parse]
            bookIdentifier = str(item['ISBN13'])
            bookIdentifier = re.sub("[^0-9]", "", bookIdentifier)

            print "THIS IS THE IDENTIFIER:" + bookIdentifier

            if bookIdentifier:
                yield self.getInventory(str(bookIdentifier))

            yield item
Your methods look badly indented. Could you please format the code? – masnun

Answer

One of the first problems I notice is that your allowed_domains class attribute is broken. It should contain domain names (hence the name).

In your case the correct value would be:

allowed_domains = [ 
    "chapters.indigo.ca", # subdomain.domain.top_level_domain 
] 

If you check your spider's log you will see:

DEBUG: Filtered offsite request to 'www.chapters.indigo.ca' 

This should not happen: it means Scrapy is treating requests to your own target site as offsite and dropping them.
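For context, here is a minimal sketch (plain Python 3, an approximation rather than Scrapy's actual OffsiteMiddleware source) of why an entry containing a path can never match: the offsite filter compares only the request's hostname against each `allowed_domains` entry, so the path segment `/en-ca/Books` makes the entry unmatched by every hostname:

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Rough approximation of an offsite check: a request passes only
    if its hostname equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    for domain in allowed_domains:
        if host == domain or host.endswith("." + domain):
            return False  # on-site, request allowed
    return True  # no entry matched, request filtered

url = "https://www.chapters.indigo.ca/en-ca/books/bestsellers/"
# An entry with a path never matches a hostname:
print(is_offsite(url, ["chapters.indigo.ca/en-ca/Books"]))  # True (filtered)
# A bare domain matches the www subdomain:
print(is_offsite(url, ["chapters.indigo.ca"]))              # False (allowed)
```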

Thank you! It seems to be working! The "parseInventory" method still doesn't seem to be triggered, but you've definitely saved the day. I deeply appreciate it! –
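A likely explanation for that remaining issue (an assumption, not confirmed in the thread): `getInventory` is itself a generator, so `yield self.getInventory(...)` inside `parse_books` hands Scrapy a generator object rather than a `Request`, and such objects are ignored. The fix is to yield the requests the generator produces. A minimal Python 3 sketch of the pitfall, using a string as a stand-in for a `scrapy.Request`:

```python
def get_inventory(book_id):
    # Stand-in for the question's getInventory: a generator that
    # would yield scrapy.Request objects.
    yield "Request(availability-url for %s)" % book_id

def parse_books_wrong():
    # Yields the generator object itself, not the Request inside it.
    yield get_inventory("9780000000000")

def parse_books_right():
    # Re-yields each item the inner generator produces.
    for request in get_inventory("9780000000000"):
        yield request

print(type(next(parse_books_wrong())))  # <class 'generator'>
print(next(parse_books_right()))        # the Request stand-in itself
```

In Python 3 Scrapy code, `yield from self.get_inventory(...)` expresses the same fix more compactly.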

No problem. Feel free to accept the answer if you found that it solved your problem :) – Granitosaurus

Done, and thanks again! –