Scrapy: custom callback not working

I don't know why my spider isn't working! I am by no means a programmer, so please be kind! Haha

Background: I am trying to scrape information related to books found on Indigo, using Scrapy.

Problem: My code never enters any of my custom callbacks... it seems to only work when I use the "parse" callback.

If I change the callback in the "rules" section of my code from "parse_books" to "parse", my method that lists all the links works fine and prints out all the links I'm interested in. However, the callback inside that method (which points to "parse_books") is never called! Oddly, if I rename the "parse" method to something else (e.g. "testmethod") and then rename the "parse_books" method to "parse", my method that scrapes the items works fine!

What I'm trying to achieve: All I want to do is enter a page, say "bestsellers", navigate to the corresponding item-level page for every item, and scrape all of the book-related information. I seem to have both pieces working independently :/
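To make the symptom concrete, here is a minimal pure-Python sketch of the renaming behaviour described above. `BaseSpider` and `MySpider` are simplified stand-ins, not Scrapy's real classes; the point is only that a subclass which defines its own `parse()` shadows whatever dispatch logic the base class put under that name.

```python
class BaseSpider(object):
    def parse(self, response):
        # Stands in for CrawlSpider's built-in parse(): the entry point
        # that applies the rules and routes matched links to each
        # Rule's callback (e.g. parse_books).
        return self.apply_rules(response)

    def apply_rules(self, response):
        return "rules applied -> Rule callbacks fire"


class MySpider(BaseSpider):
    def parse(self, response):
        # Defining our own parse() shadows the base-class version, so
        # apply_rules() -- and every Rule callback -- is skipped.
        return "only my parse() ran"


print(MySpider().parse(None))    # only my parse() ran
print(BaseSpider().parse(None))  # rules applied -> Rule callbacks fire
```

Under this (assumed) picture, renaming your own method away from `parse` restores the base-class behaviour, which matches what you observed.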
The code!
import scrapy
import json
import urllib
from scrapy.http import Request
from urllib import urlencode
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import urlparse

from TEST20160709.items import IndigoItem
from TEST20160709.items import SecondaryItem

item = IndigoItem()
scrapedItem = SecondaryItem()


class IndigoSpider(CrawlSpider):
    protocol = 'https://'
    name = "site"
    allowed_domains = [
        "chapters.indigo.ca/en-ca/Books",
        "chapters.indigo.ca/en-ca/Store/Availability/"
    ]
    start_urls = [
        'https://www.chapters.indigo.ca/en-ca/books/bestsellers/',
    ]

    #extractor = SgmlLinkExtractor()
    rules = (
        Rule(LinkExtractor(), follow=True),
        Rule(LinkExtractor(), callback="parse_books", follow=True),
    )

    def getInventory(self, bookID):
        params = {
            'pid': bookID,
            'catalog': 'books'
        }
        yield Request(
            url="https://www.chapters.indigo.ca/en-ca/Store/Availability/?" + urlencode(params),
            dont_filter=True,
            callback=self.parseInventory
        )

    def parseInventory(self, response):
        dataInventory = json.loads(response.body)
        for entry in dataInventory['Data']:
            scrapedItem['storeID'] = entry['ID']
            scrapedItem['storeType'] = entry['StoreType']
            scrapedItem['storeName'] = entry['Name']
            scrapedItem['storeAddress'] = entry['Address']
            scrapedItem['storeCity'] = entry['City']
            scrapedItem['storePostalCode'] = entry['PostalCode']
            scrapedItem['storeProvince'] = entry['Province']
            scrapedItem['storePhone'] = entry['Phone']
            scrapedItem['storeQuantity'] = entry['QTY']
            scrapedItem['storeQuantityMessage'] = entry['QTYMsg']
            scrapedItem['storeHours'] = entry['StoreHours']
            scrapedItem['storeStockAvailibility'] = entry['HasRetailStock']
            scrapedItem['storeExclusivity'] = entry['InStoreExlusive']

            yield scrapedItem

    def parse(self, response):
        # GET ALL PAGE LINKS
        all_page_links = response.xpath('//ul/li/a/@href').extract()
        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol + "www.chapters.indigo.ca", relative_link.strip())
            absolute_link = absolute_link.split("?ref=", 1)[0]
            request = scrapy.Request(absolute_link, callback=self.parse_books)
            print "FULL link: " + absolute_link
            yield Request(absolute_link, callback=self.parse_books)

    def parse_books(self, response):
        for sel in response.xpath('//form[@id="aspnetForm"]/main[@id="main"]'):
            # XML/HTTP/CSS ITEMS
            item['title'] = map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h1[@id="product-title"][@class][@data-auto-id]/text()').extract())
            item['authors'] = map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h2[@class="major-contributor"]/a[contains(@class, "byLink")][@href]/text()').extract())
            item['productSpecs'] = map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/p[@class="product-specs"]/text()').extract())
            item['instoreAvailability'] = map(unicode.strip, sel.xpath('//span[@class="stockAvailable-mesg negative"][@data-auto-id]/text()').extract())
            item['onlinePrice'] = map(unicode.strip, sel.xpath('//span[@id][@class="nonmemberprice__specialprice"]/text()').extract())
            item['listPrice'] = map(unicode.strip, sel.xpath('//del/text()').extract())

            aboutBookTemp = map(unicode.strip, sel.xpath('//div[@class="read-more"]/p/text()').extract())
            item['aboutBook'] = [aboutBookTemp]

            # Retrieve ISBN identifier and extract numeric data
            ISBN_parse = map(unicode.strip, sel.xpath('(//div[@class="isbn-info"]/p[2])[1]/text()').extract())
            item['ISBN13'] = [elem[11:] for elem in ISBN_parse]
            bookIdentifier = str(item['ISBN13'])
            bookIdentifier = re.sub("[^0-9]", "", bookIdentifier)

            print "THIS IS THE IDENTIFIER:" + bookIdentifier

            if bookIdentifier:
                yield self.getInventory(str(bookIdentifier))

            yield item
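One side note on the code above, separate from the callback question: `yield self.getInventory(str(bookIdentifier))` hands the caller the generator object itself rather than the `Request` produced inside it (and, as I understand it, Scrapy only acts on yielded `Request`/item objects). A minimal pure-Python sketch of the difference, with `get_inventory` and the "request" strings as illustrative stand-ins rather than real Scrapy objects:

```python
import types

def get_inventory(book_id):
    # Stand-in for the spider's getInventory(): a generator function
    # that yields one "request" per call.
    yield "availability-request-for-" + book_id

def yield_generator():
    # Mirrors `yield self.getInventory(...)`: the caller receives the
    # generator object itself, not the request inside it.
    yield get_inventory("9780123456789")

def yield_requests():
    # Iterating the generator yields the actual request objects.
    for req in get_inventory("9780123456789"):
        yield req

wrong = list(yield_generator())
right = list(yield_requests())
assert isinstance(wrong[0], types.GeneratorType)
assert right == ["availability-request-for-9780123456789"]
```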
Your methods look out of place. Could you please format the code properly? – masnun