The site I'm scraping lists multiple products with the same ID but different prices. I want to keep only the lowest-priced version of each. How can I keep the lowest-priced product in Scrapy?
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        # maps product ID -> lowest sale price seen so far
        self.ids_seen = dict()

    def process_item(self, item, spider):
        if item['ID'] in self.ids_seen:
            if item['sale_price'] > self.ids_seen[item['ID']]:
                raise DropItem("Duplicate item found: %s" % item)
        else:
            # store the price (dict assignment, not set.add())
            self.ids_seen[item['ID']] = item['sale_price']
        return item
So this code should drop an item whose price is higher than one already seen, but I can't figure out how to update the previously scraped item when the new price is lower.
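Since a pipeline passes items on as they arrive, it cannot retract an item it has already emitted; a common workaround is to buffer items in the pipeline, keyed by ID, and only export the cheapest one per ID when the spider closes (e.g. in `close_spider`). The core deduplication logic, sketched here outside Scrapy with hypothetical item dicts (field names `ID` and `sale_price` taken from the question):

    # Sketch only: keep one item per ID, preferring the lowest sale_price.
    def keep_lowest(items):
        best = {}
        for item in items:
            pid = item['ID']
            # replace the buffered item whenever a cheaper duplicate shows up
            if pid not in best or item['sale_price'] < best[pid]['sale_price']:
                best[pid] = item
        return list(best.values())

    items = [
        {'ID': 'a', 'sale_price': 10},
        {'ID': 'a', 'sale_price': 7},
        {'ID': 'b', 'sale_price': 5},
    ]
    print(keep_lowest(items))  # one item per ID, lowest price wins

In a real pipeline you would fill `best` inside `process_item` (raising `DropItem` for every item so nothing is exported early) and write out `best.values()` in `close_spider`.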
# -*- coding: utf-8 -*-
import scrapy
import urlparse
import re

class ExampleSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['domain1','domain2']
    start_urls = ['url1','url2']

    def parse(self, response):
        for href in response.css('div.catalog__main__content .c-product-card__name::attr("href")').extract():
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url=url, callback=self.parse_product)

        # follow pagination links
        href = response.css('.c-paging__next-link::attr("href")').extract_first()
        if href is not None:
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_product(self, response):
        # process the response here (omitted because it's long and doesn't add anything)
        yield {
            'product-name': name,
            'price-sale': price_sale,
            'price-regular': price_regular[:-1],
            'raw-sku': raw_sku,
            'sku': sku.replace('_','/'),
            'img': response.xpath('//img[@class="itm-img"]/@src').extract()[-1],
            'description': response.xpath('//div[@class="product-description__block"]/text()').extract_first(),
            'url' : response.url,
        }
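Note that the spider's `import urlparse` only works on Python 2; on Python 3 the same `urljoin` helper lives in `urllib.parse`, so the join calls above would look like this:

    # Python 3 equivalent of the spider's urlparse.urljoin(response.url, href)
    from urllib.parse import urljoin

    # an absolute path in href replaces the base URL's path
    print(urljoin('https://example.com/catalog/page1', '/product/42'))
    # → https://example.com/product/42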
What site are you scraping? What is the spider's code? – Umair
@Umair I can't tell you the site, but I've included the spider code. Not sure it's relevant to this question, but here it is. –