2015-04-03

Can someone help me with this problem? I'm using scrapy/python. I can't seem to stop duplicate data from being inserted into the database. For example: if my database already has Mazda at $4000, I don't want the spider to insert the scraped data again when the 'car' already exists, or when the 'car' + 'price' pair already exists. How can scrapy prevent duplicate data from being inserted into the database?

price | car 
------------- 
$4000 | Mazda <---- 
$3000 | Mazda 3 <---- 
$4000 | BMW 
$4000 | Mazda 3 <---- I also don't want to have two results like this 
$4000 | Mazda <---- I don't want to have two results like this either. Any help will be greatly appreciated - Thanks 
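One belt-and-braces option is to let the database itself reject repeats. A minimal sketch, using an in-memory SQLite table as a stand-in for the MySQL one (in MySQL the equivalent would be a UNIQUE key on `car` plus `INSERT IGNORE`; the schema here is an assumption based on the two columns shown above):

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL table. The UNIQUE constraint
# on car lets the database itself reject repeated cars.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (price TEXT, car TEXT UNIQUE)")

rows = [("$4000", "Mazda"), ("$3000", "Mazda 3"),
        ("$4000", "BMW"), ("$4000", "Mazda 3"), ("$4000", "Mazda")]

for price, car in rows:
    # INSERT OR IGNORE is SQLite's spelling of MySQL's INSERT IGNORE:
    # rows that would violate the UNIQUE constraint are silently skipped.
    conn.execute("INSERT OR IGNORE INTO data VALUES (?, ?)", (price, car))

print(conn.execute("SELECT COUNT(*) FROM data").fetchone()[0])  # 3 distinct cars
```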


pipeline.py 
------------------- 
from twisted.enterprise import adbapi 
from scrapy.exceptions import DropItem 
import MySQLdb 
import MySQLdb.cursors 

---------------------------------- 
When I add this piece of code, the crawled data does not save; when it is removed, the data does save into the database. 



class DuplicatesPipeline(object): 

    def __init__(self): 
        self.car_seen = set() 

    def process_item(self, item, spider): 
        if item['car'] in self.car_seen: 
            raise DropItem("Duplicate item found: %s" % item) 
        else: 
            self.car_seen.add(item['car']) 
            return item 
-------------------------------------- 
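To see the set-based filter in isolation, here is a standalone sketch of the same logic that runs without Scrapy installed (`DropItem` is a local stand-in for `scrapy.exceptions.DropItem`):

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

class DuplicatesPipeline(object):
    def __init__(self):
        self.car_seen = set()

    def process_item(self, item, spider):
        # Drop any item whose car name has already been seen this run.
        if item['car'] in self.car_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.car_seen.add(item['car'])
        return item

pipeline = DuplicatesPipeline()
kept = []
for item in [{'price': '$4000', 'car': 'Mazda'},
             {'price': '$4000', 'car': 'Mazda'},   # duplicate: dropped
             {'price': '$3000', 'car': 'Mazda 3'}]:
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        pass

print([i['car'] for i in kept])  # ['Mazda', 'Mazda 3']
```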

class MySQLStorePipeline(object): 

    def __init__(self): 
        self.dbpool = adbapi.ConnectionPool('MySQLdb', 
            db='test', 
            user='root', 
            passwd='test', 
            cursorclass=MySQLdb.cursors.DictCursor, 
            charset='utf8', 
            use_unicode=False 
        ) 

    def _conditional_insert(self, tx, item): 
        if item.get('price'): 
            tx.execute( 
                "insert into data (price, car) values (%s, %s)", 
                (item['price'], item['car']) 
            ) 

    def process_item(self, item, spider): 
        query = self.dbpool.runInteraction(self._conditional_insert, item) 
        return item 
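Note that the in-memory set in `DuplicatesPipeline` resets on every crawl, so it cannot catch a car stored in a previous run. A sketch of a persistent guard: have `_conditional_insert` check the table before inserting. SQLite is used here only so the example is runnable; the real pipeline would issue the same SELECT through `tx.execute` against MySQL (with `%s` placeholders instead of `?`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (price TEXT, car TEXT)")

def conditional_insert(tx, item):
    # Skip the insert when the car is already stored -- this guard
    # survives across crawler runs, unlike an in-memory set.
    tx.execute("SELECT 1 FROM data WHERE car = ?", (item['car'],))
    if tx.fetchone() is None and item.get('price'):
        tx.execute("INSERT INTO data (price, car) VALUES (?, ?)",
                   (item['price'], item['car']))

cur = conn.cursor()
for item in [{'price': '$4000', 'car': 'Mazda'},
             {'price': '$9999', 'car': 'Mazda'}]:  # same car again: skipped
    conditional_insert(cur, item)

print(cur.execute("SELECT price, car FROM data").fetchall())  # [('$4000', 'Mazda')]
```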



settings.py 
------------ 
SPIDER_MODULES = ['car.spiders'] 
NEWSPIDER_MODULE = 'car.spiders' 
ITEM_PIPELINES = ['car.pipelines.MySQLStorePipeline'] 

Answer 

Found the problem. Make sure DuplicatesPipeline runs first.

settings.py 
ITEM_PIPELINES = { 
    'car.pipelines.DuplicatesPipeline': 100, 
    'car.pipelines.MySQLStorePipeline': 200, 
} 
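Scrapy runs item pipelines in ascending order of the numbers in `ITEM_PIPELINES`, so the smaller value on `DuplicatesPipeline` guarantees duplicates are dropped before items reach the storage pipeline. A quick sketch of how that ordering falls out of the dict:

```python
# Lower numbers run first, so the duplicate filter must carry a
# smaller value than the storage pipeline.
ITEM_PIPELINES = {
    'car.pipelines.DuplicatesPipeline': 100,
    'car.pipelines.MySQLStorePipeline': 200,
}

order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order[0])  # 'car.pipelines.DuplicatesPipeline' -- filters before storage
```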