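Scrapy pipeline to check for duplicates

I pass a timestamp to DynamoDB through a plugin I downloaded. The spider runs from cron every two minutes. Previously it took the timestamp from the site via XPath, so it was unique; now a new timestamp is generated on every run, so every run creates a new entry. Could you point me to a pipeline solution that checks whether the same URL already exists, so the spider skips it?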

My spider:

import scrapy

# Timestamp, shortener and ArticleItem come from the project and its plugins.

def parse(self, response):
    for item in response.xpath("//li[contains(@class, 'river-block')]"):
        url = item.xpath(".//h2/a/@href").extract()[0]
        # A fresh timestamp is generated on every run.
        stamp = Timestamp().timestamp
        yield scrapy.Request(url, callback=self.get_details, meta={'stamp': stamp})

def get_details(self, response):
    article = ArticleItem()
    article['title'] = response.xpath("//header/h1/text()").extract_first()
    article['url'] = format(shortener.short(response.url))
    article['stamp'] = response.meta['stamp']
    yield article

My pipeline:

import boto3

class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")
        table = dynamodb.Table('x')

        # Unconditional write: nothing here checks for an existing entry.
        table.put_item(
            Item={
                'url': str(item['url']),
                'title': item['title'].encode('utf-8'),
                'stamp': item['stamp'],
            }
        )
        return item

2017-06-02 yurashark

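By default, Scrapy does not execute the same request more than once within a single run.

For more information, see the Scrapy documentation on dont_filter, which defaults to False; setting it to True makes a request bypass the duplicate filter.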
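To make the behaviour described above concrete, here is a toy model of a per-run duplicate filter. It is a simplified illustration, not Scrapy's actual code: the class name SimpleDupeFilter, the method should_schedule and the fingerprint format are all made up for this sketch. What it is meant to show is that seen requests are remembered only in memory, and that dont_filter skips the check.

```python
import hashlib


class SimpleDupeFilter:
    """Toy model of a per-run duplicate filter (not Scrapy's actual code)."""

    def __init__(self):
        # Fingerprints live in memory, so they are lost when the run ends.
        self.fingerprints = set()

    def should_schedule(self, method, url, dont_filter=False):
        """Return True if the request should be scheduled."""
        if dont_filter:
            # The request bypasses the filter entirely.
            return True
        fp = hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()
        if fp in self.fingerprints:
            return False
        self.fingerprints.add(fp)
        return True
```

Because a spider started from cron begins each run with an empty filter, this kind of filtering alone does not stop the same URL from being scraped again two minutes later.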

Anyway, as another solution, you can create an array and check whether the URL already exists in it. I think it is better to do the duplicate check here than in the pipeline, because for a duplicate you then skip the work you do not need to do.

url = response.url  # the page URL, not the title XPath
if url not in yourArray:
    article = ArticleItem()
    article['title'] = response.xpath("//header/h1/text()").extract_first()
    article['url'] = url
    article['stamp'] = response.meta['stamp']
    yourArray.append(url)
    yield article
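If the check should live in a pipeline after all, as the question asks, a minimal sketch could look like the following. The class name DuplicateUrlPipeline is hypothetical, and the fallback DropItem class exists only so the sketch can run where Scrapy is not installed. It assumes every item has a 'url' field.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Fallback so the sketch runs without Scrapy installed."""


class DuplicateUrlPipeline:
    """Drop items whose 'url' was already seen during this crawl."""

    def __init__(self):
        # A set gives constant-time membership checks.
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item['url']
        if url in self.seen_urls:
            # Raising DropItem stops the item from reaching later pipelines.
            raise DropItem(f"Duplicate url: {url}")
        self.seen_urls.add(url)
        return item
```

It would be enabled through the ITEM_PIPELINES setting with a lower order number than the DynamoDB pipeline, so that it runs first. Like the array above, the set is kept in memory only, so it does not remember URLs from earlier cron runs.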

2017-06-02 14:47:37 parik

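Will this check the items in my DynamoDB? – yurashark

The code I wrote gives you items with unique URLs, which means you will not have two items with the same URL. – parik

The URLs are unique. The timestamps are not, because they are generated every time the cron job runs. I tried 'attribute_not_exists', but that did not help me. I think I need 'exists()', but I do not know how to implement it. I am very new to Python. – yurashark

After digging through Stack Overflow questions and the Boto3 documentation, I was able to come up with a solution: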

import boto3

class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")
        table = dynamodb.Table('x')

        # The condition is evaluated against any existing item with the same
        # primary key; the write is rejected if that item has these attributes.
        table.put_item(
            Item={
                'link': str(item['link']),
                'title': item['title'].encode('utf-8'),
                'stamp': item['stamp'],
            },
            ConditionExpression='attribute_not_exists(link) AND attribute_not_exists(title)',
        )
        return item
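When the condition fails, put_item raises an exception rather than returning quietly, so a possible refinement is to catch it and treat the item as a duplicate. The sketch below rests on several assumptions: the table's partition key is the URL attribute (called 'url' here) and there is no sort key such as the timestamp; put_if_new and FakeTable are hypothetical names; FakeTable is only an in-memory stand-in so the logic can be tried without AWS credentials. With a real table, the argument would be something like boto3.resource('dynamodb').Table('x').

```python
from types import SimpleNamespace


def put_if_new(table, item):
    """Write item unless an entry with the same 'url' key already exists.

    Returns True if the item was written, False if it was a duplicate.
    """
    try:
        table.put_item(
            Item=item,
            # '#u' is a placeholder for the attribute name 'url', which
            # sidesteps any clash with DynamoDB reserved words.
            ConditionExpression='attribute_not_exists(#u)',
            ExpressionAttributeNames={'#u': 'url'},
        )
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        # An item with this key already exists, so nothing was written.
        return False
    return True


class _Duplicate(Exception):
    """Stand-in for the ConditionalCheckFailedException raised by boto3."""


class FakeTable:
    """In-memory stand-in for a boto3 Table, for dry runs without AWS."""

    meta = SimpleNamespace(client=SimpleNamespace(
        exceptions=SimpleNamespace(ConditionalCheckFailedException=_Duplicate)))

    def __init__(self):
        self.rows = {}

    def put_item(self, Item, ConditionExpression, ExpressionAttributeNames):
        # Mimics attribute_not_exists on the key: reject an existing key.
        if Item['url'] in self.rows:
            raise _Duplicate(Item['url'])
        self.rows[Item['url']] = Item
```

Inside process_item, the pipeline could call put_if_new and raise Scrapy's DropItem when it returns False, so duplicates are logged as dropped items instead of as pipeline errors.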

2017-06-02 18:59:09 yurashark
