I am building a Scrapy crawler that should crawl an entire domain looking for broken EXTERNAL links. How do I find external 404s?
I have the following:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            resp = scrapy.Request(link.url, callback=self.parse_ext)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)
When I run this code, it never reaches the parse_ext() function. I want to get the HTTP status code there and do further processing based on it.
As you can see, I use parse_ext() as the callback when I loop over the links extracted from the page in the parse_item() function.
What am I doing wrong?
Bingo! In addition, I had to add dont_filter=True to the Request object, like this: yield scrapy.Request(link.url, callback=self.parse_ext, dont_filter=True)
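For reference, a minimal sketch of how the full spider might look once the missing yield and dont_filter=True from the comment above are applied. The handle_httpstatus_list attribute, the deny_domains argument, and the status check inside parse_ext are illustrative assumptions (not part of the original post) showing one way to let 404 responses reach the callback instead of being dropped by Scrapy's HttpError middleware:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']
    # Assumption: allow 404 responses through to the callback so they can be
    # inspected (by default HttpErrorMiddleware filters out non-2xx responses).
    handle_httpstatus_list = [404]

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # deny_domains (instead of the original deny= regex) excludes links
        # that point back into the crawled site, leaving only external links.
        for link in LinkExtractor(deny_domains=self.allowed_domains).extract_links(response):
            # The Request must be yielded so Scrapy schedules it;
            # dont_filter=True stops the off-site filter from discarding it.
            yield scrapy.Request(link.url, callback=self.parse_ext, dont_filter=True)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s (status %d)', response.url, response.status)
        if response.status == 404:
            # Hypothetical further processing: record the broken external link.
            yield {'broken_url': response.url}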