
I am building a crawler with Scrapy that should crawl an entire domain looking for broken EXTERNAL links. How do I find the external 404s?

I have the following so far:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # deny=self.allowed_domains keeps only the external links
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            resp = scrapy.Request(link.url, callback=self.parse_ext)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)

When I run this code, it never reaches the parse_ext() function. I would like to get the HTTP status code there and do further processing based on it.

As you can see, I use parse_ext() as the callback when I loop over the links extracted from the page in the parse_item() function.

What am I doing wrong?

Answer


You are not returning the Request instances from the callback:

def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        yield scrapy.Request(link.url, callback=self.parse_ext)

def parse_ext(self, response):
    self.logger.info('>>>>>>>>>> Reading: %s', response.url)
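Note that even with the yield in place, a 404 would normally never reach parse_ext(), because Scrapy's HttpErrorMiddleware filters out non-2xx responses by default. A minimal sketch of one way to let them through, using the standard handle_httpstatus_list meta key (the log message is just illustrative):

def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        # let 404 responses reach the callback instead of being dropped
        yield scrapy.Request(
            link.url,
            callback=self.parse_ext,
            meta={'handle_httpstatus_list': [404]},
        )

def parse_ext(self, response):
    if response.status == 404:
        self.logger.warning('Broken external link: %s', response.url)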

Bingo! Additionally, I had to add dont_filter=True to the Request object, like so: yield scrapy.Request(link.url, callback=self.parse_ext, dont_filter=True) –
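For reference, dont_filter=True is needed here because Scrapy's OffsiteMiddleware would otherwise drop requests whose domains are not in allowed_domains. One more gap worth noting: status codes only cover links that actually return a response; links that fail at the network level (DNS errors, timeouts) never reach the callback at all. A sketch of catching those too via Request's errback parameter (the on_error name and log format are illustrative):

def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        yield scrapy.Request(
            link.url,
            callback=self.parse_ext,
            errback=self.on_error,  # invoked on network-level failures
            dont_filter=True,
        )

def on_error(self, failure):
    # failure.request is the Request that could not be fetched
    self.logger.warning('Dead link (no response): %s (%s)',
                        failure.request.url, failure.type.__name__)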