I am building a Scrapy crawler that should crawl an entire domain looking for broken EXTERNAL links. How do I find external 404s?
I have the following:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            resp = scrapy.Request(link.url, callback=self.parse_ext)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)
When I run this code, it never reaches the parse_ext() function. I want to get the HTTP status code there and do further processing based on it.
As you can see, I use parse_ext() as the callback when I loop over the links extracted from the page in the parse_item() function.
What am I doing wrong?
Bingo! In addition, I had to add dont_filter=True to the Request object, like this: yield scrapy.Request(link.url, callback=self.parse_ext, dont_filter=True)
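For reference, a minimal sketch of how the full spider might look once the missing yield and dont_filter=True from the comment above are applied. The handle_httpstatus_list attribute, the deny_domains argument, and the status check inside parse_ext are illustrative assumptions (not part of the original post) showing one way to let 404 responses reach the callback instead of being dropped by Scrapy's HttpError middleware:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']
    # Assumption: allow 404 responses through to the callback so they can be
    # inspected (by default HttpErrorMiddleware filters out non-2xx responses).
    handle_httpstatus_list = [404]

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # deny_domains (instead of the original deny= regex) excludes links
        # that point back into the crawled site, leaving only external links.
        for link in LinkExtractor(deny_domains=self.allowed_domains).extract_links(response):
            # The Request must be yielded so Scrapy schedules it;
            # dont_filter=True stops the off-site filter from discarding it.
            yield scrapy.Request(link.url, callback=self.parse_ext, dont_filter=True)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s (status %d)', response.url, response.status)
        if response.status == 404:
            # Hypothetical further processing: record the broken external link.
            yield {'broken_url': response.url}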