I have a web crawler that scrapes news articles from web pages. How do I store the URLs of the pages Scrapy has crawled?
I know how to use XpathSelector to scrape certain pieces of information from elements on a page.
However, I can't figure out how to store the URL of the page that was just crawled.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class spidey(CrawlSpider):
    name = 'spidey'
    start_urls = ['http://nytimes.com']  # urls from which the spider will start crawling
    rules = [
        # r'page/\d+' : regular expression for http://nytimes.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True),
        # r'\d{4}/\d{2}/\w+' : regular expression for http://nytimes.com/YYYY/MM/title URLs
        Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\w+']), callback='parse_articles'),
    ]
I want to store every link that matches these rules.
What do I need to add to parse_articles to store the link in my item?
def parse_articles(self, response):
    item = SpideyItem()
    item['link'] = ???
    return item
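For reference, in Scrapy the response object passed to a callback exposes the requested page's URL as response.url, so the callback can assign that to the item field. A minimal runnable sketch of the idea is below; DummyResponse stands in for scrapy.http.Response and a plain dict stands in for SpideyItem, so the snippet runs without Scrapy installed:

```python
# DummyResponse stands in for scrapy.http.Response: in a real spider,
# Scrapy passes a response whose .url attribute is the crawled page's URL.
class DummyResponse:
    def __init__(self, url):
        self.url = url

def parse_articles(response):
    item = {}                    # stands in for SpideyItem()
    item['link'] = response.url  # store the URL of the page just crawled
    return item

item = parse_articles(DummyResponse('http://nytimes.com/2012/05/some-title'))
print(item['link'])  # http://nytimes.com/2012/05/some-title
```

In an actual CrawlSpider callback the same one-liner applies: set the item field from response.url inside parse_articles.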