
Scrapy: crawling multiple domains, keeping only the URLs that reoccur on each domain

I am trying to crawl a few selected domains and visit only the important pages on those sites. My approach is to crawl one page of a domain, collect the set of URLs on it, and then crawl those URLs to find the links that reoccur across pages. This way I try to filter out all URLs that do not reoccur (content URLs such as product pages, etc.). The reason I am asking for help is that scrapy.Request is never executed more than once. This is what I have so far:

import scrapy
import urlparse

from scrapy import Request
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


class Finder(scrapy.Spider):
    name = "finder"
    start_urls = ['http://www.nu.nl/']
    uniqueDomainUrl = dict()
    maximumReoccurringPages = 5

    rules = (
        Rule(
            LinkExtractor(
                allow=('.nl', '.nu', '.info', '.net', '.com', '.org', '.info'),
                deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free', 'reddit',
                      'videos', 'youtube', 'google', 'doubleclick', 'microsoft', 'yahoo',
                      'bing', 'znet', 'stackexchang', 'twitter', 'wikipedia', 'creativecommons',
                      'mediawiki', 'wikidata'),
            ),
            process_request='parse',
            follow=True
        ),
    )

    def parse(self, response):
        self.logger.info('Entering URL: %s', response.url)
        currentUrlParse = urlparse.urlparse(response.url)
        currentDomain = currentUrlParse.hostname
        if currentDomain in self.uniqueDomainUrl:
            yield

        self.uniqueDomainUrl[currentDomain] = currentDomain

        item = ImportUrlList()
        response.meta['item'] = item

        # Reoccurring URLs
        item = self.findReoccurringUrls(response)
        list = item['list']

        self.logger.info('Output: %s', list)

        # Crawl reoccurring urls
        #for href in list:
        #    yield scrapy.Request(response.urljoin(href), callback=self.parse)

    def findReoccurringUrls(self, response):
        self.logger.info('Finding reoccurring URLs in: %s', response.url)

        item = response.meta['item']
        urls = self.findUrlsOnCurrentPage(response)
        item['list'] = urls
        response.meta['item'] = item

        # Get all URLs on each web page (limit 5 pages)
        i = 0
        for value in urls:
            i += 1
            if i > self.maximumReoccurringPages:
                break

            self.logger.info('Parse: %s', value)
            request = Request(value, callback=self.test, meta={'item': item})
            item = request.meta['item']

        return item

    def test(self, response):
        self.logger.info('Page title: %s', response.css('title').extract())
        item = response.meta['item']
        urls = self.findUrlsOnCurrentPage(response)
        item['list'] = set(item['list']) & set(urls)
        return item

    def findUrlsOnCurrentPage(self, response):
        newUrls = []
        currentUrlParse = urlparse.urlparse(response.url)
        currentDomain = currentUrlParse.hostname
        currentUrl = currentUrlParse.scheme + '://' + currentUrlParse.hostname

        for href in response.css('a::attr(href)').extract():
            newUrl = urlparse.urljoin(currentUrl, href)

            urlParse = urlparse.urlparse(newUrl)
            domain = urlParse.hostname

            if href.startswith('#'):
                continue

            if domain != currentDomain:
                continue

            if newUrl not in newUrls:
                newUrls.append(newUrl)

        return newUrls

It seems that only the first page is processed; the other Request()s are never executed, as far as I can tell from the callbacks.

Answer


What is ImportUrlList()? Did you implement it yourself?

You also forgot to actually call scrapy.Request in findReoccurringUrls:

request = scrapy.Request(value, callback=self.test, meta={'item':item}) 

def findReoccurringUrls(self, response):
    self.logger.info('Finding reoccurring URLs in: %s', response.url)

    item = response.meta['item']
    urls = self.findUrlsOnCurrentPage(response)
    item['list'] = urls
    response.meta['item'] = item

    # Get all URLs on each web page (limit 5 pages)
    i = 0
    for value in urls:
        i += 1
        if i > self.maximumReoccurringPages:
            break

        self.logger.info('Parse: %s', value)
        request = scrapy.Request(value, callback=self.test, meta={'item': item})
        item = request.meta['item']
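
One point worth adding (my note, not part of the answer above): in Scrapy, constructing a Request object by itself does nothing; the engine only downloads requests that a spider callback yields or returns. A minimal, self-contained sketch of that pattern, with a hypothetical spider name and a print_title callback standing in for the test callback used above:

import scrapy

class YieldDemoSpider(scrapy.Spider):
    # Hypothetical minimal spider, only to illustrate the yield pattern.
    name = "yield_demo"
    start_urls = ['http://www.nu.nl/']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract()[:5]:
            url = response.urljoin(href)
            # Yielding the Request hands it to the Scrapy engine; creating
            # it without yielding (as in findReoccurringUrls) means the
            # callback is never invoked.
            yield scrapy.Request(url, callback=self.print_title)

    def print_title(self, response):
        self.logger.info('Page title: %s', response.css('title').extract())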

ImportUrlList only contains a single field, list = dict(). I wanted to reuse findUrlsOnCurrentPage, so I wrote a new function for the callback, and since I was experimenting with it I named it test. On the first call the page has already been fetched, so I don't need to issue the request again.
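
For reference, the item hinted at in that comment could be reconstructed roughly like this (a hypothetical sketch based only on the comment's description of a single list field, not the asker's actual code):

import scrapy

class ImportUrlList(scrapy.Item):
    # Single field holding the list/set of URLs that reoccur across pages.
    list = scrapy.Field()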