
Basically the problem is with following links: my simple Scrapy crawler is not following links and crawling them.

I want to go from page 1..2..3..4..5... up to page 90, 90 pages in total.

Each page has around 100 links.

Each page is in this format:

http://www.consumercomplaints.in/lastcompanieslist/page/1 
http://www.consumercomplaints.in/lastcompanieslist/page/2 
http://www.consumercomplaints.in/lastcompanieslist/page/3 
http://www.consumercomplaints.in/lastcompanieslist/page/4 

This is the regex matching rule:

Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'), follow=True, callback="parse_data") 

I want to go to each page and then create a Request object to scrape all the links on each page.

Scrapy crawls only 179 links in total every time and then gives a finished status.

What am I doing wrong?

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
import urlparse 

class consumercomplaints_spider(CrawlSpider): 
    name = "test_complaints" 
    allowed_domains = ["www.consumercomplaints.in"] 
    protocol = 'http://' 

    start_urls = [ 
        "http://www.consumercomplaints.in/lastcompanieslist/" 
    ] 

    # These are the rules for matching the pagination links with a regular expression; only matched links are crawled 
    rules = [ 
        Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'), follow=True, callback="parse_data") 
    ] 

    def parse_data(self, response): 
        # Get all the links in the page using an XPath selector 
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract() 

        # Convert each relative link to an absolute one (/abc.html -> http://www.domain.com/abc.html) and then send a Request object 
        for relative_link in all_page_links: 
            print "relative link processed:" + relative_link 

            absolute_link = urlparse.urljoin(self.protocol + self.allowed_domains[0], relative_link.strip()) 
            request = scrapy.Request(absolute_link, 
                                     callback=self.parse_complaint_page) 
            return request 

        return {} 

    def parse_complaint_page(self, response): 
        print "SCRAPED" + response.url 
        return {} 

Sorry, but I don't get the difference. You need to crawl 90 links? And what are the 179 pages? – Nabin


@Nabin Edited the question, sorry. I need to follow 90 pages, and each page has 100 links to scrape. Scrapy scrapes only 179 in total – wolfgang


Are you sure that all those 100 links on each page are in the same domain? i.e. allowed_domains – Nabin

Answer


You will need to use yield instead of return.

For each new Request object, use yield request instead of return request.
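
The return inside the for loop exits parse_data on the very first iteration, so only one Request per listing page ever reaches the scheduler (and the trailing return {} never runs for pages that have links). Yielding instead turns the callback into a generator that Scrapy consumes in full. A minimal sketch of the corrected callback, reusing the names from the question's spider:

    def parse_data(self, response): 
        # Get all the complaint links on the listing page 
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract() 

        # Yield one Request per link; Scrapy drains the whole generator, 
        # so every link gets scheduled instead of only the first one 
        for relative_link in all_page_links: 
            absolute_link = urlparse.urljoin(self.protocol + self.allowed_domains[0], 
                                             relative_link.strip()) 
            yield scrapy.Request(absolute_link, callback=self.parse_complaint_page) 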

Read more about yield here, and about how it compares with return here.
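
To see the difference in isolation, here is a tiny self-contained sketch (the function names are made up for illustration):

    def first_only(): 
        for i in range(3): 
            return i # leaves the function on the first pass through the loop 

    def all_of_them(): 
        for i in range(3): 
            yield i # suspends and resumes, so the loop runs to completion 

    print first_only()   # 0 
    print list(all_of_them()) # [0, 1, 2] 

This is exactly why the spider stopped early: each listing page produced a single Request and then the callback returned.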