Basically, the problem is with following links: my simple Scrapy crawler is not following links and scraping.
I want to go through pages 1..2..3..4..5... 90 pages in total.
Each page has around 100 links.
Each page has this format:
http://www.consumercomplaints.in/lastcompanieslist/page/1
http://www.consumercomplaints.in/lastcompanieslist/page/2
http://www.consumercomplaints.in/lastcompanieslist/page/3
http://www.consumercomplaints.in/lastcompanieslist/page/4
This is the regex matching rule:
Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
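(As an aside, not part of the original post: the allow pattern can be sanity-checked against the paginated URLs with Python's re module.)

import re

# Assumption: the allow pattern is copied verbatim from the Rule above
pattern = re.compile(r'(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)')

# Prints the matched listing URL, confirming the pattern covers pages 1..90
print pattern.search('http://www.consumercomplaints.in/lastcompanieslist/page/4').group(1)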
I want to go to each page and then create a Request object to scrape all the links inside each page.
Scrapy crawls only 179 links in total every time and then gives a finished status.
What am I doing wrong?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urlparse

class consumercomplaints_spider(CrawlSpider):
    name = "test_complaints"
    allowed_domains = ["www.consumercomplaints.in"]
    protocol = 'http://'

    start_urls = [
        "http://www.consumercomplaints.in/lastcompanieslist/"
    ]

    # These are the rules for matching the domain links using a regular expression; only matched links are crawled
    rules = [
        Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'), follow=True, callback="parse_data")
    ]

    def parse_data(self, response):
        # Get all the links in the page using an XPath selector
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()

        # Convert each relative page link to an absolute link (/abc.html -> www.domain.com/abc.html) and then send a Request object
        for relative_link in all_page_links:
            print "relative link processed:" + relative_link
            absolute_link = urlparse.urljoin(self.protocol + self.allowed_domains[0], relative_link.strip())
            request = scrapy.Request(absolute_link,
                                     callback=self.parse_complaint_page)
            return request
        return {}

    def parse_complaint_page(self, response):
        print "SCRAPED " + response.url
        return {}
Sorry, but I didn't get the difference. You need to crawl 90 links? And what are the 179 pages? – Nabin
@Nabin Edited the question, sorry. I need to follow 90 pages, and each page has 100 links to scrape. Scrapy only scrapes 179 in total – wolfgang
Are you sure all those 100 links on each page are in the same domain, i.e. allowed_domain? – Nabin
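For reference, a minimal sketch (reusing the question's XPath, urlparse helper, and rule setup) of a parse_data callback that emits one Request per extracted link. The key difference from the code above is yield: a return inside the loop hands back a single Request and exits the method, which would be consistent with scraping roughly one link per listing page.

    def parse_data(self, response):
        # Same XPath as in the question: every complaint link on the listing page
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()
        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol + self.allowed_domains[0],
                                             relative_link.strip())
            # yield keeps the callback running as a generator, so every link
            # becomes a Request; return would stop after the first iteration
            yield scrapy.Request(absolute_link, callback=self.parse_complaint_page)

Regarding the last comment: requests to other domains are dropped by Scrapy's OffsiteMiddleware, which logs them as filtered offsite requests, so the crawl log is the quickest way to check whether allowed_domains is eating links.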