Scrapy recursive crawl does not scrape all pages

I want to recursively scrape data from a Chinese website. I have my spider follow the "next page" link until no "next page" is available. Here is my spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from hrb.items_hrb import HrbItem

class HrbSpider(CrawlSpider):
    name = "hrb"
    allowed_domains = ["www.harbin.gov.cn"]
    start_urls = ["http://bxt.harbin.gov.cn/hrb_bzbxt/list_hf.php"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=(u'//a[@title="\u4e0b\u4e00\u9875"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        for sel in response.xpath("//table[3]//tr[position() > 1]"):
            item = HrbItem()
            item['id'] = sel.xpath("td[1]/text()").extract()[0]
            title = sel.xpath("td[3]/a/text()").extract()[0]
            item['title'] = title.encode('gbk')
            item['time1'] = sel.xpath("td[3]/text()").extract()[0][2:12]
            item['time2'] = sel.xpath("td[5]/text()").extract()[1]
            items.append(item)
        return items
The problem is that it only scraped the first 15 pages. I browsed to page 15, and there was still a "next page" button there. So why did it stop? Is the site deliberately preventing scraping, or is there some problem with my code? If only 15 pages can be scraped at a time, is there a way to start scraping from a certain page, say? Many thanks!
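For reference, the follow-the-"next page" pattern the spider relies on can be sketched in plain Python (the `site` dict below is fabricated; it stands in for the real pages). The sketch also illustrates one plausible way such a crawl can stop early: Scrapy deduplicates requests by URL, so if a "next page" link ever resolves to an already-visited URL, the chain ends there. Whether that is what happens on this particular site is an assumption, not a diagnosis.

```python
# A minimal sketch of the follow-the-"next page" crawl pattern, using a
# fabricated `site` dict in place of real HTTP responses. The `seen` set
# mirrors Scrapy's built-in request deduplication: a "next page" link that
# resolves to an already-visited URL ends the chain.

def crawl(pages, start):
    """Visit pages from `start`, following 'next' links until none remains."""
    scraped = []
    seen = set()
    current = start
    while current in pages and current not in seen:
        seen.add(current)
        scraped.append(pages[current]["items"])
        current = pages[current].get("next")  # None = no "next page" link
    return scraped

# Fabricated site: three list pages, each linking to the next one.
site = {
    "list_hf.php?page=1": {"items": ["a", "b"], "next": "list_hf.php?page=2"},
    "list_hf.php?page=2": {"items": ["c"], "next": "list_hf.php?page=3"},
    "list_hf.php?page=3": {"items": ["d"]},  # last page: no "next" link
}

print(crawl(site, "list_hf.php?page=1"))  # [['a', 'b'], ['c'], ['d']]
print(crawl(site, "list_hf.php?page=2"))  # start mid-way: [['c'], ['d']]
```

The second call shows the answer to the "start from a certain page" question in miniature: since each list page is just a URL, pointing `start_urls` at a later page URL resumes the chain from there.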