
I have created a spider that extends CrawlSpider, following the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html, but Scrapy does not follow the links.

The problem is that I need to parse the start URL (which happens to coincide with the hostname) as well as some of the links it contains.

So I defined a rule: rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True)], but nothing happens.

Then I tried defining a set of rules like rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True), Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]. The problem now is that the spider parses everything.

How do I tell the spider to parse the start_url as well as only some of the links it contains?

Update:

I have tried overriding the parse_start_url method, so now I am able to get the data from the start page, but it still does not follow the links defined with the Rule:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 

from techCrunch.items import Article 


class ExampleSpider(CrawlSpider): 
    name = 'TechCrunchCrawler' 
    start_urls = ['http://techcrunch.com'] 
    allowed_domains = ['techcrunch.com'] 
    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_links', follow=True)] 

    # Called by CrawlSpider for the start URLs; delegate to the rule callback. 
    def parse_start_url(self, response): 
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++' 
        return self.parse_links(response) 

    def parse_links(self, response): 
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++' 
        articles = [] 
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'): 
            article = Article() 
            article['title'] = i.select('./@title').extract() 
            article['link'] = i.select('./@href').extract() 
            articles.append(article) 
        return articles 

Could you post some of your code here so we can identify the problem as well? – 2012-07-10 09:36:28

Answers


I had a similar problem in the past, and I stuck with BaseSpider.

Try this:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import Request 
from scrapy.contrib.loader import XPathItemLoader 

from techCrunch.items import Article 


class techCrunch(BaseSpider): 
    name = 'techCrunchCrawler' 
    allowed_domains = ['techcrunch.com'] 

    # Fetch the start page and hand it to the parse manager. 
    def start_requests(self): 
        return [Request("http://techcrunch.com", callback=self.parseMgr)] 

    # The parse manager extracts the items from the current page and 
    # schedules the next page for crawling. 
    def parseMgr(self, response): 
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++' 
        yield self.pageParser(response) 

        nextPage = HtmlXPathSelector(response).select("//div[@class='page-next']/a/@href").extract() 
        if nextPage: 
            yield Request(nextPage[0], callback=self.parseMgr) 

    # The page parser only parses a single page and returns its items. 
    def pageParser(self, response): 
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++' 
        loader = XPathItemLoader(item=Article(), response=response) 
        loader.add_xpath('title', '//h2[@class="headline"]/a/@title') 
        loader.add_xpath('link', '//h2[@class="headline"]/a/@href') 
        return loader.load_item() 
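
For reference, the Article item imported from techCrunch.items is not shown in the post; given the title and link fields used above, it presumably looks something like this minimal sketch:

from scrapy.item import Item, Field 


# Assumed item definition; the original post does not show techCrunch/items.py. 
class Article(Item): 
    title = Field() 
    link = Field() 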

You forgot to backslash-escape the letter d as \d:

>>> SgmlLinkExtractor(allow=r'/page/d+').extract_links(response) 
[] 
>>> SgmlLinkExtractor(allow=r'/page/\d+').extract_links(response) 
[Link(url='http://techcrunch.com/page/2/', text=u'Next Page',...)]
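
Applying that fix to the rule from the question's update (everything else unchanged), the corrected line would read:

# Escape the digit class so the extractor matches /page/2, /page/3, etc. 
rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_links', follow=True)] 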