
I have created a spider that extends CrawlSpider, following the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html, but Scrapy does not follow the links.

The problem is that I need to parse the start URL (which happens to coincide with the hostname) as well as some of the links it contains.

So I defined a rule: rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True)], but nothing happens.

Then I tried defining a set of rules like rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True), Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]. The problem now is that the spider parses everything.

How do I tell the spider to parse the start_url as well as only some of the links it contains?

Update:

I have tried overriding the parse_start_url method, so now I am able to get the data from the start page, but it still does not follow the links defined with the Rule:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 

from techCrunch.items import Article 


class ExampleSpider(CrawlSpider): 
    name = 'TechCrunchCrawler' 
    start_urls = ['http://techcrunch.com'] 
    allowed_domains = ['techcrunch.com'] 
    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_links', follow=True)] 

    # Called by CrawlSpider for the start URLs; delegate to the rule callback. 
    def parse_start_url(self, response): 
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++' 
        return self.parse_links(response) 

    def parse_links(self, response): 
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++' 
        articles = [] 
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'): 
            article = Article() 
            article['title'] = i.select('./@title').extract() 
            article['link'] = i.select('./@href').extract() 
            articles.append(article) 
        return articles 

Could you post some of your code here so we can identify the problem as well? – 2012-07-10 09:36:28

Answers


I had a similar problem in the past, and I stuck with BaseSpider.

Try this:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import Request 
from scrapy.contrib.loader import XPathItemLoader 

from techCrunch.items import Article 


class techCrunch(BaseSpider): 
    name = 'techCrunchCrawler' 
    allowed_domains = ['techcrunch.com'] 

    # Fetch the start page and hand it to the parse manager. 
    def start_requests(self): 
        return [Request("http://techcrunch.com", callback=self.parseMgr)] 

    # The parse manager extracts the items from the current page and 
    # schedules the next page for crawling. 
    def parseMgr(self, response): 
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++' 
        yield self.pageParser(response) 

        nextPage = HtmlXPathSelector(response).select("//div[@class='page-next']/a/@href").extract() 
        if nextPage: 
            yield Request(nextPage[0], callback=self.parseMgr) 

    # The page parser only parses a single page and returns its items. 
    def pageParser(self, response): 
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++' 
        loader = XPathItemLoader(item=Article(), response=response) 
        loader.add_xpath('title', '//h2[@class="headline"]/a/@title') 
        loader.add_xpath('link', '//h2[@class="headline"]/a/@href') 
        return loader.load_item() 
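
For reference, the Article item imported from techCrunch.items is not shown in the post; given the title and link fields used above, it presumably looks something like this minimal sketch:

from scrapy.item import Item, Field 


# Assumed item definition; the original post does not show techCrunch/items.py. 
class Article(Item): 
    title = Field() 
    link = Field() 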

You forgot to backslash-escape the letter d as \d:

>>> SgmlLinkExtractor(allow=r'/page/d+').extract_links(response) 
[] 
>>> SgmlLinkExtractor(allow=r'/page/\d+').extract_links(response) 
[Link(url='http://techcrunch.com/page/2/', text=u'Next Page',...)]
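
Applying that fix to the rule from the question's update (everything else unchanged), the corrected line would read:

# Escape the digit class so the extractor matches /page/2, /page/3, etc. 
rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_links', follow=True)] 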