2016-05-12

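Scrapy: correct XPath for unnamed divs with images and text

I am building a spider that traverses several paginated pages of a website and extracts data from them: http://www.usnews.com/education/best-global-universities/neuroscience-behavior

Here is the spider: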

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Rule  # scrapy.contrib.spiders is the deprecated pre-1.0 path
from scrapy.linkextractors import LinkExtractor
from lxml import html
from usnews.items import UsnewsItem


class UniversitiesSpider(scrapy.Spider):
    name = "universities"
    allowed_domains = ["usnews.com"]
    start_urls = (
        'http://www.usnews.com/education/best-global-universities/neuroscience-behavior/',
    )

    # Rules = [
    #     Rule(LinkExtractor(allow=(), restrict_xpaths=('.//a[@class="pager_link"]',)), callback="parse", follow=True)
    # ]

    def parse(self, response):
        # Each result row on the page is a div with class "sep"
        for sel in response.xpath('.//div[@class="sep"]'):
            item = UsnewsItem()
            item['name'] = sel.xpath('.//h2[@class="h-taut"]/a/text()').extract()
            item['location'] = sel.xpath('.//span[@class="t-dim t-small"]/text()').extract()
            # This is the expression that fails to return the ranking text
            item['ranking'] = sel.xpath('.//div[3]/div[2]/text()').extract()
            item['score'] = sel.xpath('.//div[@class="t-large t-strong t-constricted"]/text()').extract()
            # print(sel.xpath('.//text()').extract())
            yield item

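I am having trouble extracting the text for the 'ranking' item. According to Google Chrome's XPath suggestion, the XPath is //*[@id="resultsMain"]/div[1]/div[1]/div[3]/div[2], but it gives me a single number for the first entry and a bunch of empty values. The ranking seems to sit next to an img tag inside the div, and I am confused about how to access it to extract the text (e.g. #1, #22, etc.).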

Answer

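The following XPath should find the div that contains an img child, then return that div's non-empty text-node children, which hold the ranking: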

for sel in response.xpath('.//div[@class="sep"]'): 
    ... 
    item['ranking'] = sel.xpath('div/div[img]/text()[normalize-space()]').extract()
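
To see why the predicates matter, here is a minimal sketch that runs the same expression with lxml against a small hand-written snippet. The markup below is hypothetical and only loosely modelled on the structure described in the question, so the real usnews.com page may differ; the helper name extract_rankings is likewise just for illustration.

```python
from lxml import html

# Hypothetical markup (an assumption, not copied from usnews.com):
# each "sep" row holds an unnamed div whose ranking text sits next to an <img>.
SAMPLE = """
<div id="resultsMain">
  <div class="sep">
    <div>
      <h2 class="h-taut"><a href="#">Example University A</a></h2>
      <div>
        <img src="badge.png" alt="">
        #1
      </div>
    </div>
  </div>
  <div class="sep">
    <div>
      <h2 class="h-taut"><a href="#">Example University B</a></h2>
      <div>
        <img src="badge.png" alt="">
        #22
      </div>
    </div>
  </div>
</div>
"""


def extract_rankings(markup):
    """Return one stripped ranking string (or None) per 'sep' row."""
    tree = html.fromstring(markup.strip())
    rankings = []
    for sel in tree.xpath('.//div[@class="sep"]'):
        # div[img] keeps only divs that have an <img> child;
        # text()[normalize-space()] drops whitespace-only text nodes.
        texts = sel.xpath('div/div[img]/text()[normalize-space()]')
        rankings.append(texts[0].strip() if texts else None)
    return rankings


rankings = extract_rankings(SAMPLE)
print(rankings)
```

The matched text nodes still carry the surrounding newlines and indentation, which is why the sketch calls strip() on each value; in the spider the same cleanup could be applied to the strings returned by extract().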