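Scrapy: correct XPath for an unnamed div containing an image and text
I am building a spider that crawls several paginated pages and extracts data from this site: http://www.usnews.com/education/best-global-universities/neuroscience-behavior
This is the spider: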
# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html
from usnews.items import UsnewsItem


class UniversitiesSpider(scrapy.Spider):
    name = "universities"
    allowed_domains = ["usnews.com"]
    start_urls = (
        'http://www.usnews.com/education/best-global-universities/neuroscience-behavior/',
    )

    #Rules = [
    #Rule(LinkExtractor(allow=(), restrict_xpaths=('.//a[@class="pager_link"]',)), callback="parse", follow= True)
    #]

    def parse(self, response):
        for sel in response.xpath('.//div[@class="sep"]'):
            item = UsnewsItem()
            item['name'] = sel.xpath('.//h2[@class="h-taut"]/a/text()').extract()
            item['location'] = sel.xpath('.//span[@class="t-dim t-small"]/text()').extract()
            item['ranking'] = sel.xpath('.//div[3]/div[2]/text()').extract()
            item['score'] = sel.xpath('.//div[@class="t-large t-strong t-constricted"]/text()').extract()
            #print(sel.xpath('.//text()').extract())
            yield item
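I am having trouble extracting the text for the item "ranking". According to Google Chrome's XPath suggestion, the XPath is: //*[@id="resultsMain"]/div[1]/div[1]/div[3]/div[2]
It gives me a single number for the first entry and a bunch of empty values. The ranking seems to be implemented inside an img tag, and I am confused about how to access it to extract the text (e.g. #1, #22, etc.).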
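One possible way to narrow this down, sketched under an unverified assumption about the page markup: if the rank label sits in an element nested inside that div (next to the img) rather than directly in it, then `/text()` would return only the whitespace around the nested elements, while `//text()` would return every descendant text node. The helper below is hypothetical (the name `extract_rank` and the sample fragments are not from the site). It takes the list that `.extract()` returns and pulls out the first `#<number>` label.

```python
import re

# Matches rank labels such as "#1" or "#22", allowing whitespace after "#".
RANK_RE = re.compile(r"#\s*(\d+)")


def extract_rank(fragments):
    """Return the first '#<number>' label found in a list of text fragments.

    `fragments` is the kind of list `.extract()` returns for a `//text()`
    query, so it may contain whitespace-only strings between useful pieces.
    Returns None when no rank label is present.
    """
    text = "".join(fragments)
    match = RANK_RE.search(text)
    return "#" + match.group(1) if match else None


# Hypothetical fragments (an assumption, not real page data), standing in for
# sel.xpath('.//div[3]/div[2]//text()').extract() on one result block.
sample = ["\n    ", "\n      ", "#22", "\n    "]
print(extract_rank(sample))  # prints "#22"
```

In the spider this would replace the `item['ranking']` line with something like `item['ranking'] = extract_rank(sel.xpath('.//div[3]/div[2]//text()').extract())`. If that still comes back empty for most entries, the positional `div[3]/div[2]` path probably does not match the same element in every result block, and a class-based XPath would be more robust.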