2016-01-24 99 views

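Scraping a local XML file with Scrapy: start URL as a local file path

I want to scrape a local XML file, located in my Downloads folder, with Scrapy and use XPath to extract the relevant information.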

I am using the Scrapy introduction as a guide. The log shows:

2016-01-24 12:38:53 [scrapy] DEBUG: Retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 2 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml' 
2016-01-24 12:38:53 [scrapy] DEBUG: Gave up retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 3 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml' 
2016-01-24 12:38:53 [scrapy] ERROR: Error downloading <GET file://home/sayth/Downloads/20160123RAND0.xml> 

I have tried several versions of the code below, but I cannot get start_urls to accept my file.

# -*- coding: utf-8 -*- 
import scrapy 


class MyxmlSpider(scrapy.Spider): 
    name = "myxml" 
    allowed_domains = ["file://home/sayth/Downloads"] 
    start_urls = (
     'http://www.file://home/sayth/Downloads/20160123RAND0.xml', 
    ) 

    def parse(self, response): 
     for file in response.xpath('//meeting'): 
      full_url = response.urljoin(href.extract()) 
      yield scrapy.Request(full_url, callback=self.parse_question) 

    def parse_xml(self, response): 
     yield { 
      'name': response.xpath('//meeting/race').extract() 
     } 

Just to confirm, the file does exist at that location:

[email protected] : ~/Downloads 
[0] % ls -a 
.                Building a Responsive Website with Bootstrap [Video].zip 
..                codemirror.zip 
1.1 Situation Of Long Term Gain.xls       Complete-Python-Bootcamp-master.zip 
2008 Racedata.xls            Cox Plate 2005.xls 
20160123RAND0.xml 

Answer

Do not specify allowed_domains at all, and use three slashes after the protocol:

start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]
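
The reason the third slash matters can be sketched with the standard library alone. In a file URL, whatever follows `file://` up to the next slash is treated as the host, so the two-slash form reads `home` as a host name and drops it from the path. That matches the `/sayth/Downloads/20160123RAND0.xml` path in the error log. The snippet below is only an illustration, assuming a POSIX system; it does not use Scrapy, and the `two`, `three`, and `local_file` names are made up for this example.

```python
from pathlib import Path
from urllib.parse import urlsplit

# Two slashes: "home" is parsed as the host, so it is missing from the path.
two = urlsplit("file://home/sayth/Downloads/20160123RAND0.xml")
print(two.netloc)  # home
print(two.path)    # /sayth/Downloads/20160123RAND0.xml

# Three slashes: the host is empty and the full absolute path is kept.
three = urlsplit("file:///home/sayth/Downloads/20160123RAND0.xml")
print(three.netloc == "")  # True
print(three.path)          # /home/sayth/Downloads/20160123RAND0.xml

# Path.as_uri() builds the three-slash form from an absolute path
# (assumes a POSIX system, where "/home/..." is an absolute path).
local_file = Path("/home/sayth/Downloads/20160123RAND0.xml")
start_urls = [local_file.as_uri()]
print(start_urls)  # ['file:///home/sayth/Downloads/20160123RAND0.xml']
```

Building the URL with `as_uri()` avoids counting slashes by hand; the resulting list can be assigned to `start_urls` in the spider.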