Scrapy: combining absolute and relative links - Missing scheme

I'm new to Scrapy and am struggling to join absolute and relative links; I keep getting the error: Missing scheme in request URL. This is strange, because when I print the URL it looks correct.

I've already tried a few different solutions from Stack Overflow without any progress. Any help would be greatly appreciated!

My code:

import scrapy

class CHSpider(scrapy.Spider):
    name = "ch_companydata"
    allowed_domains = ['download.companieshouse.gov.uk']
    start_urls = ['http://download.companieshouse.gov.uk/en_output.html']

    custom_settings = {
        'DOWNLOAD_WARNSIZE': 0
    }

    def parse(self, response):
        relative_url = response.xpath("//div[@class='grid_7 push_1 omega']/ul[2]/li[1]/a/@href").extract()[0]
        download_url = response.urljoin(relative_url)
        print(download_url)
        yield {
            'file_urls': download_url
        }

The error message:

2017-08-01 09:46:36 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: companieshouse)
2017-08-01 09:46:36 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'companieshouse.spiders', 'SPIDER_MODULES': ['companieshouse.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'companieshouse'}
2017-08-01 09:46:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-08-01 09:46:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-01 09:46:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-01 09:46:37 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-08-01 09:46:37 [scrapy.core.engine] INFO: Spider opened
2017-08-01 09:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-01 09:46:37 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-01 09:46:37 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://download.companieshouse.gov.uk/robots.txt> (referer: None)
2017-08-01 09:46:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://download.companieshouse.gov.uk/en_output.html> (referer: None)
http://download.companieshouse.gov.uk/BasicCompanyData-2017-08-01-part1_5.zip
2017-08-01 09:46:37 [scrapy.core.scraper] ERROR: Error processing {'file_urls': u'http://download.companieshouse.gov.uk/BasicCompanyData-2017-08-01-part1_5.zip'}
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy\pipelines\media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "c:\python27\lib\site-packages\scrapy\pipelines\files.py", line 382, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "c:\python27\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "c:\python27\lib\site-packages\scrapy\http\request\__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2017-08-01 09:46:37 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-01 09:46:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 480,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 8455,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 1, 8, 46, 37, 415000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 1, 8, 46, 37, 69000)}
2017-08-01 09:46:37 [scrapy.core.engine] INFO: Spider closed (finished)

Answer

The file_urls field needs to contain a list of URLs.
So you should yield your item like this:

yield {
    'file_urls': [download_url]
}
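The tell-tale clue is the trailing h in "Missing scheme in request url: h". As the traceback shows, FilesPipeline builds requests with [Request(x) for x in item.get(self.files_urls_field, [])], so it iterates over whatever file_urls holds. Iterating a plain string yields single characters, and the first one is 'h', which of course has no scheme. A minimal pure-Python sketch of the difference (no Scrapy needed):

```python
download_url = "http://download.companieshouse.gov.uk/BasicCompanyData-2017-08-01-part1_5.zip"

# Iterating over a bare string yields its characters, so the pipeline
# effectively tries Request('h') -- hence "Missing scheme in request url: h".
as_string = [x for x in download_url]
print(as_string[0])  # 'h'

# Wrapping the URL in a list yields the full URL on iteration,
# which is what FilesPipeline expects.
as_list = [x for x in [download_url]]
print(as_list[0])  # the complete URL
```

The same applies to image_urls when using ImagesPipeline: always assign a list, even for a single URL.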