
Downloading a CSV file with Scrapy - Python

I am trying to download a CSV file using Scrapy 1.3.2 and Python 2.7.13, with no luck so far.

Here is the spider code:

import scrapy


class FinancialFilesItem(scrapy.Item):
    # FilesPipeline expects these two fields on the item
    file_urls = scrapy.Field()
    files = scrapy.Field()


class FinancialsSpider(scrapy.Spider):
    name = "Financials Spider"
    allowed_domains = ["financials.morningstar.com"]

    def __init__(self, url):
        super(FinancialsSpider, self).__init__()
        self.start_urls = url  # expects a list of URLs

    def parse(self, response):
        result = FinancialFilesItem()
        # hand the response URL to FilesPipeline for download
        result['file_urls'] = [response.url]
        yield result

And here is the main code, where the spider is called:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scraper.spiders.financialsSpider import FinancialsSpider


def GetFinancials(url):
    settings = Settings()

    # enable the built-in FilesPipeline and point it at the download folder
    settings.set('ITEM_PIPELINES', {'scrapy.pipelines.files.FilesPipeline': 1})
    settings.set('FILES_STORE', 'D:/downloads/')

    process = CrawlerProcess(settings)
    process.crawl(FinancialsSpider, url=url)
    process.start()


GetFinancials(["http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB"])

Here is the log from running the main code:

2017-02-18 15:22:38 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot) 
2017-02-18 15:22:38 [scrapy.utils.log] INFO: Overridden settings: {} 
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-02-18 15:22:38 [scrapy.middleware] INFO: Enabled item pipelines: 
['scrapy.pipelines.files.FilesPipeline'] 
2017-02-18 15:22:38 [scrapy.core.engine] INFO: Spider opened 
2017-02-18 15:22:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-02-18 15:22:38 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-02-18 15:22:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> (referer: None) 
2017-02-18 15:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> (referer: None) 
2017-02-18 15:22:40 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> referred in <None> 
2017-02-18 15:22:40 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> referred in <None> 
Traceback (most recent call last): 
    File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded 
    checksum = self.file_downloaded(response, request, info) 
    File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 389, in file_downloaded 
    self.store.persist_file(path, buf, info) 
    File "C:\Python27\lib\site-packages\scrapy\pipelines\files.py", line 54, in persist_file 
    with open(absolute_path, 'wb') as f: 
IOError: [Errno 22] invalid mode ('wb') or filename: 'D:/full\\01958104292b4813abcda051da56e55e72d22fb9.html?t=FB' 
2017-02-18 15:22:40 [scrapy.core.scraper] DEBUG: Scraped from <200 http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB> 
{'file_urls': ['http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB'], 
'files': []} 
2017-02-18 15:22:40 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-02-18 15:22:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 555, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 5970, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'file_count': 1, 
'file_status_count/downloaded': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 2, 18, 14, 22, 40, 160000), 
'item_scraped_count': 1, 
'log_count/DEBUG': 5, 
'log_count/ERROR': 1, 
'log_count/INFO': 7, 
'response_received_count': 2, 
'scheduler/dequeued': 1, 
'scheduler/dequeued/memory': 1, 
'scheduler/enqueued': 1, 
'scheduler/enqueued/memory': 1, 
'start_time': datetime.datetime(2017, 2, 18, 14, 22, 38, 826000)} 
2017-02-18 15:22:40 [scrapy.core.engine] INFO: Spider closed (finished) 

Thanks for your answers.

Answers

Have you tried exporting to CSV?

scrapy crawl nameofspider -o file.csv 
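If the spider is started from a script instead of the command line, the same export can be expressed through Scrapy's feed-export settings. A minimal sketch (the `file.csv` name is just a placeholder) of the settings that would be passed to `CrawlerProcess`:

```python
# sketch: programmatic equivalent of `scrapy crawl ... -o file.csv`,
# using the Scrapy 1.x feed-export setting names FEED_FORMAT/FEED_URI
feed_settings = {
    'FEED_FORMAT': 'csv',    # serialize scraped items as CSV
    'FEED_URI': 'file.csv',  # where the exported file is written
}
```

Note that feed export writes the *scraped items* as CSV; it does not save the remote CSV file itself, which is what FilesPipeline is for.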

The problem is right there in the log:

IOError: [Errno 22] invalid mode ('wb') or filename: 'D:/full\\01958104292b4813abcda051da56e55e72d22fb9.html?t=FB' 

Change the path, since you are on Windows:

settings.set('FILES_STORE', 'D:\\downloads')
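The traceback also shows that the generated filename ends in `.html?t=FB`: the default FilesPipeline path keeps the URL's query string, and `?` is not a legal character in Windows filenames. One possible workaround is to derive the stored path from the URL with the query stripped. The helper below is a hypothetical sketch; it would be used inside a `file_path` override on a FilesPipeline subclass:

```python
import hashlib
import os

try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def query_free_file_path(url):
    # Hypothetical helper: builds a 'full/<sha1><ext>' path the way
    # FilesPipeline does, but from the URL *without* its query string,
    # so characters like '?' never reach the Windows filename.
    parsed = urlparse(url)
    bare = parsed.scheme + '://' + parsed.netloc + parsed.path
    ext = os.path.splitext(parsed.path)[1]   # '.html', '.csv', ...
    digest = hashlib.sha1(bare.encode('utf-8')).hexdigest()
    return 'full/%s%s' % (digest, ext)

print(query_free_file_path(
    'http://financials.morningstar.com/ajax/exportKR2CSV.html?t=FB'))
```

Plugged into a FilesPipeline subclass as `file_path(self, request, response=None, info=None): return query_free_file_path(request.url)`, this should keep the stored name Windows-safe regardless of the query string.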