
I am trying to scrape all of the CSVs from transparentnevada.com with Scrapy.

When you navigate to a specific agency, e.g. http://transparentnevada.com/salaries/2016/university-nevada-reno/, and hit Download Records, there are links to several CSVs. I want to download all of them.

My spider runs and appears to crawl all the records, but it doesn't download anything:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.http import Request 


class Spider2(CrawlSpider):
    # name of the spider
    name = 'nevada'

    # list of allowed domains
    allowed_domains = ['transparentnevada.com']

    # starting url for scraping
    start_urls = ['http://transparentnevada.com/salaries/all/']
    rules = [
        Rule(LinkExtractor(allow=['/salaries/all/*']),
             follow=True),
        Rule(LinkExtractor(allow=['/salaries/2016/*/']),
             follow=True),
        Rule(LinkExtractor(allow=['/salaries/2016/*/#']),
             callback='parse_article',
             follow=True),
    ]

    # setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'tmp/nevada2.csv',
    }

    def parse_article(self, response):
        for href in response.css('div.view-downloads a[href$=".csv"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf,
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving CSV %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
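
As a side note (my addition, not part of the original post): response.url.split('/')[-1] keeps any query string as part of the saved filename. A slightly more defensive sketch of the same callback derives the name from the URL path instead:

import os
from urllib.parse import urlparse

def save_pdf(self, response):
    # use only the path component, so a '?query=string' never ends up in the filename
    path = os.path.basename(urlparse(response.url).path)
    self.logger.info('Saving CSV %s', path)
    with open(path, 'wb') as f:
        f.write(response.body)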

Logs? Create them on pastebin and post a link –

Answer


The problem is that the CSVs live under /export/, and none of your rules match that path.

I added a simple LinkExtractor to your scraper, and it is downloading the files:

Rule(LinkExtractor(allow=[r'/export/.*\.csv']),
     callback='save_pdf',
     follow=True),

Also, the rules you have above are not 100% correct: you used '/*' where it should be '/.*/'.

In a regular expression, '/*' means the slash itself repeated zero or more times, so it matches things like '////' rather than 'a slash followed by anything'. Fix your rules, add the rule I gave, and it should get the job done.
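
Putting both fixes together, a corrected rule set might look like the sketch below (the '.*' patterns and the /export/ rule follow the answer above; treat it as a starting point, not a verified spider). It goes inside the spider class from the question:

rules = [
    # '.*' matches any run of characters; a bare '*' only repeats the preceding '/'
    Rule(LinkExtractor(allow=[r'/salaries/all/.*']),
         follow=True),
    Rule(LinkExtractor(allow=[r'/salaries/2016/.*/']),
         follow=True),
    # the CSV downloads themselves live under /export/
    Rule(LinkExtractor(allow=[r'/export/.*\.csv']),
         callback='save_pdf',
         follow=True),
]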