
Scrapy Media Pipeline, files not downloading

I am new to Scrapy. I am trying to download files using the media pipeline, but when I run the spider, no files are stored in the folder.

The spider:

import scrapy
from scrapy import Request
from pagalworld.items import PagalworldItem


class JobsSpider(scrapy.Spider):
    name = "songs"
    allowed_domains = ["pagalworld.me"]
    start_urls = ['https://pagalworld.me/category/11598/Latest%20Bollywood%20Hindi%20Mp3%20Songs%20-%202017.html']

    def parse(self, response):
        # Collect links to the individual listing pages
        urls = response.xpath('//div[@class="pageLinkList"]/ul/li/a/@href').extract()
        for link in urls:
            yield Request(link, callback=self.parse_page)

    def parse_page(self, response):
        # Each listing page links to one page per song
        songName = response.xpath('//li/b/a/@href').extract()
        for song in songName:
            yield Request(song, callback=self.parsing_link)

    def parsing_link(self, response):
        item = PagalworldItem()
        item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
        yield {"download_link": item['file_urls']}

The items file:

import scrapy


class PagalworldItem(scrapy.Item):
    # Populated by the spider; read by FilesPipeline
    file_urls = scrapy.Field()

The settings file:

BOT_NAME = 'pagalworld' 

SPIDER_MODULES = ['pagalworld.spiders'] 
NEWSPIDER_MODULE = 'pagalworld.spiders' 
ROBOTSTXT_OBEY = True 
CONCURRENT_REQUESTS = 5 
DOWNLOAD_DELAY = 3 
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1
}
FILES_STORE = '/tmp/media/' 

The output looks like this: (screenshot not reproduced)


You haven't written any code to download/save the files. Have a look here for some ideas: https://stackoverflow.com/questions/36135809/using-scrapy-to-to-find-and-download-pdf-files-from-a-website Hope this helps – Nabin

Answer

def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield {"download_link": item['file_urls']}

You are yielding:

yield {"download_link": ['http://someurl.com']} 

For Scrapy's media/files pipeline to work, you need to yield an item that contains a file_urls field. So try this:

def parsing_link(self, response):
    item = PagalworldItem()
    item['file_urls'] = response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    yield item
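
If the hrefs extracted here turn out to be relative, the pipeline cannot request them (Scrapy raises "Missing scheme in request url"), so it can also be worth resolving them against the page URL first. A sketch of that variant, assuming relative links are possible (the urljoin call is an addition for illustration, not part of the original answer):

def parsing_link(self, response):
    item = PagalworldItem()
    # Resolve possibly-relative hrefs: FilesPipeline only downloads absolute URLs
    item['file_urls'] = [
        response.urljoin(href)
        for href in response.xpath('//div[@class="menu_row"]/a[@class="touch"]/@href').extract()
    ]
    yield item

As a further check, if PagalworldItem also declares a files field (files = scrapy.Field()), FilesPipeline will store the download results (url, path, checksum) there, which makes it easy to see whether anything was actually fetched.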

Earlier I tried parsing with a CrawlSpider, but it didn't work. You can see it here: https://stackoverflow.com/questions/45447451/scrapy-results-are-repeating – emon