2017-01-16

Scrapy can't get the correct response

I am trying to scrape song and singer data from http://music.163.com/#/artist?id=16686, but I can't get the correct response.

I checked in the Scrapy shell: when I request "music.163.com/#/artist?id=16686", the response I get back is for "music.163.com". I don't know why.

Here is the log:

C:\Users\lszxw\PycharmProjects\untitled\scrapy\tutorial\tutorial>scrapy shell http://music.163.com/#/artist/album?id=16686 
2017-01-16 22:47:03 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: tutorial) 
2017-01-16 22:47:03 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'LOGSTATS_INTERVAL': 0, 'BOT_NAME': 'tutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True} 
2017-01-16 22:47:03 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.corestats.CoreStats', 
'scrapy.extensions.telnet.TelnetConsole'] 
2017-01-16 22:47:03 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-01-16 22:47:03 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-01-16 22:47:03 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-01-16 22:47:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-01-16 22:47:03 [scrapy.core.engine] INFO: Spider opened 
2017-01-16 22:47:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://music.163.com/robots.txt> (referer: None) 
2017-01-16 22:47:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://music.163.com/#/artist/album?id=16686> (referer: None) 
2017-01-16 22:47:04 [traitlets] DEBUG: Using default logger 
2017-01-16 22:47:04 [traitlets] DEBUG: Using default logger 
[s] Available Scrapy objects: 
[s] scrapy  scrapy module (contains scrapy.Request, scrapy.Selector, etc) 
[s] crawler <scrapy.crawler.Crawler object at 0x0000018893DAB9E8> 
[s] item  {} 
[s] request <GET http://music.163.com/#/artist/album?id=16686> 
[s] response <200 http://music.163.com/> 
[s] settings <scrapy.settings.Settings object at 0x0000018893DCEF98> 
[s] spider  <DefaultSpider 'default' at 0x18893fd39b0> 
[s] Useful shortcuts: 
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed) 
[s] fetch(req)     Fetch a scrapy.Request and update local objects 
[s] shelp()   Shell help (print this help) 
[s] view(response) View response in a browser 

Here is my code, which contains the real URLs:

import scrapy


class KokiaSpider(scrapy.Spider):
    name = 'kokia'

    def start_requests(self):
        # Paginated album listing (limit=12, offsets 0 to 60).
        start_urls = [
            "http://music.163.com/#/artist/album?id=16686&limit=12&offset=0",
            "http://music.163.com/#/artist/album?id=16686&limit=12&offset=12",
            "http://music.163.com/#/artist/album?id=16686&limit=12&offset=24",
            "http://music.163.com/#/artist/album?id=16686&limit=12&offset=36",
            "http://music.163.com/#/artist/album?id=16686&limit=12&offset=48",
            "http://music.163.com/#/artist/album?id=16686&limit=12&offset=60",
        ]

        # This reassignment replaces the list above, so only one URL is requested.
        start_urls = [
            "http://music.163.com/#/artist/album?id=16686",
        ]
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # url = 'http://music.163.com/#'
        for item in response.xpath('//*[@id="m-song-module"]/li/p[1]/a/@href'):
            # full_url = url + item.extract()
            full_url = response.urljoin(item.extract())
            self.log('full_url %s' % full_url)
            # yield scrapy.Request(full_url, callback=self.parse_album)

    def parse_album(self, response):
        for item in response.xpath('//table[@class="m-table"]/tbody/tr/td[2]//a/@href'):
            # full_url = url + item.extract()
            full_url = response.urljoin(item.extract())
            self.log('full_url %s' % full_url)
            yield scrapy.Request(full_url, callback=self.parse_song)

    def parse_song(self, response):
        song_name = response.xpath('//div[@class="hd"]/div/em/text()').extract_first()
        singer_name = response.xpath('//p[@class="s-fc4"][1]/span/a/text()').extract_first()
        album_name = response.xpath('//p[@class="s-fc4"][2]/a/text()').extract_first()
        # extract_first() returns the text (or None) rather than a SelectorList.
        comments_num = response.xpath('//*[@id="cnt_comment_count"]/text()').extract_first()
        yield {
            "song:": song_name,
            "singer:": singer_name,
            "album:": album_name,
            "comments:": comments_num,
        }

Answer

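The start_urls seem to be incorrect. Everything after the # is a URL fragment, which is never sent to the server, so the request actually fetches http://music.163.com/. If you inspect the network tab and the page source, you will see that the album/song data is actually contained in a tag that points to an almost identical URL, just without the #: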

"http://music.163.com/#/artist/album?id=16686" 
# becomes: 
"http://music.163.com/artist/album?id=16686" 

After that, your XPath for the songs in parse_album is incorrect. I got this one to work:

"//ul[@class='f-hide']/li/a/@href[contains(.,'song')]" 

After that, everything seems to work.
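As a rough illustration of the two fixes above, the sketch below rebuilds the question's paginated URLs without the # and applies the same song-link selection to a small sample. It uses only the standard library, so it runs without Scrapy. Several things here are assumptions rather than facts from the thread: the names album_page_urls and song_links are hypothetical, the sample markup is invented and far simpler than the real page, the second song id and the album id are made up, and it is not confirmed that the hash-free URL honours limit and offset. ElementTree supports only a subset of XPath, so the contains(., 'song') test is done in Python instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, urljoin

# Hash-free form of the album listing URL, as suggested in the answer.
ALBUM_LIST = "http://music.163.com/artist/album"

# Hypothetical, heavily simplified album-page markup (assumption, not the real page).
SAMPLE_ALBUM_PAGE = """
<div>
  <ul class="f-hide">
    <li><a href="/song?id=26113572">Song A</a></li>
    <li><a href="/album?id=1">Not a song link</a></li>
    <li><a href="/song?id=26113573">Song B</a></li>
  </ul>
</div>
"""


def album_page_urls(artist_id, pages=6, limit=12):
    """Rebuild the question's paginated URLs without the '#'.

    Assumes the hash-free URL accepts the same limit/offset parameters.
    """
    return [
        ALBUM_LIST + "?" + urlencode(
            {"id": artist_id, "limit": limit, "offset": page * limit}
        )
        for page in range(pages)
    ]


def song_links(markup, page_url):
    """Return absolute song URLs found under ul.f-hide in the given markup."""
    root = ET.fromstring(markup.strip())
    # Same structure as //ul[@class='f-hide']/li/a in the answer's XPath.
    anchors = root.findall(".//ul[@class='f-hide']/li/a")
    hrefs = [a.get("href", "") for a in anchors]
    # Python-side stand-in for the XPath predicate [contains(., 'song')].
    return [urljoin(page_url, href) for href in hrefs if "song" in href]


for url in album_page_urls(16686):
    print(url)
for link in song_links(SAMPLE_ALBUM_PAGE, "http://music.163.com/album?id=1"):
    print(link)
```

In a spider, response.xpath() accepts the full XPath from the answer directly and response.urljoin() does the job of urljoin here, so the Python-side filter would not be needed.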

Thanks for your analysis. I'll revise the code and test again.

Thx Granit. But it seems I can't get the comment count on "http://music.163.com/#/song?id=26113572". In the HTML I can see the value 61, but I can't get that number using '//span[@id="cnt_comment_count"]/text()'. Is that due to Ajax or the '#'?
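On the '#' half of that question, a small standard-library sketch can show which part of the song URL actually reaches the server. The fragment (everything after #) stays on the client, which matches the log in the question, where the request for a # URL came back as a response for http://music.163.com/. The helper names split_request and unhash are hypothetical. The sketch also assumes that the hash-free song URL serves the song page, as the answer found for the album listing. Whether the comment count is additionally filled in by Ajax cannot be determined from this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def split_request(url):
    """Return (URL sent to the server, client-side fragment) for a URL."""
    parts = urlsplit(url)
    # Rebuild the URL with an empty fragment: this is what an HTTP client requests.
    sent = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
    return sent, parts.fragment


def unhash(url):
    """Hypothetical helper: turn 'http://host/#/path?query' into 'http://host/path?query'."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("/"):
        return url  # nothing that looks like a hash route, leave it alone
    return "{}://{}{}".format(parts.scheme, parts.netloc, parts.fragment)


song_url = "http://music.163.com/#/song?id=26113572"
sent, fragment = split_request(song_url)
print("requested:", sent)
print("fragment: ", fragment)
print("hash-free:", unhash(song_url))
```

Note that the ?id=... part comes after the #, so it belongs to the fragment and is dropped from the request as well.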