2017-04-20 136 views

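In Scrapy, unable to extract the text of a link containing "@"

In the Scrapy shell for the URL http://www.apkmirror.com/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/foldinghome-1-00-40-android-apk-download/, I am trying to extract the developer, app, and version names from the breadcrumb navigation bar.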


I tried the following XPath selector:

In [6]: response.xpath('//*[@class="breadcrumbs"]//a/text()').extract() 
Out[6]: [u'Sony Mobile Communications', u'1.00.40'] 

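However, note that the app name, Folding@Home, is missing from the result. I don't understand this, because the name does appear inside an <a> tag when I use "Inspect" in Chrome.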


Moreover, on a similar page, http://www.apkmirror.com/apk/oculus-vr/oculus-rooms/oculus-rooms-0-0-2-release/oculus-rooms-0-0-2-android-apk-download/, the same selector works without problems:

In [1]: response.xpath('//*[@class="breadcrumbs"]//a/text()').extract() 
Out[1]: [u'Oculus VR', u'Oculus Rooms', u'0.0.2'] 

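I'm starting to suspect that this is some kind of bug in Scrapy whereby it fails to select the text() of <a> elements that contain an @ symbol. Could that be the case?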

Answers


As you have already discovered, one of the breadcrumb links is "protected": its text is built dynamically by JavaScript that runs in the browser.

A simple way around this is to render the page with Splash, using the scrapy-splash middleware. This worked for me:

import scrapy 
from scrapy_splash import SplashRequest 


class ApkSpider(scrapy.Spider): 
    name = "apkmirror" 
    allowed_domains = ['apkmirror.com'] 

    def start_requests(self):
        # Send the request through Splash so the page's JavaScript is executed.
        yield SplashRequest(
            'http://www.apkmirror.com/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/foldinghome-1-00-40-android-apk-download/',
            self.parse_result,
        )

    def parse_result(self, response):
        # The response holds the rendered HTML, so the usual selector applies.
        print(response.xpath('//*[@class="breadcrumbs"]//a/text()').extract())

with the following settings:

SPLASH_URL = 'http://127.0.0.1:8050' 
SPLASH_COOKIES_DEBUG = True 

DOWNLOADER_MIDDLEWARES = { 
    'scrapy_splash.SplashCookiesMiddleware': 723, 
    'scrapy_splash.SplashMiddleware': 725, 
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
} 

SPIDER_MIDDLEWARES = { 
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, 
} 

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' 

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage' 

Splash runs in a Docker container and listens on port 8050.

This prints:

[u'Sony Mobile Communications', u'Folding@Home', u'1.00.40']

Looking at the page with Chrome's "View page source" option rather than "Inspect", I see that in the navigation bar this particular link contains JavaScript:

<nav style="margin-left:16px; margin-right:16px;" class="navbar navbar-default" role="navigation"> 
<div style="color: #013967 !important;" class="breadcrumbs"><a class="withoutripple" style="color: #013967 !important;" href="/apk/sony-mobile-communications/">Sony Mobile Communications</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="withoutripple " style="color: #013967 !important;" href="/apk/sony-mobile-communications/foldinghome/"><span class="__cf_email__" data-cfemail="c781a8aba3aea9a0878fa8aaa2">[email&#160;protected]</span><script data-cfhash='f9e31' type="text/javascript">/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */</script></a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="active withoutripple" style="color: #013967 !important;" href="/apk/sony-mobile-communications/foldinghome/foldinghome-1-00-40-release/">1.00.40</a> </nav> 

whereas the second example, the Oculus Rooms page, does not:

<nav style="margin-left:16px; margin-right:16px;" class="navbar navbar-default" role="navigation"> 
<div style="color: #646464 !important;" class="breadcrumbs"><a class="withoutripple" style="color: #646464 !important;" href="/apk/oculus-vr/">Oculus VR</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="withoutripple " style="color: #646464 !important;" href="/apk/oculus-vr/oculus-rooms/">Oculus Rooms</a> <svg class="icon chevron-icon"><use xlink:href="#apkm-icon-chevron"></use></svg> <a class="active withoutripple" style="color: #646464 !important;" href="/apk/oculus-vr/oculus-rooms/oculus-rooms-0-0-2-release/">0.0.2</a> </nav> 
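The difference between the two source snippets above can be reproduced offline. The following is a minimal sketch that uses the standard library's ElementTree as a stand-in for Scrapy's selectors (an assumption made so the example runs without Scrapy), with simplified copies of the two links in which the inline script is omitted. It shows that the protected link has no text directly under <a>; the placeholder text sits in a nested <span>, which is why a selector ending in a/text() skips it.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copies of the two breadcrumb links from the page
# sources above (styling attributes and the inline <script> are omitted).
PROTECTED = (
    '<a href="/apk/sony-mobile-communications/foldinghome/">'
    '<span class="__cf_email__" data-cfemail="c781a8aba3aea9a0878fa8aaa2">'
    '[email&#160;protected]</span></a>'
)
PLAIN = '<a href="/apk/oculus-vr/oculus-rooms/">Oculus Rooms</a>'


def direct_text(fragment):
    """Text directly under the root element, before any child element.

    For these fragments this matches what the XPath a/text() would select.
    """
    return ET.fromstring(fragment).text or ''


def all_text(fragment):
    """Text of the root element and all descendants, like XPath a//text()."""
    return ''.join(ET.fromstring(fragment).itertext())


print(repr(direct_text(PLAIN)))      # the link text itself
print(repr(direct_text(PROTECTED)))  # empty: the text is inside the <span>
print(repr(all_text(PROTECTED)))     # only the placeholder, not the real name
```

Switching the selector to a//text() would therefore return the "[email protected]" placeholder (plus the script body on the real page), not the actual app name.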

Scrapy does not execute JavaScript on its own; this is a known limitation (see https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/).
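If running Splash is not an option, the protected text can probably be recovered without executing any JavaScript. The sketch below re-implements in Python what the inline script in the page source above appears to do: the first byte of the data-cfemail attribute is used as an XOR key for the remaining bytes. The function names decode_cfemail and decode_all are hypothetical helpers, not part of Scrapy or any library, and the scheme is inferred only from the script shown here, so it may change.

```python
import re

# Matches the hex payload of a data-cfemail attribute, as seen in the
# page source above. This regex is an assumption about the markup.
CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode_cfemail(encoded):
    """Decode one data-cfemail value (hypothetical helper).

    Mirrors the inline script: byte 0 is the XOR key, the remaining
    bytes XORed with that key form the UTF-8 encoded text.
    """
    data = bytes.fromhex(encoded)
    key, payload = data[0], data[1:]
    return bytes(b ^ key for b in payload).decode('utf-8')


def decode_all(html):
    """Return the decoded text of every data-cfemail attribute in html."""
    return [decode_cfemail(value) for value in CFEMAIL_RE.findall(html)]


snippet = (
    '<span class="__cf_email__" '
    'data-cfemail="c781a8aba3aea9a0878fa8aaa2">[email&#160;protected]</span>'
)
print(decode_all(snippet))  # ['Folding@Home']
```

Inside a spider, the attribute values could instead be collected with a selector such as response.xpath('//*[@class="breadcrumbs"]//a//@data-cfemail').extract() and passed to decode_cfemail one by one.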