2016-09-19 35 views

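Scraping data from interactive charts

There is a website with some interactive charts that I would like to extract data from. I have written a few web scrapers in Python using Selenium WebDriver before, but this seems to be a different problem. I have looked at some similar questions on Stack Overflow. From those, it seems the solution may be to download the data directly from a JSON file. I looked at the website's source code and identified several JSON files, but on inspection they do not seem to contain the data.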

Does anyone know how to download the data from these charts? In particular, I am interested in this bar chart: .//*[@id='network_download']

Thanks

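Edit: I should add that when I inspect the website with Firebug, I can see that the data may be available in the following format. However, this is obviously not helpful, because it does not contain any labels.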

<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;"> 
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;"> 

Answer


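SVG charts like this tend to be a bit difficult to scrape. The numbers you want are only displayed once you actually hover over the individual elements with the mouse.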

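To get the data you need to:

  1. Find all the dots
  2. For each dot in dots_list, click or hover over (with action chains) the dot
  3. Scrape the values from the tooltip that pops up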

This works for me:

from __future__ import print_function

from pprint import pprint as pp

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def main():
    driver = webdriver.Chrome()

    try:
        driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")

        dots_css = "div#network_download g g.dots_container circle"
        dots_list = driver.find_elements(By.CSS_SELECTOR, dots_css)

        print("Found {0} data points".format(len(dots_list)))

        download_speeds = list()
        for index, _ in enumerate(dots_list, 1):
            # Because this is an SVG chart, and because we need to hover it,
            # it is very likely that the elements will go stale as we do this.
            # For that reason we re-query each dot element right before we click it.
            single_dot_css = dots_css + ":nth-child({0})".format(index)
            dot = driver.find_element(By.CSS_SELECTOR, single_dot_css)
            dot.click()

            # Scrape the text from the popup
            popup_css = "div#network_download div.tooltip"
            popup_text = driver.find_element(By.CSS_SELECTOR, popup_css).text
            pp(popup_text)
            rank, comp_and_country, speed = popup_text.split("\n")
            company, country = comp_and_country.split(" in ")
            speed_dict = {
                "rank": rank.split(" Globally")[0].strip("#"),
                "company": company,
                "country": country,
                "speed": speed.split("Download speed: ")[1]
            }
            download_speeds.append(speed_dict)

            # Hover away from the tooltip so it clears. A fresh ActionChains
            # object is built each time so that no previously queued actions
            # can be replayed.
            hover_elem = driver.find_element(By.ID, "network_download")
            ActionChains(driver).move_to_element(hover_elem).perform()

        pp(download_speeds)

    finally:
        driver.quit()


if __name__ == "__main__":
    main()

Sample output:

(.venv35) ➜ stackoverflow python svg_charts.py 
Found 182 data points 
'#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps' 
'#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps' 
'#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps' 
'#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps' 
'#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps' 
'#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps' 
'#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps' 
'#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps' 
'#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps' 
'#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps' 
'#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps' 
'#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps' 
'#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps' 
<...> 
[{'company': 'SingTel', 
    'country': 'Singapore', 
    'rank': '1', 
    'speed': '40 Mbps'}, 
{'company': 'StarHub', 
    'country': 'Singapore', 
    'rank': '2', 
    'speed': '39 Mbps'}, 
{'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'} 
... 
] 
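
The tooltip strings in the sample output all follow the same three-line shape, so the parsing step could be pulled out into a small function that can be checked without opening a browser. The sketch below is only an illustration: `parse_tooltip` and `TOOLTIP_RE` are hypothetical names, and the pattern is inferred from the sample lines above, so it may need adjusting if the site formats some tooltips differently.

```python
import re

# Pattern inferred from the sample output; this is an assumption about the
# tooltip format, not something the site documents.
TOOLTIP_RE = re.compile(
    r"#(?P<rank>\d+) Globally\n"
    r"(?P<company>.+) in (?P<country>.+)\n"
    r"Download speed: (?P<speed>\d+(?:\.\d+)?) Mbps"
)


def parse_tooltip(text):
    """Return a dict for one tooltip string, or None if it does not match."""
    match = TOOLTIP_RE.fullmatch(text.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "company": match.group("company"),
        "country": match.group("country"),
        "speed_mbps": float(match.group("speed")),
    }


sample = "#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps"
print(parse_tooltip(sample))
```

Returning None for an unexpected string, instead of raising on a failed split, lets the scraping loop skip or log an odd tooltip and carry on with the remaining dots.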

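It should be noted that the values you quoted in the question, the ones in the circle elements, are not particularly useful, because they only describe how to draw the dots in the SVG chart.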
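
To see what the circle markup does and does not carry, the attributes can be read with the standard library alone. This is a rough sketch using two of the elements quoted in the question; `CircleCollector` is a hypothetical helper name. It only recovers a fill colour and pixel coordinates, with no operator, country, or speed label attached.

```python
from html.parser import HTMLParser


class CircleCollector(HTMLParser):
    """Collect (fill, cx, cy) from every <circle> start tag."""

    def __init__(self):
        super().__init__()
        self.circles = []

    def handle_starttag(self, tag, attrs):
        if tag == "circle":
            a = dict(attrs)
            self.circles.append((a["fill"], float(a["cx"]), float(a["cy"])))


# Two elements copied from the question; the tags are unclosed, which
# HTMLParser tolerates.
markup = '''
<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
'''

collector = CircleCollector()
collector.feed(markup)
collector.close()

for fill, cx, cy in collector.circles:
    print(fill, cx, cy)
```

The coordinates are positions on the drawing surface. Without the chart's axis scale, which is not part of these elements, they cannot be turned back into Mbps values, which is why the tooltip text is the more practical source.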