How do I scrape chart data from a website using Python?

I saved the script below to a text file, but using re to extract the data from it still returns nothing. My code is:

file_object = open('source_test_script.txt', mode="r") 
soup = BeautifulSoup(file_object, "html.parser") 
pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL) 
scripts = soup.find("script", text=pattern) 
profile_text = pattern.search(scripts.text).group(1) 
profile = json.loads(profile_text) 

print profile["data"], profile["categories"]

I want to extract the data for this chart from the website. Here is the chart's source code.

<script type="text/javascript"> 
    jQuery(function() { 

    var chart1 = new Highcharts.Chart({ 

      chart: { 
      renderTo: 'chart1', 
       defaultSeriesType: 'column', 
      borderWidth: 2 
      }, 
      title: { 
      text: 'Productions' 
      }, 
      legend: { 
      enabled: false 
      }, 
      xAxis: [{ 
      categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016], 

      }], 
      yAxis: { 
      min: 0, 
      title: { 
      text: 'Productions' 
      } 
      }, 

      series: [{ 
       name: 'Productions', 
       data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36] 
       }] 
     }); 
    }); 

    </script>

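The website has several charts, named "chart1", "chart2" and so on. For each chart I want to extract the following data, i.e. the categories line and the data line: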

categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016] 

data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]

Source

2016-10-05 Ilumtics

I believe you can use something like Selenium for this, for example: http://stackoverflow.com/questions/10455130/can-selenium-web-driver-have-access-to-javascript-global-variables – CasualDemon

Yes, I am using Selenium to parse the HTML content. My code is: [code] req = urllib2.Request(productions_url, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'}) p = urllib2.urlopen(req) soup = BeautifulSoup(p.readlines()[0], 'html.parser') [/code]. My problem is how to extract those 2 specific lines once I have parsed the HTML. – Ilumtics

An HTML parser won't help you, because that is JavaScript. So you will have to parse it yourself. – zvone

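I would use a combination of a regular expression and a YAML parser. The following is quick and dirty - you may need to tweak the regex, but it works with the example: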

import re
import sys
import yaml

# capture the chart name and the object literal passed to Highcharts.Chart(...)
chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);$',
                           re.MULTILINE | re.DOTALL)

# read the JavaScript source from standard input
script = sys.stdin.read()

m = chart_matcher.findall(script)

for name, data in m:
    print(name)
    try:
        # the object literal is parsed as a YAML flow mapping
        chart = yaml.safe_load(data)
        print("categories:", chart['xAxis'][0]['categories'])
        print("data:", chart['series'][0]['data'])
    except Exception as e:
        print(e)

This requires the YAML library (pip install PyYAML), and you should use BeautifulSoup to extract the right <script> tag before passing it to the regex.

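Edit - full example

Sorry, I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, and then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because this isn't valid JSON, but simple JavaScript object declarations (i.e. ones without functions) are a subset of YAML.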

from bs4 import BeautifulSoup
import yaml
import re

with open('source_test_script.txt', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories:", charts[name]['xAxis'][0]['categories'])
    print("data:", charts[name]['series'][0]['data'])
    print()

Output:

## chart1 ## 
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016] 
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]

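Note that I had to tweak the regex so that it copes with the unicode output from BeautifulSoup and with the whitespace - in my original example I just piped the source straight into the regex.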
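To illustrate the whitespace point, here is a small standard-library check. The `indented` string is a made-up snippet shaped like the question's script, not the real page. With `re.MULTILINE`, `^var` only matches when `var` is the very first thing on a line, so an indented declaration is skipped, while the unanchored pattern still finds it:

```python
import re

FLAGS = re.MULTILINE | re.DOTALL

# Anchored to the start and end of a line, as in the first example.
anchored = re.compile(
    r"^var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);$", FLAGS)
# No anchors, as in the full example, so indentation does not matter.
relaxed = re.compile(
    r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", FLAGS)

# Hypothetical input: indented, with a trailing space after the semicolon.
indented = (
    "    var chart1 = new Highcharts.Chart({\n"
    "      title: { text: 'Productions' }\n"
    "    }); \n"
)

anchored_match = anchored.search(indented)
relaxed_match = relaxed.search(indented)

print(anchored_match)          # None: every line starts with spaces, not "var"
print(relaxed_match.group(1))  # chart1
```

This is probably also part of why the code in the question finds nothing: its pattern uses the same `^` and `$` anchors.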

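EDIT 2 - without YAML

Given that the JavaScript looks to be partially generated, the best you can hope for is to grab the individual lines - not elegant, but it will probably work for you.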

from bs4 import BeautifulSoup
import json
import re

with open('citec.repec.org_p_c_pcl20.html', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

for tag in soup.find_all('script'):

    script = tag.text

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        # start a fresh dict per chart so charts in one <script> don't share values
        values = {}
        for line in obj_declaration.split('\n'):
            # drop surrounding whitespace and trailing separators
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        # the array itself is valid JSON even though the object is not
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))

        charts[name] = values

print(charts)

Note that it fails on chart7, because that chart references another variable.
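If a chart refers to another variable (as chart7 does) or has a trailing comma in an array, a variant that quietly skips what it cannot parse may be more convenient. The following is only a sketch using the standard library: `extract_charts`, `CHART_RE`, `FIELD_RE`, `TRAILING_COMMA_RE` and the `SAMPLE` text are names and data made up for illustration, and it assumes `categories` and `data` are flat arrays of literals (no nested `[x, y]` pairs, no functions inside the chart object).

```python
import json
import re

# "var chartN = new Highcharts.Chart({...});" -> (name, object literal)
CHART_RE = re.compile(
    r"var\s+(chart\d+)\s*=\s*new\s+Highcharts\.Chart\(\s*(\{.*?\})\s*\)\s*;",
    re.DOTALL)
# "categories: [...]" or "data: [...]" with a flat (non-nested) array
FIELD_RE = re.compile(r"\b(categories|data)\s*:\s*(\[[^\]]*\])")
# A trailing comma before "]" is legal JavaScript but not legal JSON
TRAILING_COMMA_RE = re.compile(r",\s*\]")


def extract_charts(source):
    """Map each chart name to the categories/data arrays that could be parsed."""
    charts = {}
    for name, body in CHART_RE.findall(source):
        fields = {}
        for field, raw in FIELD_RE.findall(body):
            cleaned = TRAILING_COMMA_RE.sub("]", raw)
            try:
                # keep the first occurrence of each field
                fields.setdefault(field, json.loads(cleaned))
            except ValueError:
                pass  # not a plain literal array; skip it
        charts[name] = fields
    return charts


# Hypothetical input: chart2's data refers to a variable, so it is skipped.
SAMPLE = """
var chart1 = new Highcharts.Chart({
  xAxis: [{ categories: [1999,2000,2001,] }],
  series: [{ name: 'Productions', data: [1,1,0] }]
});
var chart2 = new Highcharts.Chart({
  xAxis: [{ categories: [2015,2016] }],
  series: [{ name: 'Citations', data: citationData }]
});
"""

result = extract_charts(SAMPLE)
print(result)
```

A chart whose field could not be read simply lacks that key in the result, so the caller can check for missing keys instead of catching exceptions.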

Source

2016-10-05 05:53:05

So I saved the script code below to a text file, but using re to extract the data still doesn't return anything. My code is: file_object = open('source_test_script.txt', mode="r") soup = BeautifulSoup(file_object, "html.parser") pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL) scripts = soup.find("script", text=pattern) profile_text = pattern.search(scripts.text).group(1) profile = json.loads(profile_text) print profile["data"], profile["categories"] – Ilumtics

I tried the code and got: "Failed to parse chart1: while parsing a flow mapping in "<unicode string>", line 29, column 16: tooltip: { ^ expected ',' or '}', but got '{'" – Ilumtics

You may still want to use 'yaml.safe_load' rather than 'json.loads', because it is more tolerant of malformed input (chart3, for example, has a trailing comma in an array) –

I tried the code as you suggested, but I keep getting this:

"Failed to parse chart1: while parsing a flow mapping 
    in "<unicode string>", line 29, column 16: 
      tooltip: { 
       ^
expected ',' or '}', but got '{'"

My whole code:

import json 
import urllib2 
import re 
import sys 
import yaml 
from selenium.webdriver import Chrome as Browser 
from bs4 import BeautifulSoup 


URL = ("http://citec.repec.org/p/c/pcl20.html") 
browser=Browser() 
browser.get(URL) 
source = browser.page_source 

#file_object = open('source_test.txt', mode="r") 
soup = BeautifulSoup(source, "html.parser") 

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE) 

charts = {} 

# find every <script> tag in the source using beautifulsoup 
for tag in soup.find_all('script'): 

    # tabs are special in yaml so remove them first 
    script = tag.text.replace('\t', '') 

    # find each object declaration 
    for name, obj_declaration in pattern.findall(script): 
     try: 
      # parse the javascript declaration 
      charts[name] = yaml.safe_load(obj_declaration) 
     except Exception, e: 
      print "Failed to parse {0}: {1}".format(name, e) 

# extract the data you want 
for name in charts: 
    print "## {0} ##".format(name); 
    print "categories:", charts[name]['xAxis'][0]['categories'] 
    print "data:", charts[name]['series'][0]['data'] 
    print

Source

2016-10-06 03:32:31 Ilumtics

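The JavaScript on that page generates the data dynamically - I'm afraid you have no choice but to find a way to execute it in a JavaScript engine (Selenium?) and then find a way to inspect the window properties. I'm not familiar with Selenium, so I'm afraid I can't help in that case. –

Yes, I think that is the problem. But even after saving the script to a text file, I get the same error with the code above. I keep getting: "Failed to parse chart1: while parsing a flow mapping in "<unicode string>", line 29, column 16: tooltip: { ^ expected ',' or '}', but got '{'" – Ilumtics

Is there a way for me to extract this data from the saved source using only regular expressions, for example? I saved the source with Selenium like this: browser = Browser() browser.get(URL) source = browser.page_source with open("source_test.txt", "wb") as outfile: outfile.write(source.encode('utf-8')) – Ilumtics

Another approach is to use the Highcharts JavaScript library the way you would in the browser console, and pull the data out with Selenium.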

import time 
from selenium import webdriver 

website = "" 

driver = webdriver.Firefox() 
driver.get(website) 
time.sleep(5) 

temp = driver.execute_script('return window.Highcharts.charts[0]' 
          '.series[0].options.data') 
data = [item[1] for item in temp] 
print(data)

Depending on which chart and series you are trying to pull, the details may differ slightly in your case.
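For instance, Highcharts accepts series data as plain numbers, as `[x, y]` pairs, or as point objects with a `y` key, so `item[1]` only fits the pair form (the chart in the question uses plain numbers). A small helper along these lines could normalise whatever `execute_script` returns. This is a sketch: `y_values` is a made-up name, and it assumes that any list-shaped point is exactly an `[x, y]` pair.

```python
def y_values(points):
    """Return the y value of each point in a Highcharts-style data list."""
    values = []
    for point in points:
        if isinstance(point, dict):
            # point object, e.g. {"name": "a", "y": 4}
            values.append(point.get("y"))
        elif isinstance(point, (list, tuple)):
            # [x, y] pair
            values.append(point[1])
        else:
            # plain number (or None for a missing point)
            values.append(point)
    return values


print(y_values([1, 1, 0]))                     # [1, 1, 0]
print(y_values([[1999, 1], [2000, 6]]))        # [1, 6]
print(y_values([{"name": "a", "y": 4}]))       # [4]
```

With it, the list comprehension above would become `data = y_values(temp)`.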

Source

2017-07-29 03:13:39

This should be the accepted answer! Simpler and more intuitive. – ahlexander
