2016-10-05 61 views
0

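Edit: How do I scrape a chart from a website using Python?

So I saved the script code below to a text file, but using re to extract the data still returns nothing. My code is: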

file_object = open('source_test_script.txt', mode="r") 
soup = BeautifulSoup(file_object, "html.parser") 
pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL) 
scripts = soup.find("script", text=pattern) 
profile_text = pattern.search(scripts.text).group(1) 
profile = json.loads(profile_text) 

print profile["data"], profile["categories"] 

I want to extract the data for this chart from a website. Below is the chart's source code.

<script type="text/javascript"> 
    jQuery(function() { 

    var chart1 = new Highcharts.Chart({ 

      chart: { 
      renderTo: 'chart1', 
       defaultSeriesType: 'column', 
      borderWidth: 2 
      }, 
      title: { 
      text: 'Productions' 
      }, 
      legend: { 
      enabled: false 
      }, 
      xAxis: [{ 
      categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016], 

      }], 
      yAxis: { 
      min: 0, 
      title: { 
      text: 'Productions' 
      } 
      }, 

      series: [{ 
       name: 'Productions', 
       data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36] 
       }] 
     }); 
    }); 

    </script> 

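There are several charts on the site, named "chart1", "chart2", and so on. For each chart I want to extract the following data, the categories line and the data line: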

categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016] 

data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36] 
+0

I believe you can use Selenium for something like this, for example: http://stackoverflow.com/questions/10455130/can-selenium-web-driver-have-access-to-javascript-global-variables – CasualDemon

+0

Yeah, I'm using Selenium to parse the HTML content. My code is: req = urllib2.Request(productions_url, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'}) p = urllib2.urlopen(req) soup = BeautifulSoup(p.readlines()[0], 'html.parser'). My problem is how to extract these two specific lines once I have parsed the HTML. – Ilumtics

+0

An HTML parser won't help you, because that is JavaScript. So you have to parse it yourself. – zvone

Answers

0

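I would use a combination of a regular expression and a YAML parser. A quick and dirty version is below. You may need to tweak the regex, but it works with the example: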

import re
import sys
import yaml

# capture the chart name and the object literal passed to Highcharts.Chart(...)
chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);$',
                           re.MULTILINE | re.DOTALL)

# read the script source from standard input
script = sys.stdin.read()

m = chart_matcher.findall(script)

for name, data in m:
    print(name)
    try:
        # parse the JavaScript object literal as YAML
        chart = yaml.safe_load(data)
        print("categories:", chart['xAxis'][0]['categories'])
        print("data:", chart['series'][0]['data'])
    except Exception as e:
        print(e)

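This requires the YAML library (pip install PyYAML). You should use BeautifulSoup to extract the correct <script> tag before passing it to the regex.

Edit - full example

Sorry, I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because the declaration isn't valid JSON, but simple JavaScript object declarations (i.e. without functions) are a subset of YAML.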
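To see why the json library rejects the declaration, here is a minimal sketch using only the standard library (Python 3). The `snippet` string is a shortened, made-up stand-in for the real chart options, and `is_valid_json` is a helper name introduced just for this illustration.

```python
import json

# Shortened stand-in for a Highcharts options object (illustrative only).
snippet = "{ name: 'Productions', data: [1,1,0,1,6] }"


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# Unquoted keys and single-quoted strings are legal JavaScript but not JSON.
print(is_valid_json(snippet))
# The bare numeric array on its own is valid JSON and can be parsed directly.
print(is_valid_json("[1,1,0,1,6]"))
```

This is also why grabbing only the array lines and handing them to json.loads can work even though the whole object cannot be parsed as JSON.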

from bs4 import BeautifulSoup
import yaml
import re

with open('source_test_script.txt', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories:", charts[name]['xAxis'][0]['categories'])
    print("data:", charts[name]['series'][0]['data'])
    print()

Output:

## chart1 ## 
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016] 
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36] 

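Note that I had to tweak the regex to make it handle the unicode output from BeautifulSoup and the whitespace. In my original example I just piped the source directly into the regex.

EDIT 2 - without YAML

Given that the JavaScript looks to be partially generated, the best you can hope for is to grab the lines. It is not elegant, but it will probably work for you.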
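The effect of dropping the anchors can be checked in isolation. This is a small sketch (Python 3, standard library only) with a shortened, made-up `script` string: the anchored pattern needs `var` at the very start of a line, so an indented declaration is never matched.

```python
import re

FLAGS = re.MULTILINE | re.DOTALL

# Original pattern: anchored to the start and end of a line.
anchored = re.compile(
    r"^var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);$", FLAGS)
# Tweaked pattern: no anchors, so leading indentation is tolerated.
unanchored = re.compile(
    r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", FLAGS)

# Indented declaration, as inside the jQuery wrapper (shortened stand-in).
script = """
    jQuery(function() {
    var chart1 = new Highcharts.Chart({
      series: [{ data: [1,1,0] }]
     });
    });
"""

print(anchored.findall(script))                     # []
print([n for n, _ in unanchored.findall(script)])   # ['chart1']
```

Trailing spaces after `);` would defeat the `$` anchor in the same way.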

from bs4 import BeautifulSoup
import json
import re

with open('citec.repec.org_p_c_pcl20.html', mode="r") as file_object:
    soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

for tag in soup.find_all('script'):

    script = tag.text

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        # start a fresh dict per chart so charts in one tag don't share values
        values = {}
        for line in obj_declaration.split('\n'):
            # drop surrounding whitespace and trailing punctuation
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))

        charts[name] = values

print(charts)

Note that it fails on chart7, because that chart references another variable.
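A variation on the same idea, as a rough sketch only: instead of splitting into lines, match the `categories` and `data` arrays with a second regex and strip a trailing comma before calling json.loads. It uses only the standard library (Python 3). The names `extract_charts`, `FIELD_RE` and `TRAILING_COMMA_RE` and the `sample` string are made up for this illustration, and it assumes the arrays are flat (no nested brackets).

```python
import json
import re

CHART_RE = re.compile(
    r"var (chart[0-9]+) = new Highcharts\.Chart\(({.*?})\);", re.DOTALL)
# A flat (non-nested) array literal following 'categories:' or 'data:'.
FIELD_RE = re.compile(r"\b(categories|data)\s*:\s*(\[[^\[\]]*\])")
# A comma directly before the closing bracket, which JSON does not allow.
TRAILING_COMMA_RE = re.compile(r",\s*\]")


def extract_charts(script):
    """Map each chart name to its 'categories' and 'data' lists."""
    charts = {}
    for name, declaration in CHART_RE.findall(script):
        values = {}  # fresh dict per chart
        for field, array in FIELD_RE.findall(declaration):
            array = TRAILING_COMMA_RE.sub("]", array)
            try:
                parsed = json.loads(array)
            except ValueError:
                continue  # e.g. single-quoted strings; skip this array
            # keep only the first occurrence of each field
            values.setdefault(field, parsed)
        charts[name] = values
    return charts


# Shortened stand-in for one chart declaration, with trailing commas.
sample = """
var chart1 = new Highcharts.Chart({
  xAxis: [{ categories: [1999,2000,2001], }],
  series: [{ name: 'Productions', data: [1,1,0,] }]
});
"""
print(extract_charts(sample))
```

Only the first `data` array of each chart is kept, so a chart with several series would need the `setdefault` line changed to collect a list. An array given as a variable name rather than a literal is simply not matched.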

+0

So I saved the script code below to a text file, but using re to extract the data still doesn't return anything. My code is: file_object = open('source_test_script.txt', mode="r") soup = BeautifulSoup(file_object, "html.parser") pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL) scripts = soup.find("script", text=pattern) profile_text = pattern.search(scripts.text).group(1) profile = json.loads(profile_text) print profile["data"], profile["categories"] – Ilumtics

+0

I tried the code and got: "Failed to parse chart1: while parsing a flow mapping in "<unicode string>", line 29, column 16: tooltip: { ^ expected ',' or '}', but got '{'" – Ilumtics

+0

You may still want to use 'yaml.safe_load' rather than 'json.loads', because it is more forgiving of bad input (chart3, for example, has a trailing comma in an array) –

0

I tried the code as you said, but I keep getting this:

"Failed to parse chart1: while parsing a flow mapping 
    in "<unicode string>", line 29, column 16: 
      tooltip: { 
       ^
expected ',' or '}', but got '{'" 

My entire code:

import json 
import urllib2 
import re 
import sys 
import yaml 
from selenium.webdriver import Chrome as Browser 
from bs4 import BeautifulSoup 


URL = ("http://citec.repec.org/p/c/pcl20.html") 
browser=Browser() 
browser.get(URL) 
source = browser.page_source 

#file_object = open('source_test.txt', mode="r") 
soup = BeautifulSoup(source, "html.parser") 

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE) 

charts = {} 

# find every <script> tag in the source using beautifulsoup 
for tag in soup.find_all('script'): 

    # tabs are special in yaml so remove them first 
    script = tag.text.replace('\t', '') 

    # find each object declaration 
    for name, obj_declaration in pattern.findall(script): 
     try: 
      # parse the javascript declaration 
      charts[name] = yaml.safe_load(obj_declaration) 
     except Exception, e: 
      print "Failed to parse {0}: {1}".format(name, e) 

# extract the data you want 
for name in charts: 
    print "## {0} ##".format(name); 
    print "categories:", charts[name]['xAxis'][0]['categories'] 
    print "data:", charts[name]['series'][0]['data'] 
    print 
+0

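The JavaScript on that page generates the data dynamically. I'm afraid you have no choice but to find a way to execute it in a JavaScript engine (Selenium?) and then find a way to inspect the window properties. I'm not familiar with Selenium, so I'm afraid I can't help in this case. –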

+0

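Yes, I think that is the problem. But even after saving the script to a text file, I get the same error as with the code above. I keep getting: "Failed to parse chart1: while parsing a flow mapping in "<unicode string>", line 29, column 16: tooltip: { ^ expected ',' or '}', but got '{'" – Ilumtics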

+0

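Is there a way for me to use only regular expressions, for example, to extract this data from the saved source? I used Selenium to save the source like this: browser = Browser() browser.get(URL) source = browser.page_source with open("source_test.txt", "wb") as outfile: outfile.write(source.encode('utf-8')) – Ilumtics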

3

Another approach is to use the Highcharts JavaScript library the way you would in the browser console, and pull the data out with Selenium.

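import time
from selenium import webdriver

website = ""

driver = webdriver.Firefox()
driver.get(website)
# give the page's JavaScript time to render the charts
time.sleep(5)

# read the first series of the first chart from the Highcharts global
temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
# assumes each point is an [x, y] pair; keep only the y values
data = [item[1] for item in temp]
print(data)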

Depending on which chart and series you are trying to pull, your case may be slightly different.
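For a page with several charts, the same idea might be extended roughly as follows. This is a sketch under assumptions, not a tested recipe for any particular site: it assumes the page exposes the global `Highcharts.charts` array, that each chart has at least one x axis and one series, and that points are either plain numbers or [x, y] pairs. `COLLECT_JS`, `y_values` and `collect_charts` are names made up for this example.

```python
# JavaScript run inside the page. Assumes the global Highcharts.charts array
# exists; destroyed charts show up there as undefined, hence filter(Boolean).
COLLECT_JS = """
return (window.Highcharts ? Highcharts.charts : [])
  .filter(Boolean)
  .map(function (chart) {
    return {
      id: chart.renderTo.id,
      categories: chart.xAxis[0].categories,
      data: chart.series[0].options.data
    };
  });
"""


def y_values(points):
    """Accept plain numbers or [x, y] pairs and return only the y values."""
    return [p[1] if isinstance(p, (list, tuple)) else p for p in points]


def collect_charts(driver):
    """Return {chart id: {'categories': ..., 'data': ...}} for the loaded page.

    `driver` is any object that provides Selenium's execute_script method.
    """
    result = {}
    for chart in driver.execute_script(COLLECT_JS):
        result[chart["id"]] = {
            "categories": chart.get("categories"),
            "data": y_values(chart.get("data") or []),
        }
    return result


# Usage (requires Selenium and a browser driver; not run here):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get(website)
# print(collect_charts(driver))
```

Handling both point shapes matters here because the charts in the question list their data as plain numbers, while other Highcharts pages use [x, y] pairs.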

+0

This should be the accepted answer! Simpler and more intuitive. – ahlexander