2013-11-14 75 views
2

How can I scrape the fund prices? How do I use pandas' read_html and the requests library to read this table?

http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U

This doesn't work, but how do I fix it:

import pandas as pd 
import requests 
import re 
url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U' 
tables = pd.read_html(requests.get(url).text, attrs={"class":re.compile("fundPriceCell\d+")}) 

That's pretty messy HTML; I think you'll need to walk the XML tree to grab the right values. The attrs class should go on the table, not on the cells (I think)... –


Sorry, does that mean I have to import BeautifulSoup4? Any suggestions? –


Disclaimer: I may be wrong, and there may be a simple way to get read_html to grab this. If not, I'd imagine something like this: http://stackoverflow.com/a/16993660/1240268, but it's a bit messy/awkward. –
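On the read_html point: the attrs filter is matched against attributes of the &lt;table&gt; element itself, not of individual &lt;td&gt; cells, which is why the original attempt finds nothing. A minimal sketch against a made-up inline table (the fundPriceTable class is hypothetical; the real page's tables carry no such attribute):

```python
from io import StringIO
import pandas as pd

# Hypothetical minimal page: read_html's `attrs` filter must match
# attributes on the <table> element, not on the <td> cells inside it.
html = """
<table class="fundPriceTable">
  <tr><td class="fundPriceCell1">04/11/2013</td>
      <td class="fundPriceCell1">10.00</td>
      <td class="fundPriceCell1">10.50</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html), attrs={"class": "fundPriceTable"})
print(tables[0])
```

So read_html could work here only if the target table had a distinguishing attribute of its own; the real page's cells are classed but its tables are not.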

Answers

2

I like lxml for parsing and querying HTML. Here's what I came up with:

import requests 
from lxml import etree 

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U' 
doc = requests.get(url) 
tree = etree.HTML(doc.content) 

row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]' 

rows = tree.xpath(row_xpath) 

for row in rows: 
    (date_string, v1, v2) = (td.text for td in row.getchildren()) 
    print("%s - %s - %s" % (date_string, v1, v2)) 
1

My solution is similar to yours:

import pandas as pd 
import requests 
from lxml import etree 

url = "http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U" 
r = requests.get(url) 
html = etree.HTML(r.content) 
data = html.xpath('//table//table//table//table//td[@class="fundPriceCell1" or @class="fundPriceCell2"]//text()') 

if len(data) % 3 == 0: 
    df = pd.DataFrame([data[i:i+3] for i in range(0, len(data), 3)], columns = ['date', 'bid', 'ask']) 
    df = df.set_index('date') 
    df.index = pd.to_datetime(df.index, format = '%d/%m/%Y') 
    df.sort_index(inplace = True) 