2013-11-14 75 views
2

How can I scrape the fund prices? How do I use pandas' read_html and the requests library to read this table?

http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U

This doesn't work, but how do I fix it:

import pandas as pd 
import requests 
import re 
url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U' 
tables = pd.read_html(requests.get(url).text, attrs={"class":re.compile("fundPriceCell\d+")}) 

That's pretty messy HTML; I think you'll need to walk the XML tree to grab the right values. The attrs class should go on the table, not on the cells (I think)... –


Sorry, does that mean I have to import BeautifulSoup4? Any suggestions? –


Disclaimer: I may be wrong, and there may be a simple way to get read_html to grab this. If not, I'd imagine something like this: http://stackoverflow.com/a/16993660/1240268, but it's a bit messy/awkward. –
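On the read_html point: the attrs filter is matched against attributes of the &lt;table&gt; element itself, not of individual &lt;td&gt; cells, which is why the original attempt finds nothing. A minimal sketch against a made-up inline table (the fundPriceTable class is hypothetical; the real page's tables carry no such attribute):

```python
from io import StringIO
import pandas as pd

# Hypothetical minimal page: read_html's `attrs` filter must match
# attributes on the <table> element, not on the <td> cells inside it.
html = """
<table class="fundPriceTable">
  <tr><td class="fundPriceCell1">04/11/2013</td>
      <td class="fundPriceCell1">10.00</td>
      <td class="fundPriceCell1">10.50</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html), attrs={"class": "fundPriceTable"})
print(tables[0])
```

So read_html could work here only if the target table had a distinguishing attribute of its own; the real page's cells are classed but its tables are not.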

Answers

2

I like lxml for parsing and querying HTML. Here's what I came up with:

import requests 
from lxml import etree 

url = 'http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U' 
doc = requests.get(url) 
tree = etree.HTML(doc.content) 

row_xpath = '//tr[contains(td[1]/@class, "fundPriceCell")]' 

rows = tree.xpath(row_xpath) 

for row in rows: 
    (date_string, v1, v2) = (td.text for td in row.getchildren()) 
    print("%s - %s - %s" % (date_string, v1, v2)) 
1

My solution is similar to yours:

import pandas as pd 
import requests 
from lxml import etree 

url = "http://www.prudential.com.hk/PruServlet?module=fund&purpose=searchHistFund&fundCd=JAS_U" 
r = requests.get(url) 
html = etree.HTML(r.content) 
data = html.xpath('//table//table//table//table//td[@class="fundPriceCell1" or @class="fundPriceCell2"]//text()') 

if len(data) % 3 == 0: 
    df = pd.DataFrame([data[i:i+3] for i in range(0, len(data), 3)], columns = ['date', 'bid', 'ask']) 
    df = df.set_index('date') 
    df.index = pd.to_datetime(df.index, format = '%d/%m/%Y') 
    df.sort_index(inplace = True) 