First select the anchor by its href, then find the previous six tds:
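from bs4 import BeautifulSoup
import requests

url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# Quote the attribute value: it contains characters that are not valid in an unquoted CSS identifier.
anchor = soup.select_one('a[href="./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"]')
# Collect the text of the six td.DataB cells that precede the anchor.
data = [td.text for td in anchor.find_all_previous("td", "DataB", limit=6)]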
If we run the code, you can see that we get the text from the previous six tds:
In [1]: from bs4 import BeautifulSoup
...: import requests
...: url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'
...: soup = BeautifulSoup(requests.get(url).content,"html.parser")
...: anchor = soup.select_one('a[href="./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"]')
...: data = [td.text for td in anchor.find_all_previous("td", "DataB", limit=6)]
...:
In [2]: data
Out[2]: ['934', '919', '957', '951', '928', '139']
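That is not quite right, because the tds carry two different classes, Current2 and DataB. Instead we can start from the anchor's parent, which is itself a td, and take the previous tds regardless of class: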
In [5]: from bs4 import BeautifulSoup
...: import requests
...: url = 'http://www.eia.gov/dnav/pet/pet_sum_sndw_dcus_nus_w.htm'
...: soup = BeautifulSoup(requests.get(url).content,"html.parser")
...: anchor_td = soup.find("a", href="./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W").parent
...: data = [td.text for td in anchor_td.find_all_previous("td", limit=6)]
...:
In [6]: data
Out[6]: ['936', '934', '919', '957', '951', '928']
Now we get exactly what we want.
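To see why the class filter drops a cell, here is a minimal offline sketch that needs no network access. The HTML is a made-up two-row fragment that only mimics the page's layout: the Current2 and DataB class names come from the page, but the cell values, the shortened hrefs and the three-column rows are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the newest value in each row has class Current2,
# the older values have class DataB, and the last cell holds the history link.
html = """
<table>
  <tr>
    <td class="DataB">139</td>
    <td class="Current2">140</td>
    <td><a href="./hist/other">History</a></td>
  </tr>
  <tr>
    <td class="DataB">928</td>
    <td class="DataB">934</td>
    <td class="Current2">936</td>
    <td><a href="./hist/target">History</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("a", href="./hist/target")

# Filtering on class DataB skips the Current2 cell and spills into the previous row.
by_class = [td.text for td in anchor.find_all_previous("td", "DataB", limit=3)]

# Starting from the anchor's parent td and accepting any td keeps the row intact.
by_parent = [td.text for td in anchor.parent.find_all_previous("td", limit=3)]

print(by_class)   # ['934', '928', '139']
print(by_parent)  # ['936', '934', '928']
```

Starting from the parent td rather than the anchor matters here: the anchor's own td counts as a "previous" element of the anchor, so searching from the anchor without a class filter would return the link cell first.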
Lastly, we can get the anchor's grandparent, i.e. the row that contains those tds, and then use select with both class names:
href = "./hist/LeafHandler.ashx?n=PET&s=W_EPOOXE_YIR_NUS_MBBLD&f=W"
grandparent = soup.find("a", href=href).parent.parent
data = [td.text for td in grandparent.select("td.Current2,td.DataB")]
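This gives us the same data, although select returns the cells in document order, so the list comes out reversed relative to find_all_previous.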
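The grandparent variant can be tried offline in the same way. This is only a sketch on a made-up single-row fragment (the values and the shortened href are assumptions, not the real page); it shows the element the two .parent hops land on and the order in which select returns the cells.

```python
from bs4 import BeautifulSoup

# Hypothetical row mimicking the page: older values are DataB, the newest is Current2.
html = """
<table>
  <tr>
    <td class="DataB">928</td>
    <td class="DataB">934</td>
    <td class="Current2">936</td>
    <td><a href="./hist/target">History</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("a", href="./hist/target").parent.parent  # a -> td -> tr

# A comma-separated selector matches either class; results are in document order.
in_doc_order = [td.text for td in row.select("td.Current2,td.DataB")]

# Reverse the list if the newest-first order of find_all_previous is wanted.
newest_first = in_doc_order[::-1]

print(row.name)       # tr
print(in_doc_order)   # ['928', '934', '936']
print(newest_first)   # ['936', '934', '928']
```

Because the search is scoped to the row, no limit is needed and cells from neighbouring rows cannot leak in.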
It looks like you can get the page's data in .xls format: http://www.eia.gov/dnav/pet/xls/PET_SUM_SNDW_DCUS_NUS_W.xls. That might be easier to parse? – Mono
Thanks. I saw that, but I was just curious whether anyone had tried something like this in Python. – judabomber
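You can also [parse .xls in Python](http://stackoverflow.com/questions/2942889/reading-parsing-excel-xls-files-with-python). But if you want to do it with BeautifulSoup... I assume you are trying to extract all the data from the HTML table? Or is it just the row associated with a particular href? – Mono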