Using pandas to get multiple tables from a webpage

I am using pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014

To get the data, I write:

import pandas as pd
dfs = pd.read_html(url)

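The data looks good and is parsed correctly, except that it only picks up data starting from row 40. This seems to be a problem with the table being split into separate sections, which keeps pandas from getting all of the information.

How can I get pandas to fetch all the data from all the tables on that page?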

Source

2017-02-14 user7012893

The HTML of the page you posted has multiple <thead> and <tbody> tags, which badly confuses pandas.read_html.

Following this SO thread, you can manually unwrap those tags:
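As a quick way to check for this kind of markup before blaming the parser, the sketch below counts the table section tags in an HTML string using only the standard library. The class name SectionTagCounter and the sample HTML are made up for illustration; the sample is not taken from kenpom.com, it only imitates a table whose header repeats part-way down.

```python
from collections import Counter
from html.parser import HTMLParser


class SectionTagCounter(HTMLParser):
    """Count opening <table>, <thead> and <tbody> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in ("table", "thead", "tbody"):
            self.counts[tag] += 1


# Hypothetical sample: one table with two header/body sections.
sample_html = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>A</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>2</td><td>B</td></tr></tbody>
</table>
"""

counter = SectionTagCounter()
counter.feed(sample_html)
print(dict(counter.counts))  # {'table': 1, 'thead': 2, 'tbody': 2}
```

If a single table reports more than one thead or tbody, it is probably worth flattening those sections before handing the markup to pandas.read_html.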

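import urllib.request
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

html_table = urllib.request.urlopen(url).read()

# fix HTML
soup = BeautifulSoup(html_table, "html.parser")
# warning: the id "ratings-table" is specific to this page
for table in soup.find_all(attrs={'id': 'ratings-table'}):
    # iterate over a copy, because unwrap() modifies the tree in place
    for c in list(table.children):
        if c.name in ['tbody', 'thead']:
            c.unwrap()

# wrap in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(soup)), flavor="bs4")
len(df[0])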

This returns 369.
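To see what unwrap() does in isolation, here is a minimal sketch on a small, made-up table fragment. The HTML string and the id "demo" are hypothetical and only imitate the repeated sections described above; it assumes Beautiful Soup is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with repeated <thead>/<tbody> sections.
fragment = (
    '<table id="demo">'
    "<thead><tr><th>Rank</th></tr></thead>"
    "<tbody><tr><td>1</td></tr></tbody>"
    "<thead><tr><th>Rank</th></tr></thead>"
    "<tbody><tr><td>2</td></tr></tbody>"
    "</table>"
)

soup = BeautifulSoup(fragment, "html.parser")
table = soup.find("table", id="demo")

# Iterate over a copy, because unwrap() edits the tree in place.
for child in list(table.children):
    if child.name in ("thead", "tbody"):
        child.unwrap()  # remove the wrapper tag, keep its rows

# Every <tr> is now a direct child of <table>.
row_count = len(table.find_all("tr", recursive=False))
remaining = table.find_all(["thead", "tbody"])
print(row_count, len(remaining))  # 4 0
```

After unwrapping, the table is one flat run of rows, so a parser no longer has separate sections to treat as separate pieces.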

Source

2017-02-14 13:04:46 tworec
