Accessing HTML tables with rvest

I want to scrape some NBA data. Here is what I have so far, and it works perfectly:

install.packages('rvest') 
library(rvest) 

url = "https://www.basketball-reference.com/boxscores/201710180BOS.html" 
webpage = read_html(url) 
table = html_nodes(webpage, 'table') 
data = html_table(table) 

away = data[[1]] 
home = data[[3]] 

colnames(away) = away[1,] #set appropriate column names 
colnames(home) = home[1,] 

away = away[away$MP != "MP",] #remove rows that are just column names 
home = home[home$MP != "MP",]

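The problem is that these tables do not include the team names, which are important. To get them, I thought I would scrape the Four Factors table on the page, but rvest does not seem to recognize it as a table. The div containing the Four Factors table is: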

<div class="overthrow table_container" id="div_four_factors">

and the table itself is:

<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">

This made me think I could access the table with something along the lines of

table = html_nodes(webpage,'#div_four_factors')

But this does not work: I just get an empty list. How can I access the Four Factors table?

Source

2017-10-20 Daniel

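I am by no means an HTML expert, but it appears that the table you are interested in is commented out in the page source, and the comment is then replaced with the real table at some point before rendering. That would explain why rvest, which only sees the raw source, returns an empty list.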

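If we assume the home team is always listed second, we can rely on position and scrape a different element on the page for the team names: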

table = html_nodes(webpage,'#bottom_nav_container') 
teams <- html_text(table[1]) %>% 
    stringr::str_split("Schedule\n") 

away$team <- trimws(teams[[1]][1]) 
home$team <- trimws(teams[[1]][2])

Obviously not the cleanest solution, but such is life in the world of web scraping.
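If you would rather read the commented-out table directly, one possible approach is to pull the text out of the HTML comment nodes and parse that text as HTML in its own right. The sketch below is only an illustration: it runs against a small inline HTML string rather than the live page, and the team codes (AWY, HOM), the Pace values, and the simplified table markup are all made up. On the real page the table has an extra over-header row, so the result would probably need the same kind of header cleanup as in the question.

```r
library(rvest)

# Hypothetical stand-in for the real page: the table sits inside an HTML comment.
# Team codes and numbers are invented for illustration only.
html <- '<html><body>
<div>
<!--
<div class="overthrow table_container" id="div_four_factors">
<table id="four_factors">
<thead><tr><th>Team</th><th>Pace</th></tr></thead>
<tbody>
<tr><td>AWY</td><td>98.5</td></tr>
<tr><td>HOM</td><td>98.5</td></tr>
</tbody>
</table>
</div>
-->
</div>
</body></html>'

webpage <- read_html(html)

# Collect the text of every comment node on the page and join it into one string.
comment_text <- webpage %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "")

# Parse the comment text as HTML and pick out the table by its id.
four_factors <- read_html(comment_text) %>%
  html_node("#four_factors") %>%
  html_table()

# Still assumes the away team is listed first and the home team second.
away_team <- as.character(four_factors$Team[1])
home_team <- as.character(four_factors$Team[2])
```

With the live page you would replace the inline string with read_html(url) and then set away$team and home$team from away_team and home_team.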

Source

2018-02-21 01:01:05 Stedy
