readHTMLTable in R - skipping NULL values

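I am trying to use the R function readHTMLTable to collect data from the online database at www.racingpost.com. I have a CSV file with 30,000 unique IDs that identify individual horses. Unfortunately, a small number of these IDs cause readHTMLTable to return the error: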

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’

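My question is: is it possible to set up a wrapper function that skips the IDs that return NULL values but continues reading the remaining HTML tables? Reading stops at each NULL value.

What I have tried so far is this: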

ids = c(896119, 766254, 790946, 556341, 62736, 660506, 486791, 580134, 0011, 580134)

These are all valid horse IDs except 0011, which will return a NULL value. Then:

scrapescrape <- function(x) {  
    link <- paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=",x)  
    if (!is.null(readHTMLTable(link, which=2))) { 
    Frame1 <- readHTMLTable(link, which=2) 
    } 
} 

total_data = c(0) 
for (id in ids) { 
    total_data = rbind(total_data, scrapescrape(id)) 
}

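However, I think the error is thrown in the if statement, which means the function stops when it reaches the first NULL value. Any help would be greatly appreciated. Many thanks.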

Source

2017-02-16 Robertlemoko

Before reading the HTML table, you could analyze the HTML first (inspect the page you get back and find a way to identify an erroneous result).
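As a rough sketch of that kind of check, one could parse the page, count its table nodes, and only read the second table when it exists. The helper name read_nth_table and the inline HTML strings are hypothetical stand-ins for illustration; on the real site you would pass the result of htmlParse(link) instead.

```r
library(XML)

# Hypothetical helper: return the n-th table of a parsed page,
# or NULL when the page contains fewer than n tables.
read_nth_table <- function(doc, n = 2) {
    tables <- getNodeSet(doc, "//table")
    if (length(tables) < n) {
        return(NULL)
    }
    readHTMLTable(tables[[n]])
}

# Small inline pages standing in for real responses (assumed shapes)
good <- htmlParse(
    "<html><body>
       <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
       <table><tr><th>b</th></tr><tr><td>2</td></tr></table>
     </body></html>",
    asText = TRUE
)
bad <- htmlParse("<html><body><p>No such horse</p></body></html>", asText = TRUE)

res_good <- read_nth_table(good)  # a data frame
res_bad  <- read_nth_table(bad)   # NULL, no error raised
```

Checking the table count up front avoids asking readHTMLTable for a table that is not there, which appears to be what triggers the NULL signature error.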

But you can also make sure the function returns a placeholder (NA) when an error is thrown, like this:

library(XML) 

scrapescrape <- function(x) {

    link <- paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=", x)

    # Return NA instead of stopping when the table cannot be read
    tryCatch(readHTMLTable(link, which=2), error=function(e){NA})

}

ids <- c(896119, 766254, 790946, 556341, 62736, 660506, 486791, 580134, 0011, 580134) 

lst <- lapply(ids, scrapescrape) 

str(lst)
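To get from that list to a single data frame, one option is to drop the NA placeholders and row-bind what remains. This is only a sketch: the small list below is made-up data standing in for the real lst, and it assumes every successfully read table has the same columns.

```r
# Made-up stand-in for the list returned by lapply(ids, scrapescrape):
# two successful tables and one failed lookup (NA).
lst <- list(
    data.frame(race = "A", pos = 1),
    NA,
    data.frame(race = "B", pos = 3)
)

# Keep only the elements that are data frames
ok <- Filter(is.data.frame, lst)

# Stack them into one data frame (assumes identical columns)
total_data <- do.call(rbind, ok)
```

Building the list first and binding once at the end also avoids growing total_data with rbind inside a loop.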

Source

2017-02-16 09:46:08 Wietze314

Using rvest you can do this:

require(rvest) 
require(purrr) 
paste0("http://www.racingpost.com/horses/horse_home.sd?horse_id=", ids) %>% 
    map(possibly(~html_session(.) %>% 
       read_html %>% 
       html_table(fill = TRUE) %>% 
       .[[2]], 
       NULL)) %>% 
    discard(is.null)

The last line discards all "failed" attempts. If you want to keep them, drop the last line.
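If the failures are kept, they can also be used to see which IDs did not work. The following is a minimal offline sketch of that idea: fetch is a hypothetical stand-in for the scraping step that simply fails for ID 11, so no network access is needed.

```r
library(purrr)

# Hypothetical stand-in for the scraping step: fails for ID 11
fetch <- function(id) {
    if (id == 11) stop("no table found")
    data.frame(id = id)
}

# possibly() wraps fetch so that an error yields NULL instead of stopping
safe_fetch <- possibly(fetch, otherwise = NULL)

ids <- c(896119, 11, 766254)
results <- map(ids, safe_fetch)

# IDs whose lookup failed, and the successful results only
failed  <- ids[map_lgl(results, is.null)]
success <- discard(results, is.null)
```

Here failed holds the IDs that returned nothing, which can be logged or retried later.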

Source

2017-02-16 10:03:35 Rentrop
