Problem scraping a web page

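I am using R to scrape data from the following web page: http://www.baseball-reference.com/boxes/BAL/BAL201403310.shtml. One specific item I am interested in is the Start Time Weather (located about halfway down the page), but I have not been able to scrape it.

Using SelectorGadget, I wrote: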

library(rvest)

game <- read_html(x = "http://www.baseball-reference.com/boxes/BAL/BAL201403310.shtml")

weather <- game %>%
    html_node(".section_wrapper+ .section_wrapper div:nth-child(5)") %>%
    html_text()

weather

[1] NA

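How can I modify my code to avoid the NA? The same thing happens on the pages for other games.

I hope you can help me! I can't seem to find the right approach.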

Source

2017-04-18 josehernandez

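Use `base::readLines`? Something like `lines <- readLines("http://www.baseball-reference.com/boxes/BAL/BAL201403310.shtml"); lines[grepl("Start Time Weather", lines)]` – chinsoon12

chinsoon12, I just tried it and it works! Thank you very much. – josehernandez

You can read the page with readLines and subset the Start Time Weather line before parsing it as HTML, as follows: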

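# Another example page: http://www.baseball-reference.com/boxes/ARI/ARI201403220.shtml
library(rvest)

# Read the raw page source, one element per line
lines <- readLines("http://www.baseball-reference.com/boxes/BAL/BAL201403310.shtml")

# Keep the line containing the weather, parse it as HTML, and extract its text
weather <- read_html(lines[which(grepl("Start Time Weather", lines))]) %>%
    html_node("div") %>%
    html_text()

# Strip the label, leaving only the weather description
gsub("Start Time Weather: ", "", weather)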
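
The same line-filtering idea can be tried offline without rvest. This is only a sketch under stated assumptions: the `lines` vector below is made-up sample markup standing in for the output of `readLines()`, the weather value is invented, and `extract_weather` is a hypothetical helper name. It swaps `read_html()` for a crude regex that strips tags, which is adequate for one simple line but is not a general HTML parser.

```r
# Made-up sample markup standing in for readLines(url); the real page may differ.
lines <- c(
  "<html><body>",
  "<div><strong>Venue</strong>: Example Park</div>",
  "<div><strong>Start Time Weather:</strong> Sunny, 70 F</div>",
  "</body></html>"
)

# Hypothetical helper: return the text following `label` on the first
# matching line, or NA if no line contains the label.
extract_weather <- function(lines, label = "Start Time Weather") {
  # fixed = TRUE treats the label as a literal string, not a regex
  hits <- lines[grepl(label, lines, fixed = TRUE)]
  if (length(hits) == 0) {
    return(NA_character_)
  }
  # Crude tag removal (assumption: the line holds simple, well-formed tags)
  text <- gsub("<[^>]+>", "", hits[1])
  pos <- as.integer(regexpr(label, text, fixed = TRUE))
  if (pos < 0) {
    return(NA_character_)
  }
  # Keep what follows the label, then drop the colon and surrounding spaces
  rest <- substring(text, pos + nchar(label))
  trimws(sub("^:[[:space:]]*", "", rest))
}

weather <- extract_weather(lines)
print(weather)
```

For a real page, pass the result of `readLines()` on the game URL as `lines`. If the label is missing, the helper returns `NA` instead of raising an error, which makes it easier to apply to many game pages.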

Source

2017-04-18 07:41:29 chinsoon12

Thanks, chinsoon12! – josehernandez
