2017-06-05

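Scraping a table inside an iframe with the R rvest library

I really like R's rvest library for scraping websites, but I'm struggling with something new. From this webpage - http://www.naia.org/ViewArticle.dbml?ATCLID=205323044 - I am trying to scrape the main table of colleges.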

Here is what my code looks like right now:

library(rvest)

NAIA_url = "http://www.naia.org/ViewArticle.dbml?ATCLID=205323044"
NAIA_page = read_html(NAIA_url)

tables = html_table(html_nodes(NAIA_page, 'table'))
# tables is a length-2 list, but neither element is the table I want.

# grab the iframe node that appears to hold the table (the third one on the page)
iframe = html_nodes(NAIA_page, "iframe")[3]

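However, I'm struggling to get past this point. (1) For some reason, calling html_nodes does not grab the table I want. (2) I'm not sure whether I should grab the iframe and then try to scrape the table from inside it.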

Any help is appreciated!

You should get the source of the `iframe` and grab the table from there – yeedle

Answer

If the embedded iframe is HTML, you can fetch the iframe's source URL and extract the table you want from that document.


library(rvest) 
#> Loading required package: xml2 
library(magrittr) 
"http://www.naia.org/ViewArticle.dbml?ATCLID=205323044" %>% 
    read_html() %>% 
    html_nodes("iframe") %>% 
    extract(3) %>% 
    html_attr("src") %>% 
    read_html() %>% 
    html_node("#searchResultsTable") %>% 
    html_table() %>% 
    head() 
#>         College or University  City, State 
#> 1     Central Christian College ATHLETICS  McPherson, KS 
#> 2 +     Crowley's Ridge College ATHLETICS  Paragould, AR 
#> 3      Edward Waters College ATHLETICS Jacksonville, Fl 
#> 4     Fisher College ADMISSIONS | ATHLETICS  Boston, MA 
#> 5  Georgia Gwinnett College ADMISSIONS | ATHLETICS Lawrenceville, GA 
#> 6 Lincoln Christian University ADMISSIONS | ATHLETICS  Lincoln, IL 
#> Conference Enrollment 
#> 1  A.I.I.  259 
#> 2  A.I.I.   0 
#> 3  A.I.I.  805 
#> 4  A.I.I.  600 
#> 5  A.I.I.  9,720 
#> 6  A.I.I.  1,060 
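
For anyone wondering why `html_nodes(page, "table")` on the outer page misses the table: an iframe points to a separate document, so its contents are not part of the outer page's HTML. Below is a minimal offline sketch of the same idea. The HTML strings, the `example.com` URL, and the college names are made up for illustration and are not taken from the NAIA site; it also assumes the iframe `src` might be relative, and that a column such as Enrollment comes back as text because of thousands separators like `9,720`.

```r
library(rvest)  # xml2 is installed alongside rvest as a dependency

# Hypothetical outer page: the table lives in the document the iframe
# points to, so it is absent from the outer page's own HTML.
outer_html <- '<html><body>
  <iframe src="/ads/banner.html"></iframe>
  <iframe id="members" src="/members/list.html"></iframe>
</body></html>'
outer_page <- read_html(outer_html)

length(html_nodes(outer_page, "table"))  # 0: no table in the outer document

# List every iframe source, then resolve the wanted one against the
# (hypothetical) URL the outer page was fetched from.
srcs <- html_attr(html_nodes(outer_page, "iframe"), "src")
iframe_url <- xml2::url_absolute(srcs[2], "http://www.example.com/page.html")
# With network access, the next step would be read_html(iframe_url).

# Hypothetical iframe document, standing in for read_html(iframe_url).
inner_html <- '<table id="searchResultsTable">
  <tr><th>College</th><th>Enrollment</th></tr>
  <tr><td>College A</td><td>9,720</td></tr>
  <tr><td>College B</td><td>805</td></tr>
</table>'
members <- html_table(html_node(read_html(inner_html), "#searchResultsTable"))

# Strip the thousands separators before converting the column to numbers.
members$Enrollment <- as.numeric(gsub(",", "", members$Enrollment))
```

Printing `srcs` first is a quick way to check which position the wanted iframe occupies before hard-coding an index such as `3`.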

Wonderful, thanks a ton – Canovice