Using R2HTML with rvest/xml2

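I was reading this blog post about the new xml2 package. rvest used to depend on XML, and combining functions from the two packages made a lot of work easier (for me, at least): for example, I would use htmlParse from the XML package when I could not read an HTML page with html (now called read_html).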

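See this example. I could then use rvest functions such as html_nodes and html_attr on the parsed page. Now that rvest depends on xml2, this is no longer possible (at least on the surface).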

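I just want to know the basic difference between XML and xml2. Apart from crediting the author of the XML package in the post mentioned above, the package author did not explain how XML and xml2 differ.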

Another example:

library(R2HTML) #save page as html and read later 
library(XML) 
k1<-htmlParse("https://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml") 
head(getHTMLLinks(k1),5) #This works 

[1] "//stackoverflow.com"   "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"   
[5] "http://meta.stackoverflow.com" 

# But, I want to save HTML file now in my working directory and work later 

HTML(k1,"k1") #Later I can work with this 
rm(k1) 
#read stored html file k1 
head(getHTMLLinks("k1"),5)#This works too 

[1] "//stackoverflow.com"   "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"   
[5] "http://meta.stackoverflow.com" 

#with read_html in rvest package, this is not possible (as I know) 
library(rvest) 
library(R2HTML) 
k2<-read_html("https://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml") 

#This works 
df1<-k2 %>% 
html_nodes("a")%>% 
html_attr("href") 

head(df1,5) 
[1] "//stackoverflow.com"   "http://chat.stackoverflow.com" "http://blog.stackoverflow.com" "//stackoverflow.com"   
[5] "http://meta.stackoverflow.com" 

# But, I want to save HTML file now in my working directory and work later 
HTML(k2,"k2") #Later I can work with this 
rm(k2,df1) 
#Now extract webpages by reading back k2 html file 
#This doesn't work 
k2<-read_html("k2") 

df1<-k2 %>% 
html_nodes("a")%>% 
html_attr("href") 

df1 
character(0)

Update:

#I have following versions of packages loaded: 
lapply(c("rvest","R2HTML","xml2","XML"),packageVersion) 
[[1]] 
[1] ‘0.2.0.9000’ 

[[2]] 
[1] ‘2.3.1’ 

[[3]] 
[1] ‘0.1.1’ 

[[4]] 
[1] ‘3.98.1.2’

I am using Windows 8, R 3.2.1 and RStudio 0.99.441.

Source

2015-06-22 user227710

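This works fine for me. I just installed the latest development version of 'rvest'. Maybe you should update yours ('R2HTML_2.3.1', 'rvest_0.2.0.9000', 'xml2_0.1.1'). – MrFlick

@MrFlick: I am using the same versions as you. It runs, but as you can see in the post, it gives 'character(0)' as output. – user227710

I can't reproduce this. I assume you installed rvest from the GitHub repo. The version number doesn't seem to change during development, so I'd still suggest reinstalling from the repo to get the latest version. Also, consider posting your operating system and R version (basically your 'sessionInfo()', in case this is an edge case). – MrFlick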

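The R2HTML package seems to just capture.output the XML object and then write that back to disk. That doesn't seem like a reliable way to save HTML/XML data back to disk. The two probably differ because XML objects print differently from xml2 objects. You could define a method that calls as.character() rather than relying on capture.output: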

HTML.xml_document<-function(x, ...) HTML(as.character(x),...)
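To see why the distinction matters, here is a minimal sketch comparing the printed form of an xml2 document with its serialized form. The inline HTML string and the variable names are made up for illustration, and whether R2HTML really goes through capture.output for this class is the assumption stated above, not something verified here.

```r
library(xml2)

# Hypothetical one-link page, parsed from a literal string (no network needed)
doc <- read_html("<html><body><a href='http://example.com'>link</a></body></html>")

# What a capture.output-based writer would see: the print summary,
# which starts with a class banner rather than with markup
printed <- capture.output(print(doc))
printed[1]

# What as.character() returns: the serialized document as one string
serialized <- as.character(doc)
grepl("href=\"http://example.com\"", serialized, fixed = TRUE)
```

If the printed summary is what ends up on disk, the saved file is not the original markup, which would be consistent with the empty result in the question.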

Or you could skip R2HTML entirely and write the xml2 data out directly with write_xml.

Perhaps the best approach is to download the file first and then import it.
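A rough sketch of that write_xml route, assuming a version of xml2 that exports write_xml. The inline page, the temporary file path and the variable names are placeholders for illustration, not taken from the question.

```r
library(xml2)
library(rvest)

# Hypothetical two-link page, parsed from a literal string
page <- read_html(
  "<html><body><a href='http://example.com/a'>A</a><a href='http://example.com/b'>B</a></body></html>"
)

# Save the parsed document to disk without going through R2HTML
path <- tempfile(fileext = ".html")
write_xml(page, path)
rm(page)

# Later: read the saved file back and extract the links as before
page2 <- read_html(path)
links <- page2 %>%
  html_nodes("a") %>%
  html_attr("href")
print(links)
```

With a real page, the temporary path would be replaced by a file name in the working directory, such as the "k2" used in the question.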

download.file("http://stackoverflow.com/questions/30897852/html-in-rvest-verses-htmlparse-in-xml", "local.html") 
k2 <- read_html("local.html")

Source

2015-06-22 17:57:19 MrFlick

Thanks. 'download.file' is a nice option (it never occurred to me). – user227710
