2016-03-28

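R: Scraping multiple URLs with a pipe chain in rvest

I have a character (chr) vector containing multiple URLs, and I want to download the content from each of them.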

To avoid writing out hundreds of commands, I would like to automate the process with a loop using lapply.

However, my command returns an error. Is it possible to scrape from multiple URLs?

Current approach

Long approach: works, but I would like to automate it

urls <-c("https://en.wikipedia.org/wiki/Belarus","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/England") 

library(rvest) 
library(httr) # required for user_agent command 

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" 
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring)) 
session2<-jump_to(session, "https://en.wikipedia.org/wiki/Belarus") 
session3<-jump_to(session, "https://en.wikipedia.org/wiki/Russia") 
writeBin(session2$response$content, "test1.txt") 
writeBin(session3$response$content, "test2.txt") 

Automated/looped approach: does not work.

urls <-c("https://en.wikipedia.org/wiki/Belarus","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/England") 

library(rvest) 
library(httr) # required for user_agent command 

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" 
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring)) 
lapply(urls, .%>% jump_to(session)) 
Error: is.session(x) is not TRUE 

Summary

I would like to automate the following two steps, jump_to() and writeBin(), as they appear in the code below:

session2<-jump_to(session, "https://en.wikipedia.org/wiki/Belarus") 
session3<-jump_to(session, "https://en.wikipedia.org/wiki/Russia") 
writeBin(session2$response$content, "test1.txt") 
writeBin(session3$response$content, "test2.txt") 

Answer


You could do something like this:

urls <-c("https://en.wikipedia.org/wiki/Belarus","https://en.wikipedia.org/wiki/Russia","https://en.wikipedia.org/wiki/England") 
require(httr) 
require(rvest) 
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" 
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring)) 

outfile <- sprintf("%s.html", sub(".*/", "", urls)) 

jump_and_write <- function(x, url, out_file){ 
    tmp = jump_to(x, url) 
    writeBin(tmp$response$content, out_file) 
} 

for(i in seq_along(urls)){ 
    jump_and_write(session, urls[i], outfile[i]) 
} 
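
For reference, the 'outfile' line above derives one file name per URL by stripping everything up to the last slash. The short sketch below shows what that produces for the three example URLs, and compares it with a 'basename()' variant, which is an alternative not used in the answer and is included here only as an assumption about what reads more clearly.

```r
urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia",
          "https://en.wikipedia.org/wiki/England")

# Same expression as in the answer: drop everything up to the last "/".
outfile <- sprintf("%s.html", sub(".*/", "", urls))
print(outfile)  # "Belarus.html" "Russia.html" "England.html"

# Alternative spelling (not from the answer): basename() keeps the last path part.
alt <- paste0(basename(urls), ".html")
print(identical(outfile, alt))  # TRUE
```

Note that either form would yield an empty name for a URL ending in a slash, so such URLs would need separate handling.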

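Can you explain why the original approach using 'lapply()' did not work? My understanding is that it loops a function over a list, which is much the same as what the 'for()' loop does.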


You are passing the arguments in the wrong order: 'lapply(urls, . %>% jump_to(session))' calls 'jump_to(url, session)', but 'jump_to' expects 'jump_to(session, url)'. You can fix this by using 'lapply(urls, . %>% jump_to(session, .))'. Have a look at '?magrittr::`%>%`'. – Rentrop
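
The argument-order point can be checked without touching the network. The sketch below is only an illustration: 'fake_jump()' and 'session_stub' are hypothetical stand-ins for 'jump_to()' and a real session, chosen so that the result shows which value ends up in which argument.

```r
library(magrittr)

# Hypothetical stand-in for jump_to(x, url): reports where each value landed.
fake_jump <- function(x, url) paste0("x=", x, " url=", url)

session_stub <- "SESSION"   # placeholder for a real session object
urls <- c("a", "b")

# Without an explicit dot, the piped value becomes the FIRST argument.
wrong <- lapply(urls, . %>% fake_jump(session_stub))

# With an explicit dot, the piped value goes exactly where the dot is.
right <- lapply(urls, . %>% fake_jump(session_stub, .))

print(wrong[[1]])  # "x=a url=SESSION"
print(right[[1]])  # "x=SESSION url=a"
```

In the first form the URL takes the place of the session, which is consistent with the 'is.session(x) is not TRUE' error reported in the question.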


Thanks. Is it possible to include the final 'writeBin()' command as well?
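
One possible way to fold 'writeBin()' into the same call is to iterate over URLs and file names together with 'Map()'. This is a minimal sketch rather than a confirmed answer from the thread: 'save_pages()' and 'fake_fetch()' are hypothetical names, and the fetch step is passed in as a function so the example runs offline. The commented line at the end shows how the session from the answer would presumably be plugged in.

```r
# Hypothetical helper: fetch each URL and write the returned raw bytes to a file.
# `fetch` is any function that takes a URL and returns a raw vector.
save_pages <- function(urls, out_files, fetch) {
  invisible(Map(function(url, out_file) {
    writeBin(fetch(url), out_file)
    out_file
  }, urls, out_files))
}

# Offline stand-in for the download step, so the sketch needs no network access.
fake_fetch <- function(url) charToRaw(paste("content of", url))

urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia")
out_files <- file.path(tempdir(), sprintf("%s.html", sub(".*/", "", urls)))

save_pages(urls, out_files, fake_fetch)
print(file.exists(out_files))

# With the rvest session from the answer, the call would presumably be:
# save_pages(urls, out_files, function(u) jump_to(session, u)$response$content)
# (rvest 1.0.0 and later name these functions session() and session_jump_to().)
```

Passing the fetch step as an argument keeps the loop logic separate from the download, which also makes it easy to add a pause between requests when scraping many pages.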