Scraping in a loop while avoiding 404 errors

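I want to scrape Wikipedia for some astronomy-related definitions for my project. The code works fine, but I can't avoid 404s. I tried tryCatch, but I think I'm missing something here.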

I'm looking for a way to get past 404s while the loop is running. Here is my code:

library(rvest) 
library(httr) 
library(XML) 
library(tm) 


topic<-c("Neutron star", "Black hole", "sagittarius A") 

for(i in topic){ 

    site<- paste("https://en.wikipedia.org/wiki/", i) 
    site <- read_html(site) 

    stats<- xmlValue(getNodeSet(htmlParse(site),"//p")[[1]]) #only the first paragraph 
    #error = function(e){NA} 

    stats[["topic"]] <- i 

    stats<- gsub('\\[.*?\\]', '', stats) 
    #stats<-stats[!duplicated(stats),] 
    #out.file <- data.frame(rbind(stats,F[i])) 

    output<-rbind(stats,i) 

}

Source

2016-09-22 Sree Krishna

I think you mean recording the error and then skipping to the next iteration of the loop?

Related, possibly a duplicate: http://stackoverflow.com/questions/8093914 – zx8754

As a side note, see http://stackoverflow.com/questions/14693956 on how to keep rbind from getting really slow as a data frame grows. – konvas

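Build the URLs with sprintf, using the loop variable.
Extract all of the body text from the paragraph nodes.
Remove entries of length 0 (empty paragraphs).
I added a step that combines all of the body text into one string, with each paragraph prefixed by [paragraph - n] for reference... because, well, friends don't let friends waste data or make multiple HTTP requests for the same page.
Build a data frame for each iteration over your topic list, with the following columns:
wiki_url: should be obvious
topic: from the topic list
info_summary: the first paragraph (the one you mentioned in your post)
all_info: in case you need more... ya know.
Bind all of the data frames in the list into one.

Note that I use an older, source version of rvest. For ease of understanding, I'm simply assigning the name html to what you would call read_html.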

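library(rvest)
library(jsonlite)

# `topic` is the character vector defined in the question.
# Alias html() to read_html() so the calls below work on current rvest.
html <- rvest::read_html

wiki_base <- "https://en.wikipedia.org/wiki/%s"

my_table <- lapply(sprintf(wiki_base, topic), function(url) {

  # Text of every <p> node on the page
  raw_1 <- html_text(html_nodes(html(url), "p"))

  # Drop empty paragraphs
  raw_valid <- raw_1[nchar(raw_1) > 0]

  # Prefix each paragraph with its index, then collapse into one string
  all_info <- paste0(sprintf(" [paragraph - %d] %s ", seq_along(raw_valid), raw_valid),
                     collapse = "")

  data.frame(wiki_url = url,
             topic = basename(url),
             info_summary = raw_valid[[1]],
             all_info = trimws(all_info),
             stringsAsFactors = FALSE)

}) %>% rbind_pages  # rbind.pages in older jsonlite releases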

    > str(my_table) 
    'data.frame': 3 obs. of 4 variables: 
    $ wiki_url : chr "https://en.wikipedia.org/wiki/Neutron star"  "https://en.wikipedia.org/wiki/Black hole" "https://en.wikipedia.org/wiki/sagittarius A" 
    $ topic  : chr "Neutron star" "Black hole" "sagittarius A" 
    $ info_summary: chr "A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and densest stars kno"| __truncated__ "A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even particles and electrom"| __truncated__ "Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constellation Sagittarius"| __truncated__ 
    $ all_info : chr " [paragraph - 1] A neutron star is the collapsed core of a large star (10–29 solar masses). Neutron stars are the smallest and "| __truncated__ " [paragraph - 1] A black hole is a region of spacetime exhibiting such strong gravitational effects that nothing—not even parti"| __truncated__ " [paragraph - 1] Sagittarius A or Sgr A is a complex radio source at the center of the Milky Way. It is located in the constell"| __truncated__

EDIT

A function for error handling... it returns a logical. So this becomes our first step.

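library(httr)

# TRUE only if a HEAD request to `url` returns HTTP 200;
# FALSE for any other status or if the request itself fails.
url_works <- function(url) {
  tryCatch(
    identical(status_code(HEAD(url)), 200L),
    error = function(e) FALSE
  )
}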

Since you mentioned exoplanets, here is all of the applicable data from the wiki page:

exo_data <- (html_nodes(html('https://en.wikipedia.org/wiki/List_of_exoplanets'),'.wikitable')%>%html_table)[[2]]

str(exo_data)

'data.frame': 2048 obs. of 16 variables: 
$ Name       : chr "Proxima Centauri b" "KOI-1843.03" "KOI-1843.01" "KOI-1843.02" ... 
$ bf       : int 0 0 0 0 0 0 0 0 0 0 ... 
$ Mass (Jupiter mass)   : num 0.004 0.0014 NA NA 0.1419 ... 
$ Radius (Jupiter radii)  : num NA 0.054 0.114 0.071 1.012 ... 
$ Period (days)     : num 11.186 0.177 4.195 6.356 19.224 ... 
$ Semi-major axis (AU)   : num 0.05 0.0048 0.039 0.052 0.143 0.229 0.0271 0.053 1.33 2.1 ... 
$ Ecc.       : num 0.35 1.012 NA NA 0.0626 ... 
$ Inc. (deg)     : num NA 72 89.4 88.2 87.1 ... 
$ Temp. (K)      : num 234 NA NA NA 707 ... 
$ Discovery method    : chr "radial vel." "transit" "transit" "transit" ... 
$ Disc. Year     : int 2016 2012 2012 2012 2010 2010 2010 2014 2009 2005 ... 
$ Distance (pc)     : num 1.29 NA NA NA 650 ... 
$ Host star mass (solar masses) : num 0.123 0.46 0.46 0.46 1.05 1.05 1.05 0.69 1.25 0.22 ... 
$ Host star radius (solar radii): num 0.141 0.45 0.45 0.45 1.23 1.23 1.23 NA NA NA ... 
$ Host star temp. (K)   : num 3024 3584 3584 3584 5722 ... 
$ Remarks      : chr "Closest exoplanet to our Solar System. Within host star’s habitable zone; possibly Earth-like." "controversial" "controversial" "controversial" ...

Test our url_works function on a random sample of the table:

tests <- dplyr::sample_frac(exo_data, 0.02) %>% .$Name

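Now let's build a table with the name, the URL to check, and a logical for whether the URL is valid, and in one step create a list of two data frames: one containing the URLs that don't exist, and the other those that do. The ones that check out can be run through the function above with no problem. This way the error handling is done before we actually start trying to parse in a loop. It avoids headaches and gives a reference for the items that need further investigation.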
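The code for this step is not shown here. What follows is a minimal sketch of one way to do it, not the answerer's original code. The helper name build_url_table, its arguments, and the column names are hypothetical. The sketch assumes page titles map to URLs by replacing spaces with underscores, and it takes the checker as an argument so that url_works (or any other function returning TRUE/FALSE) can be plugged in.

```r
# Hypothetical helper (not from the original answer).
# names: character vector of page titles
# check: function(url) returning TRUE or FALSE, e.g. url_works
# base:  sprintf template for the page URL
build_url_table <- function(names, check,
                            base = "https://en.wikipedia.org/wiki/%s") {
  # Assumption: Wikipedia page URLs use underscores in place of spaces
  urls <- sprintf(base, gsub(" ", "_", names, fixed = TRUE))

  tbl <- data.frame(
    name  = names,
    url   = urls,
    valid = vapply(urls, check, logical(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )

  # Fix the factor levels so the result always has two items:
  # first the invalid URLs ("FALSE"), second the valid ones ("TRUE")
  split(tbl, factor(tbl$valid, levels = c(FALSE, TRUE)))
}

# Usage with the objects defined earlier (makes one HTTP request per name):
# checked <- build_url_table(tests, check = url_works)
```

Fixing the factor levels keeps both list items present even when every URL is valid or every URL fails, so code that indexes the second item does not break.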


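Obviously the second item in the list contains the data frame with the valid URLs, so apply the earlier function to the url column of that one. Note that, for the purposes of explanation, I pulled the table of all the planets... there are 2400-some odd names, so the check will take a minute or two to run in your case. Hope that wraps it up for you.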
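If you would rather skip a failing URL inside the loop and note the error, a tryCatch around each request is one option. This is a minimal sketch, not part of the original answer. The helper name scrape_all and its arguments are hypothetical, and it assumes the fetch function signals an R error for a missing page, as read_html typically does on a 404. Results are collected in a list and bound once at the end rather than calling rbind on every iteration.

```r
# Hypothetical helper (not from the original answer).
# urls:  character vector of URLs
# fetch: function(url) returning a one-row data frame, or signalling an error
scrape_all <- function(urls, fetch) {
  rows   <- vector("list", length(urls))  # preallocated, filled in place
  errors <- character(0)                  # named by URL, holds error messages

  for (k in seq_along(urls)) {
    res <- tryCatch(
      fetch(urls[[k]]),
      error = function(e) {
        # Record the message and carry on with the next URL
        errors[urls[[k]]] <<- conditionMessage(e)
        NULL
      }
    )
    # Assigning NULL to a list element would delete it, so only store successes
    if (!is.null(res)) rows[[k]] <- res
  }

  # Bind once at the end; failed slots are still NULL and are dropped first
  list(data   = do.call(rbind, Filter(Negate(is.null), rows)),
       errors = errors)
}

# Usage (valid_urls and fetch_page are placeholders for your own objects):
# result <- scrape_all(valid_urls, fetch_page)
# result$errors  # which URLs failed, and why
```

Separately, paste() in the question joins its arguments with a space by default, so the URL it builds contains a space after "wiki/". paste0() or sprintf() avoids that.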

Source

2016-09-22 12:55:54

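I was trying to use regular expressions to clean the text, so I tried the tm package. I only need the first paragraph to test my code. I know this won't take much time, but I have a long list of topics. What you gave works fine, but I don't see a step for error handling.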

Sebastian, you're right! I'm looking for a way to skip the bad URL, or record the error in a variable, and move on to the next item.

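Where the error originates... that is unclear. Is the error that an item in the list may not have a page? Your example doesn't throw an error... so I assume that's the basis. You need to figure out where the error will come from... and as far as regex goes... that's vague, but whatever.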
