Large data (~90k) XPath scrape

I'm looking for an efficient way to scrape a clean XPath over thousands of iterations from the Vermont Secretary of State's site. This is the title XPath I'm trying to scrape:

'//*[@id="content_wrapper"]/div[2]/div/h1'

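I'm struggling to find a clean, efficient way to run a loop that goes through about 90,000 pages, grabs the title, and stores it in a vector. The end goal is to export a small data frame containing the page value and the title from the XPath. I will use this data frame to build an index for future searches in a database.

This is what I have so far: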

library(XML) 
library(rvest) 

election_value <- 1:90000 
title <- NA 

for (i in 1:90000) { 
    url <- sprintf("http://vtelectionarchive.sec.state.vt.us/elections/view/%s", election_value[i]) 
    if (is.null(tryCatch({read_html(url) %>% html_nodes(xpath='//*[@id="content_wrapper"]/div[2]/div/h1') %>% html_text()}, error=function(e){}))) { 
    title[i] <- NA } else { 
     title[i] <- read_html(url) %>% html_nodes(xpath='//*[@id="content_wrapper"]/div[2]/div/h1')} 
} 
vermont_titles <- data.frame(election_value, title) 
write.csv(vermont_titles, 'vermont_titles.csv')

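Unfortunately, the script doesn't work because the html_nodes() function returns a string wrapped in brackets rather than just the text. Any solution would be appreciated, as this script has been troubling me for a week or so.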

Source

2017-05-08 Hanna

Please check the URL you posted: "http://vtelectionarchive.sec.state.vt.us/elections/view/%s" produces a '400 Bad Request'. I think the correct URL is http://vtelectionarchive.sec.state.vt.us/elections/search/year_from:1789/year_to:2016 – Ashish

The '%s' is replaced by the number because of the 'sprintf()' call. It's not clear what the OP is trying to do. – hrbrmstr

Here is a working solution. See the comments in the code for more details:

library(rvest)

# Example page: http://vtelectionarchive.sec.state.vt.us/elections/view/68156
election_value <- 68150:68199

# Preallocate the title vector instead of growing it inside the loop
title <- vector("character", length = length(election_value))

for (i in seq_along(election_value)) {
    url <- sprintf("http://vtelectionarchive.sec.state.vt.us/elections/view/%s", election_value[i])
    # Read the page once; tryCatch() yields NULL if the request fails
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (is.null(page)) {
        title[i] <- NA
    } else {
        # Parse the page and extract the title as text
        node <- page %>% html_nodes(xpath = '//*[@id="content_wrapper"]/div[2]/div/h1')
        text <- node %>% html_text()
        # Guard against pages that load but contain no matching <h1>
        title[i] <- if (length(text) > 0) text[1] else NA
    }
}
vermont_titles <- data.frame(election_value, title)
write.csv(vermont_titles, 'vermont_titles.csv')

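Two things to note: reading each page once instead of up to twice, and parsing it only once, improves performance. Predefining title as a vector of the final length is another performance gain.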
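The difference between html_nodes() and html_text() can be checked without touching the network. The following is a minimal sketch: the HTML string is made up to mimic the nesting that the XPath expects and is an assumption, not the real page source.

```r
library(rvest)

# Made-up markup (an assumption) that matches the shape of the XPath:
# #content_wrapper -> second <div> -> <div> -> <h1>
html <- '<html><body><div id="content_wrapper">
  <div>navigation</div>
  <div><div><h1>2014 Probate Judge General Election Rutland County</h1></div></div>
</div></body></html>'

page <- read_html(html)
xp <- '//*[@id="content_wrapper"]/div[2]/div/h1'

# html_nodes() returns an xml_nodeset, which prints with its own header
nodes <- html_nodes(page, xpath = xp)

# html_text() turns the nodeset into a plain character vector
text <- html_text(nodes)

# An XPath that matches nothing gives a zero-length character vector,
# which cannot be assigned to a single vector slot such as title[i]
no_match <- html_text(html_nodes(page, xpath = '//*[@id="missing"]/h1'))

print(class(nodes))
print(text)
print(length(no_match))
```

Assigning the nodeset itself to title[i], as the question's loop does, stores the node object rather than its text, which would explain the bracketed output.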

Source

2017-05-09 02:21:32 Dave2e

Another solution might be:

require(tidyverse) 
require(rvest) 
election_value <- c(3,68150:68153) 
base_url <- "http://vtelectionarchive.sec.state.vt.us/elections/view/" 
urls <- paste0(base_url, election_value) 

map(urls, possibly(read_html, NA_character_)) %>% 
    map_if(negate(is.na), html_nodes, xpath = '//*[@id="content_wrapper"]/div[2]/div/h1') %>% 
    map_if(negate(is.na), html_text) %>% 
    as.character %>% 
    tibble(election_value, title = .)

# A tibble: 5 × 2
  election_value title
           <dbl> <chr>
1              3 NA
2          68150 2014 Probate Judge General Election Rutland County
3          68151 2014 Probate Judge General Election Orleans County
4          68152 2014 Probate Judge General Election Grand Isle County
5          68153 2014 Probate Judge General Election Lamoille County
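The per-page steps (read, select, extract, fall back to NA) can also be factored into small functions so that the same logic is reusable over any range of IDs. This is only a sketch: the names extract_title and scrape_title are hypothetical, and the inline HTML used for the demonstration is made up rather than taken from the site.

```r
library(rvest)

title_xpath <- '//*[@id="content_wrapper"]/div[2]/div/h1'

# Hypothetical helper: first matching title as text, or NA if nothing matches
extract_title <- function(page, xpath = title_xpath) {
  text <- html_text(html_nodes(page, xpath = xpath))
  if (length(text) > 0) text[1] else NA_character_
}

# Hypothetical helper: read one page and extract its title;
# any read error (bad URL, missing file, HTTP failure) becomes NA
scrape_title <- function(url) {
  tryCatch(extract_title(read_html(url)), error = function(e) NA_character_)
}

# Offline demonstration on made-up markup (an assumption, not the real page)
demo_page <- read_html('<div id="content_wrapper"><div></div>
  <div><div><h1>Demo Election Title</h1></div></div></div>')
empty_page <- read_html('<div id="content_wrapper"></div>')

demo_title  <- extract_title(demo_page)
empty_title <- extract_title(empty_page)
bad_title   <- scrape_title("no_such_file_for_this_demo.html")

print(demo_title)
print(empty_title)
print(bad_title)
```

Applied to the real site, the call would be something like title <- vapply(urls, scrape_title, character(1), USE.NAMES = FALSE), where urls is the vector built with paste0() above. vapply() allocates the result at its final length and checks that every element is a single string.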

Source

2017-05-10 14:01:22 Rentrop
