Scraping tables from 5K web pages whose URLs are listed in a .csv file, all in R

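I am trying to extract data from the following site: http://livingwage.mit.edu ... at the county level, and have tried many different iterations using the rvest package to extract the data. Unfortunately, there are about 5K counties.

I have already extracted all of the URLs into a single .csv file. The URLs have the format "http://livingwage.mit.edu/counties/ ...", where the "..." is the state code followed by the county code.

The data I want has the following CSS identifier (from SelectorGadget):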

css = '.wages_table .even .col-NaN , .wages_table .results .col-NaN'

or

xpath = //*[contains(concat(" ", @class, " "), concat(" ", "wages_table", " "))]//*[contains(concat(" ", @class, " "), concat(" ", "even", " "))]//*[contains(concat(" ", @class, " "), concat(" ", "col-NaN", " "))] | //*[contains(concat(" ", @class, " "), concat(" ", "wages_table", " "))]//*[contains(concat(" ", @class, " "), concat(" ", "results", " "))]//*[contains(concat(" ", @class, " "), concat(" ", "col-NaN", " "))]

as an XPath expression. This is what I started with:

library(rvest) 
url <- read_html("http://livingwage.mit.edu/counties/01001") 
url %>% 
html_nodes("table") %>% 
    .[[1]] %>% 
    html_table()

...but it only pulls one table at a time, and it includes the header and the last row, which I do not want.
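For reference, one way to drop the header and the last row might look like the sketch below. It runs on a made-up inline table rather than the real page, so the markup and the values are placeholders, and it assumes the unwanted rows really are the first and last rows of the table.

```r
library(rvest)

# Made-up stand-in for one county page; the real markup and values will differ.
page <- read_html('
  <table>
    <tr><th>Hourly Wages</th><th>1 Adult</th></tr>
    <tr><td>Living Wage</td><td>$10.00</td></tr>
    <tr><td>Poverty Wage</td><td>$5.00</td></tr>
    <tr><td>Footer row</td><td>n/a</td></tr>
  </table>')

tbl <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(header = FALSE)  # keep the header as an ordinary first row

# Drop the first (header) row and the last row.
trimmed <- tbl[-c(1, nrow(tbl)), ]
```

With `header = FALSE` the columns get generic names, so they would need renaming afterwards if the labels matter.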

So I tried something like this:

counties <- 01001:54500 
urls <- paste0("http://livingwage.mit.edu/counties/", counties) 
get_table <- function(url) { 
    url %>% 
    read_html() %>% 
    html_nodes("table") %>% 
    .[[1]] %>% 
    html_table() 
} 
results <- sapply(urls, get_table)

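...but quickly realized that the county numbers are not consecutive (they are mostly odd numbers), and not regularly spaced either; for example, a state may have only 4 counties, so its URLs only go up to something like ~/10009.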
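A side note on the attempt above: R parses the literal `01001` as the number 1001, so the leading zero is lost and the pasted URL ends in `/1001`. A minimal sketch of padding known codes back to five digits, using a few illustrative codes rather than the full list:

```r
# Illustrative numeric codes only; the real list would come from the .csv.
codes <- c(1001, 1003, 10005)

# Pad to five digits so the leading zero survives.
fips <- sprintf("%05d", as.integer(codes))

urls <- paste0("http://livingwage.mit.edu/counties/", fips)
```

This only fixes the formatting; it does not solve the problem that the valid codes are not a simple sequence.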

URL <- read.csv("~/Desktop/LW_url.csv", header=T) 
URL %>% 
html_nodes("table", ".wages_table .even .col-NaN , .wages_table .results .col-NaN") %>% 
    .[[1]] %>% 
    html_table()

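...only to find that the CSS selector and the list of URLs I read in do not work together.

Ultimately, I would like to read the list of URLs from the .csv on my desktop and have it work nicely with the table extraction.

Any help in making this happen would be thoroughly appreciated.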

2016-11-28 J-Dizzle

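They obviously have the data in a "real" format (that is how they build the site). Have you tried emailing them (the email link is on Dr. Glasmeier's page)? Scraping is going to be error-prone. Getting the actual data file would probably take less than a day. – hrbrmstr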

If @hrbrmstr's approach does not work, you could try the async functionality of 'curl' (> 2.0). Here are examples: https://github.com/jeroenooms/curl/blob/master/examples/crawler.R and https://cran.r-project.org/web/packages/curl/vignettes/intro.html#async_requests – Rentrop
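A rough sketch of what the async route could look like, not taken from the linked examples. `fetch_all` and its `total_con` argument are hypothetical names chosen here; the sketch only downloads the raw HTML and leaves parsing to rvest afterwards.

```r
library(curl)

# Hypothetical helper: download many URLs concurrently and return the HTML
# of each successful response as a string, in a list named by URL.
fetch_all <- function(urls, total_con = 10) {
  pool <- new_pool(total_con = total_con)
  pages <- list()
  for (u in urls) {
    local({
      url <- u  # capture the current URL for the callbacks
      curl_fetch_multi(
        url,
        done = function(res) pages[[url]] <<- rawToChar(res$content),
        fail = function(msg) message("Failed: ", url, " (", msg, ")"),
        pool = pool
      )
    })
  }
  multi_run(pool = pool)  # blocks until every queued request has finished
  pages
}
```

Something like `pages <- fetch_all(urls)` followed by `lapply(pages, rvest::read_html)` would then hand the downloaded pages to the table extraction. Keeping the number of concurrent connections small is kinder to the server.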

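@hrbrmstr and Floo0, thank you for your replies. I contacted Dr. Glasmeier first, and her response was, "At this time I am not distributing the data, but I am in the process of determining whether I will do so." I was unable to persuade her otherwise.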

I think this is what you are looking for.

# install.packages(c("pbapply", "data.table")) # run once if needed
library(rvest)
library(dplyr)
library(magrittr)
library(pbapply) # pblapply: lapply with a progress bar and a run-time estimate

## Get the state URLs

lwc.url <- "http://livingwage.mit.edu"

state.urls <- read_html(lwc.url) %>%
    html_nodes(".col-md-6 a") %>%
    html_attr("href") %>%
    paste0(lwc.url, .)

## Get the county URLs (county names are read from each page later)

county.urls <- lapply(state.urls, function(x) {
    read_html(x) %>%
        html_nodes(".col-md-3 a") %>%
        html_attr("href") %>%
        paste0(lwc.url, .)
}) %>% unlist()


## Get the tables: hourly wages and typical expenses

dfs <- pblapply(county.urls, function(x) {

    LWC <- read_html(x)
    tables <- LWC %>% html_nodes("table")

    # Stack the first two tables, giving their first column a common name
    df <- rbind(
        tables[[1]] %>% html_table() %>% setNames(c("Info", names(.)[-1])),
        tables[[2]] %>% html_table() %>% setNames(c("Info", names(.)[-1]))
    )

    # The regexes below expect a heading like "... for <County> County, <State>"
    title <- LWC %>% html_nodes("h1") %>% html_text()

    df$State <- trimws(gsub(".*,", "", title))
    df$County <- trimws(gsub(".*for (.*) County.*", "\\1", title))
    df$url <- x

    df
})

# Combine the per-county data frames into one table
df <- data.table::rbindlist(dfs)
View(df)
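If the URLs are already in a .csv, as in the question, the crawling steps could be skipped. The sketch below is one possible shape for that, assuming the file has a header row and a column named `url` (the question does not say what the column is called). `get_table`, `scrape_from_csv` and `pause` are names made up for this example.

```r
library(rvest)

# Return the first table on a page, or NULL if the page cannot be read
# or contains no table.
get_table <- function(url) {
  tryCatch(
    read_html(url) %>% html_nodes("table") %>% .[[1]] %>% html_table(),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Assumption: the .csv has a header row and a column named "url".
scrape_from_csv <- function(csv_path, pause = 0) {
  urls <- read.csv(csv_path, stringsAsFactors = FALSE)$url
  tables <- lapply(urls, function(u) {
    Sys.sleep(pause)  # optional delay between requests
    get_table(u)
  })
  names(tables) <- urls
  Filter(Negate(is.null), tables)  # drop the pages that failed
}
```

A call such as `results <- scrape_from_csv("~/Desktop/LW_url.csv", pause = 1)` would then return a list of tables named by URL. Wrapping each request in `tryCatch` means one bad page does not stop a run over roughly 5K URLs.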

2016-11-29 18:06:23

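This is a beautiful solution. It collects more than I needed, but I can definitely take it from here! I really appreciate your willingness and ability to help.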
