2015-10-18

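Appending/merging to a vector in R inside a loop

I am pulling data from multiple pages of a website and trying to combine it all into one data frame. The page URLs follow a regular pattern, so I tried collecting the links in one place and then iterating over them with a for loop. Here is the block of code I am working with: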

library(rvest)

ingredientsList <- c()
links <- paste0("http://www.bbc.co.uk/food/ingredients/by/letter/", letters)
# links now contains:
# http://www.bbc.co.uk/food/ingredients/by/letter/a
# http://www.bbc.co.uk/food/ingredients/by/letter/b
# http://www.bbc.co.uk/food/ingredients/by/letter/c and so on up to z
for(i in 1:26){ 
    session<-html_session(links[i]) 
    ingredients<-session %>% html_nodes("ol:nth-child(4) a") %>% html_text() 
    ingredientsList<-c(ingredientsList,ingredients) 
} 

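The result is ingredientsList, which should ideally contain the list of all ingredients from 'A' to 'Z'. I am learning R and am fairly new to scraping, so I would really appreciate some guidance. Thank you.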

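The code block runs fine, but it does not present the output in a suitable format. I would like to know what format I should use, and whether the approach above is optimal.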

Answer

You would be better off using a list instead of a vector, and you can build it directly with lapply, like this:

library(rvest) 
library(stringr) 

url <- "http://www.bbc.co.uk/food/ingredients/by/letter/" 
urls <- paste0(url, letters) 

ingredientsList <- lapply(urls, function(u) { 
    u %>% 
    html_session() %>% 
    html_nodes("ol:nth-child(4) a") %>% 
    html_text() %>% 
    str_replace_all(pattern = "\n|Related|\\(\\d\\)|\\s{2,}", replacement = "") %>% ## clean results (remove space, etc) 
    subset(!str_detect(., "^\\s{1}")) 
}) 

names(ingredientsList) <- LETTERS 
str(ingredientsList) 
## List of 26 
## $ A: chr [1:33] "Acidulated water" "Ackee" "Acorn squash" "Aduki beans" ... 
## $ B: chr [1:101] "Bacon" "Bagel" "Baguette" "Baked beans" ... 
## $ C: chr [1:174] "Cabbage" "Caerphilly" "Cake" "Calasparra rice" ... 
## $ D: chr [1:31] "Dab" "Daikon" "Damsons" "Dandelion" ... 
## $ E: chr [1:15] "Edam" "Eel" "Egg" "Egg liqueur" ... 
## $ F: chr [1:50] "Farfalle" "Fat" "Fennel" "Fennel seeds" ... 
## $ G: chr [1:53] "Galangal" "Game" "Gammon" "Garam masala" ... 
## $ H: chr [1:30] "Habañero chillies" "Haddock" "Haggis" "Hake" ... 
## $ I: chr [1:5] "Ice cream" "Iceberg lettuce" "Icing" "Icing sugar" ... 
## $ J: chr [1:12] "Jaggery" "Jam" "January King cabbage" "Japanese pumpkin" ... 
## $ K: chr [1:12] "Kabana" "Kale" "Ketchup" "Ketjap manis" ... 
## $ L: chr [1:49] "Lager" "Lamb" "Lamb breast" "Lamb chop" ... 
## $ M: chr [1:76] "Macadamia" "Macaroni" "Macaroon" "Mace" ... 
## $ N: chr [1:14] "Naan bread" "Nachos" "Nashi" "Nasturtium" ... 
## $ O: chr [1:20] "Oatcakes" "Oatmeal" "Oats" "Octopus" ... 
## $ P: chr [1:109] "Paella" "Pak choi" "Palm sugar" "Pancakes" ... 
## $ Q: chr [1:6] "Quail" "Quail's egg" "Quark" "Quatre-épices" ... 
## $ R: chr [1:62] "Rabbit" "Rack of lamb" "Radicchio" "Radish" ... 
## $ S: chr [1:125] "Safflower oil" "Saffron" "Sage" "Salad" ... 
## $ T: chr [1:47] "T-bone steak" "Tabasco" "Taco" "Tagliatelle" ... 
## $ U: chr "Unleavened bread" 
## $ V: chr [1:18] "Vacherin" "Vanilla essence" "Vanilla extract" "Vanilla pod" ... 
## $ W: chr [1:38] "Waffles" "Walnut" "Walnut oil" "Wasabi" ... 
## $ X: chr(0) 
## $ Y: chr [1:4] "Yam" "Yeast" "Yellow lentil" "Yoghurt" 
## $ Z: chr [1:2] "Zander" "Zest" 
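
Since the question mentions wanting a single data frame, the named list can be flattened afterwards. The following is only a minimal sketch: the small `toy` list stands in for the scraped `ingredientsList`, and the column names `letter` and `ingredient` are illustrative assumptions, not something taken from the site.

```r
# Toy stand-in for the scraped result; the real list has 26 elements.
toy <- list(
  A = c("Ackee", "Acorn squash"),
  X = character(0),          # a letter with no ingredients
  Y = c("Yam", "Yeast")
)

# lengths() gives the number of ingredients per letter, so rep() can
# repeat each letter name once per ingredient. Empty letters contribute
# no rows.
ingredients_df <- data.frame(
  letter     = rep(names(toy), lengths(toy)),
  ingredient = unlist(toy, use.names = FALSE),
  stringsAsFactors = FALSE
)

print(ingredients_df)
```

Replacing `toy` with `ingredientsList` should give one row per ingredient, with the letter kept as a separate column.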

Alternatively, we can use an approach similar to your for loop:

n <- length(letters) 
ingredientsList <- vector(mode = "list", length = n) 
names(ingredientsList) <- LETTERS 

for (i in seq_len(n)) {
    session <- html_session(urls[i])
    ingredientsList[[i]] <- session %>%
        html_nodes("ol:nth-child(4) a") %>%
        html_text()
}

But the trick is to stick with a list to hold your results.
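
To illustrate the difference without touching the network, here is a small sketch that contrasts the two ways of collecting results. The `pages` object is a hypothetical stand-in for three scraped pages, and the names `flat` and `grouped` are made up for this example.

```r
# Hypothetical per-page results standing in for three scraped pages.
pages <- list(c("Yam", "Yeast"), character(0), c("Zander", "Zest"))

# Growing a vector with c(): everything is flattened together.
flat <- c()
for (p in pages) flat <- c(flat, p)

# Filling a preallocated list: one slot per page, empty pages included.
grouped <- vector(mode = "list", length = length(pages))
for (i in seq_along(pages)) grouped[[i]] <- pages[[i]]

length(flat)      # number of strings, with no record of their source page
lengths(grouped)  # one count per page, including the empty one
```

With the vector, the page that returned nothing simply disappears; with the list, it remains as an empty element, so the grouping by page is preserved.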

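It works much better, @dickoa. I also came across the term hierarchical scraping, where we can extract information from sub-links within a link. Are there any resources you could suggest that I could look up?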