2017-06-02

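Replacing missing html_nodes using rvest in R

I want to parse an HTML page that contains several li elements inside each div. The sample below has only two divs, but the real page has nearly 7000. Not every div contains every li element; for example, <li class="brewery_type"> may be missing from some divs. When that happens, the html_nodes() calls return vectors of different lengths, and the code below cannot fill the tibble. Is there a way to still parse the page and put NA wherever a div lacks one of the li elements?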

library(rvest) 
library(dplyr) 

html_file <- '<!DOCTYPE html> 
<html> 

<head> 
    <title>Page Title</title> 
</head> 

<body> 
    <div class="brewery" id="brewery"> 
     <ul class="vcard simple"> 
      <li class="name"> Bradley Farm/RB Brew, LLC</li> 
      <li class="address">317 Springtown Rd </li> 
      <li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li> 
      <li class="telephone">Phone: (845) 255-8769</li> 
      <li class="brewery_type">Type: Micro</li> 
      <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li> 
     </ul> 
     <ul class="vcard simple col2"></ul> 
    </div> 
    <div class="brewery"> 
     <ul class="vcard simple"> 
      <li class="name">(405) Brewing Co</li> 
      <li class="address">1716 Topeka St </li> 
      <li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li> 
      <li class="telephone">Phone: (405) 816-0490</li> 
      <li class="brewery_type">Type: Micro</li> 
      <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li> 
     </ul> 
     <ul class="vcard simple col2"></ul> 
    </div> 
</body>' 

page <- read_html(html_file) 

tibble(
    name = page %>% html_nodes(".vcard .name") %>% html_text(), 
    address = page %>% html_nodes(".vcard .address") %>% html_text(), 
    type = page %>% html_nodes(".vcard .brewery_type") %>% html_text() %>% stringr::str_replace_all("^Type: ", ""), 
    website = page %>% html_nodes(".vcard .url a") %>% html_attr("href") 
) 

Answer

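Instead of parsing all of the tags in one pass, I first split the page into a list of div.brewery nodes and then extract the requested information from each brewery individually. Because html_node() (singular) returns exactly one result per parent node, a div that lacks an element yields NA rather than being skipped, so every column ends up the same length. This is not as efficient, but it keeps each value tied to its parent. The approach assumes at most one matching child per parent, i.e. each div.brewery has only one name, one address, and one website.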

library(rvest) 
library(dplyr)  # provides tibble()

html_file <- '<!DOCTYPE html> 
<html> 
<head> 
<title>Page Title</title> 
</head> 

<body> 
<div class="brewery" id="brewery"> 
<ul class="vcard simple"> 
<li class="name"> Bradley Farm/RB Brew, LLC</li> 
<li class="address">317 Springtown Rd </li> 
<li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li> 
<li class="telephone">Phone: (845) 255-8769</li> 
<li class="brewery_type">Type: Micro</li> 
<li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li> 
</ul> 
<ul class="vcard simple col2"></ul> 
</div> 
<div class="brewery"> 
<ul class="vcard simple"> 
<li class="name">(405) Brewing Co</li> 
<li class="address">1716 Topeka St </li> 
<li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li> 
<li class="telephone">Phone: (405) 816-0490</li> 

<li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li> 
</ul> 
<ul class="vcard simple col2"></ul> 
</div> 
</body>' 

page <- read_html(html_file) 

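# One node per brewery; every lookup below is made relative to these parents.
breweries <- page %>% html_nodes("div.brewery")

# html_node() (singular) returns one result per parent;
# a missing child becomes NA instead of being dropped.
name    <- breweries %>% html_node(".vcard .name") %>% html_text()
address <- breweries %>% html_node(".vcard .address") %>% html_text()
type    <- breweries %>% html_node(".vcard .brewery_type") %>% html_text()
type    <- gsub("^Type: ", "", type)
# Take the href attribute, as in the question, rather than the link text.
website <- breweries %>% html_node(".vcard .url a") %>% html_attr("href")

tibble(name, address, type, website)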
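
To make the length mismatch concrete, here is a minimal sketch that compares the page-wide html_nodes() lookup with the per-parent html_node() lookup. The cut-down HTML and the helper name extract_field are assumptions made up for illustration; they are not part of the original page or of rvest.

```r
library(rvest)
library(dplyr)

# Cut-down sample (assumed for illustration): the second brewery has no brewery_type.
html <- '<html><body>
<div class="brewery"><ul class="vcard simple">
<li class="name">Bradley Farm/RB Brew, LLC</li>
<li class="brewery_type">Type: Micro</li>
</ul></div>
<div class="brewery"><ul class="vcard simple">
<li class="name">(405) Brewing Co</li>
</ul></div>
</body></html>'

page <- read_html(html)
breweries <- page %>% html_nodes("div.brewery")

# Page-wide lookup: absent elements are simply not matched, so the lengths differ.
n_names <- length(page %>% html_nodes(".vcard .name"))
n_types <- length(page %>% html_nodes(".vcard .brewery_type"))

# Hypothetical helper: one value per parent div, NA where the child is absent.
extract_field <- function(parents, css) {
  parents %>% html_node(css) %>% html_text(trim = TRUE)
}

result <- tibble(
  name = extract_field(breweries, ".vcard .name"),
  type = sub("^Type: ", "", extract_field(breweries, ".vcard .brewery_type"))
)
print(result)
```

Here n_names and n_types disagree, which is what breaks the tibble in the question, while result has one row per div.brewery with NA in type for the second brewery. In rvest 1.0 and later the same functions are also available as html_elements() and html_element(); the older names used here should still work.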