R: read and parse HTML into a list

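I've been trying to read and parse a bit of HTML to obtain a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I seem to be getting nowhere fast.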

Here is a snippet of the HTML:

<select multiple="true" name="asilomarCondition" id="asilomarCondition"> 

    <option value="101"> 
     Behavior- Aggression, Confrontational-Toward People (mild) 
     - 
     TM</option> 
.... 
</select>

There is only one <select...> tag; everything else is an <option value=x> element.

I've been using the XML library. I can remove the newlines and tabs, but I haven't had any success removing the tags:

conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n") 
conditions.text <- gsub('[\t\n]',"",conditions.html)

As an end result, I'd like a list of all the conditions that I can process further and later use as factor names:

Behavior- Aggression, Confrontational-Toward People (mild)-TM 
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU 
...

I'm not sure whether I need the XML library (or another library) or whether gsub patterns would be enough (either way, I need to work out how to use them).

Source

2016-08-11 kimbekaw

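Can you point to the full URL with that select box, or expand the snippet? – hrbrmstr

I find the rvest package easier to use. If you can provide a link to the site, someone can write up a solution for you. – Dave2e

It is HTML. It's a select list in a form @alistaire – hrbrmstr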

Here is a start using the rvest package:

library(rvest) 
#read the html page 
page<-read_html("test.html") 
#get the text from the "option" nodes and then trim the whitespace 
nodes<-trimws(html_text(html_nodes(page, "option"))) 

#nodes will need additional clean up to remove the excessive spaces 
#and newline characters 
nodes<-gsub("\n", "", nodes) 
nodes<-gsub(" {2,}", "", nodes)

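The vector nodes should be the result you asked for. This example is based on the limited sample provided above, so the actual page may produce unexpected results.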
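For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same idea. It assumes the option text always has the layout shown in the question (description, a line with "-", then the code). The second option is reconstructed from the desired output in the question, and its value "102" is made up for illustration. The names html, raw_text, conditions and condition_factor are my own.

```r
library(rvest)

# Inline copy of the snippet from the question. The second <option> is
# reconstructed from the desired output; its value ("102") is an assumption.
html <- '<select multiple="true" name="asilomarCondition" id="asilomarCondition">
    <option value="101">
     Behavior- Aggression, Confrontational-Toward People (mild)
     -
     TM</option>
    <option value="102">
     Behavior- Aggression, Confrontational-Toward People (moderate/severe)
     -
     UU</option>
</select>'

# read_html() accepts literal HTML as well as a file path or URL
page <- read_html(html)

# Text content of every <option> node, still containing line breaks
raw_text <- html_text(html_nodes(page, "option"))

# Trim the ends, then drop each line break together with the
# whitespace on either side of it
conditions <- trimws(raw_text)
conditions <- gsub("[[:space:]]*\n[[:space:]]*", "", conditions)

# Cleaned strings as factor levels for later use
condition_factor <- factor(conditions)
print(conditions)
```

Removing only the whitespace around line breaks leaves the single spaces inside each description untouched, which a blanket replacement of every space would not.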

Source

2016-08-12 23:02:04 Dave2e

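Thanks, @Dave2e! This worked perfectly! I still have a few extra characters to clean up, but that's easy to handle with your example. On to the rest of the data cleaning! :o – kimbekaw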
