Isolating data from a single XML nodeset in R with xml2

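I would like to iteratively isolate and operate on nodesets from an XML document, but I am getting strange behavior from the xml_find_all() function in the xml2 package in R. Can someone help me understand the scope of functions applied to a nodeset?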

Here is an example:

library(xml2) 
library(dplyr) 

doc <- read_xml("<MEMBERS> 
        <CUSTOMER> 
         <ID>178</ID> 
         <FIRST.NAME>Alvaro</FIRST.NAME> 
         <LAST.NAME>Juarez</LAST.NAME> 
         <ADDRESS>123 Park Ave</ADDRESS> 
         <ZIP>57701</ZIP> 
        </CUSTOMER> 
        <CUSTOMER> 
         <ID>934</ID> 
         <FIRST.NAME>Janette</FIRST.NAME> 
         <LAST.NAME>Johnson</LAST.NAME> 
         <ADDRESS>456 Candy Ln</ADDRESS> 
         <ZIP>57701</ZIP> 
        </CUSTOMER> 
        </MEMBERS>" ) 

doc %>% xml_find_all('//*') %>% xml_path() 
# [1] "/MEMBERS"      "/MEMBERS/CUSTOMER[1]"   
# [3] "/MEMBERS/CUSTOMER[1]/ID"   "/MEMBERS/CUSTOMER[1]/FIRST.NAME" 
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS" 
# [7] "/MEMBERS/CUSTOMER[1]/ZIP"  "/MEMBERS/CUSTOMER[2]"   
# [9] "/MEMBERS/CUSTOMER[2]/ID"   "/MEMBERS/CUSTOMER[2]/FIRST.NAME" 
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS" 
#[13] "/MEMBERS/CUSTOMER[2]/ZIP"

The object customer.01 is a nodeset that contains data from that customer only.

kids <- xml_children(doc) 

customer.01 <- kids[[1]] 

customer.01 
# {xml_node} 
# <CUSTOMER> 
# [1] <ID>178</ID> 
# [2] <FIRST.NAME>Alvaro</FIRST.NAME> 
# [3] <LAST.NAME>Juarez</LAST.NAME> 
# [4] <ADDRESS>123 Park Ave</ADDRESS> 
# [5] <ZIP>57701</ZIP>

Why does this function, applied to the customer.01 nodeset, also return the ID for customer.02?

xml_find_all(customer.01, "//MEMBERS/CUSTOMER/ID") 
# {xml_nodeset (2)} 
# [1] <ID>178</ID> 
# [2] <ID>934</ID>

How do I return only the values from that nodeset?

~~~

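OK, so here is a small wrinkle in the solution below, again related to the scope of the xml_find_all() function. The documentation says it can be applied to a document, a node, or a nodeset.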

library(xml2) 
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml" 
doc <- read_xml(url) 
xml_ns_strip(doc) 
nd <- xml_find_all(doc, "//LiquidationOfAssetsDetail|//LiquidationDetail") 

nodei <- nd[[1]] 
nodei 
# {xml_node} 
# <LiquidationOfAssetsDetail> 
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc> 
# [2] <DistributionDt>2014-11-04</DistributionDt> 
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt> 
# [4] <EIN>abcdefghi</EIN> 
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName> 
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ... 
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt> 

xml_text(xml_find_all(nodei, "AssetsDistriOrExpnssPaidDesc")) 
# [1] "LAND"

That case, applied to a single node, works. But this one, applied to a nodeset, does not:

nodei <- xml_children(nd[[1]]) 
nodei 
# {xml_nodeset (7)} 
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc> 
# [2] <DistributionDt>2014-11-04</DistributionDt> 
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt> 
# [4] <EIN>abcdefghi</EIN> 
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName> 
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ... 
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt> 

xml_text(xml_find_all(nodei, "AssetsDistriOrExpnssPaidDesc")) 
# character(0)

I am guessing this is a matter of xml_find_all() being applied to every element of the nodeset, rather than a scope problem?
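A small sketch that seems consistent with that guess. It uses a hypothetical two-field miniature of the structure above, not the real IRS filing. When the context is the nodeset of children, a relative path such as "AssetsDistriOrExpnssPaidDesc" looks for children of each of those nodes, and none of them has a child with that name. Matching the context node itself needs the XPath self:: axis, or a filter on xml_name().

```r
library(xml2)

# Hypothetical miniature document, for illustration only.
doc <- read_xml("<Return>
  <LiquidationOfAssetsDetail>
    <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
    <DistributionDt>2014-11-04</DistributionDt>
  </LiquidationOfAssetsDetail>
</Return>")

node <- xml_find_first(doc, "//LiquidationOfAssetsDetail")
kids <- xml_children(node)  # nodeset holding the two child elements

# Context = parent node: the element is a child of the context, so it matches.
from_parent <- xml_text(xml_find_all(node, "AssetsDistriOrExpnssPaidDesc"))

# Context = each child in turn: the path asks for children of the children,
# and there are none with that name.
from_kids <- xml_text(xml_find_all(kids, "AssetsDistriOrExpnssPaidDesc"))

# The self:: axis tests the context node itself rather than its children.
from_self <- xml_text(xml_find_all(kids, "self::AssetsDistriOrExpnssPaidDesc"))

# Alternative without XPath: subset the nodeset by element name.
by_name <- xml_text(kids[xml_name(kids) == "AssetsDistriOrExpnssPaidDesc"])
```

Here from_kids comes back empty while the other three results contain "LAND", which matches the character(0) seen above.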

Source

2017-08-24 why.knot

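Currently, you are searching with an absolute path from the root, using XPath's double forward slash, //. That means "find every item in the document that fits this path", which includes both customers' IDs.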

To get particular children of a particular node, simply use a relative path under the selected node:

xml_find_all(customer.01, "ID") 
# {xml_nodeset (1)} 
# [1] <ID>178</ID> 

xml_find_all(customer.01, "FIRST.NAME|LAST.NAME") 
# {xml_nodeset (2)} 
# [1] <FIRST.NAME>Alvaro</FIRST.NAME> 
# [2] <LAST.NAME>Juarez</LAST.NAME> 

xml_find_all(customer.01, "*") 
# {xml_nodeset (5)} 
# [1] <ID>178</ID> 
# [2] <FIRST.NAME>Alvaro</FIRST.NAME> 
# [3] <LAST.NAME>Juarez</LAST.NAME> 
# [4] <ADDRESS>123 Park Ave</ADDRESS> 
# [5] <ZIP>57701</ZIP>
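Since the question mentions working through the customers one after another, here is a minimal sketch of the same relative-path idea applied to every CUSTOMER node at once. It assumes each customer has exactly one of each field; the column names id, first and last are made up for illustration. xml_find_first() returns one match per context node, so the results line up row by row.

```r
library(xml2)

# Trimmed copy of the example document from the question.
doc <- read_xml("<MEMBERS>
  <CUSTOMER>
    <ID>178</ID><FIRST.NAME>Alvaro</FIRST.NAME><LAST.NAME>Juarez</LAST.NAME>
  </CUSTOMER>
  <CUSTOMER>
    <ID>934</ID><FIRST.NAME>Janette</FIRST.NAME><LAST.NAME>Johnson</LAST.NAME>
  </CUSTOMER>
</MEMBERS>")

customers <- xml_find_all(doc, "/MEMBERS/CUSTOMER")

# Each relative path is evaluated once per CUSTOMER node, so every vector
# has one entry per customer, in document order.
members <- data.frame(
  id    = xml_text(xml_find_first(customers, "ID")),
  first = xml_text(xml_find_first(customers, "FIRST.NAME")),
  last  = xml_text(xml_find_first(customers, "LAST.NAME")),
  stringsAsFactors = FALSE
)
```

If a field can repeat or be missing, a per-customer loop with xml_find_all() would be the safer choice.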

Source

2017-08-24 18:38:34 Parfait

Perfect! Thank you! –

Great! Glad to help. By the way, there is a special way to say [thanks on StackOverflow](https://meta.stackexchange.com/a/5235). – Parfait

This also works: 'xml_find_all(customer.01, ".//ID")', whereas this returns all cases: 'xml_find_all(customer.01, "//ID")'. But I don't see an advantage over the solution: 'xml_find_all(customer.01, "ID")' –
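For the flat document in the question, "ID" and ".//ID" give the same result, so there is indeed nothing to choose between them there. The difference only shows once elements are nested. A sketch, assuming a hypothetical variant of the document in which the first customer has an ACCOUNT element with its own ID:

```r
library(xml2)

# Hypothetical variant of the example: customer 1 gets a nested <ACCOUNT><ID>.
doc <- read_xml("<MEMBERS>
  <CUSTOMER>
    <ID>178</ID>
    <ACCOUNT><ID>A-1</ID></ACCOUNT>
  </CUSTOMER>
  <CUSTOMER>
    <ID>934</ID>
  </CUSTOMER>
</MEMBERS>")

customer.01 <- xml_children(doc)[[1]]

# "ID": direct children of the context node only.
child_only <- xml_text(xml_find_all(customer.01, "ID"))

# ".//ID": descendants of the context node at any depth.
descendants <- xml_text(xml_find_all(customer.01, ".//ID"))

# "//ID": starts from the document root, so the context node is ignored.
whole_doc <- xml_text(xml_find_all(customer.01, "//ID"))
```

Here child_only holds just "178", descendants adds the nested "A-1", and whole_doc also picks up "934" from the second customer.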
