2017-08-24 60 views

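Isolating data from a single XML nodeset in R with xml2

I want to iteratively isolate and manipulate nodesets from XML documents, but I am getting strange behavior from the xml_find_all() function in the xml2 package in R. Can someone help me understand the scope of functions applied to a nodeset?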

Here is an example:

library(xml2) 
library(dplyr) 

doc <- read_xml("<MEMBERS> 
        <CUSTOMER> 
         <ID>178</ID> 
         <FIRST.NAME>Alvaro</FIRST.NAME> 
         <LAST.NAME>Juarez</LAST.NAME> 
         <ADDRESS>123 Park Ave</ADDRESS> 
         <ZIP>57701</ZIP> 
        </CUSTOMER> 
        <CUSTOMER> 
         <ID>934</ID> 
         <FIRST.NAME>Janette</FIRST.NAME> 
         <LAST.NAME>Johnson</LAST.NAME> 
         <ADDRESS>456 Candy Ln</ADDRESS> 
         <ZIP>57701</ZIP> 
        </CUSTOMER> 
        </MEMBERS>" ) 

doc %>% xml_find_all('//*') %>% xml_path() 
# [1] "/MEMBERS"      "/MEMBERS/CUSTOMER[1]"   
# [3] "/MEMBERS/CUSTOMER[1]/ID"   "/MEMBERS/CUSTOMER[1]/FIRST.NAME" 
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS" 
# [7] "/MEMBERS/CUSTOMER[1]/ZIP"  "/MEMBERS/CUSTOMER[2]"   
# [9] "/MEMBERS/CUSTOMER[2]/ID"   "/MEMBERS/CUSTOMER[2]/FIRST.NAME" 
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS" 
#[13] "/MEMBERS/CUSTOMER[2]/ZIP" 

The object customer.01 is a node containing data from only that customer.

kids <- xml_children(doc) 

customer.01 <- kids[[1]] 

customer.01 
# {xml_node} 
# <CUSTOMER> 
# [1] <ID>178</ID> 
# [2] <FIRST.NAME>Alvaro</FIRST.NAME> 
# [3] <LAST.NAME>Juarez</LAST.NAME> 
# [4] <ADDRESS>123 Park Ave</ADDRESS> 
# [5] <ZIP>57701</ZIP> 

Why does the function, applied to customer.01, also return the ID for customer.02?

xml_find_all(customer.01, "//MEMBERS/CUSTOMER/ID") 
# {xml_nodeset (2)} 
# [1] <ID>178</ID> 
# [2] <ID>934</ID> 

How do I return values from only that node?

~~~

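OK, so here is a small wrinkle in the solution below, again related to the scope of the xml_find_all() function. Its documentation says it can be applied to a document, node, or nodeset.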

library(xml2) 
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml" 
doc <- read_xml(url) 
xml_ns_strip(doc) 
nd <- xml_find_all(doc, "//LiquidationOfAssetsDetail|//LiquidationDetail") 

nodei <- nd[[1]] 
nodei 
# {xml_node} 
# <LiquidationOfAssetsDetail> 
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc> 
# [2] <DistributionDt>2014-11-04</DistributionDt> 
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt> 
# [4] <EIN>abcdefghi</EIN> 
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName> 
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ... 
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt> 

xml_text(xml_find_all(nodei, "AssetsDistriOrExpnssPaidDesc")) 
# [1] "LAND" 

However, while the case above works when applied to a node, this one does not when applied to a nodeset:

nodei <- xml_children(nd[[1]]) 
nodei 
# {xml_nodeset (7)} 
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc> 
# [2] <DistributionDt>2014-11-04</DistributionDt> 
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt> 
# [4] <EIN>abcdefghi</EIN> 
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName> 
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ... 
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt> 

xml_text(xml_find_all(nodei, "AssetsDistriOrExpnssPaidDesc")) 
# character(0) 

I am guessing this is a matter of xml_find_all() being applied to every element of the nodeset, rather than a scope problem?

Answer

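Currently, you are running an absolute path search from the root with XPath's double slash, //, which means: find every item in the document that matches this path. That includes both customers' IDs.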

To get specific children under a specific node, simply use a relative path under the selected node:

xml_find_all(customer.01, "ID") 
# {xml_nodeset (1)} 
# [1] <ID>178</ID> 

xml_find_all(customer.01, "FIRST.NAME|LAST.NAME") 
# {xml_nodeset (2)} 
# [1] <FIRST.NAME>Alvaro</FIRST.NAME> 
# [2] <LAST.NAME>Juarez</LAST.NAME> 

xml_find_all(customer.01, "*") 
# {xml_nodeset (5)} 
# [1] <ID>178</ID> 
# [2] <FIRST.NAME>Alvaro</FIRST.NAME> 
# [3] <LAST.NAME>Juarez</LAST.NAME> 
# [4] <ADDRESS>123 Park Ave</ADDRESS> 
# [5] <ZIP>57701</ZIP> 
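
The same relative-path reasoning plausibly explains the follow-up case in the question, where the search is applied to a nodeset of children. As far as I understand xml2, xml_find_all() evaluates the XPath relative to each node in the set, so a bare element name looks for grandchildren of the original node. The sketch below uses a small made-up document (the Return wrapper and its contents are an assumption standing in for the IRS filing) and shows two possible workarounds: the XPath self axis, or filtering on xml_name().

```r
library(xml2)

# Hypothetical miniature document standing in for the filing in the question.
doc <- read_xml("<Return>
  <LiquidationOfAssetsDetail>
    <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
    <DistributionDt>2014-11-04</DistributionDt>
  </LiquidationOfAssetsDetail>
</Return>")

detail <- xml_find_first(doc, "//LiquidationOfAssetsDetail")
kids <- xml_children(detail)  # nodeset holding the two child elements

# Relative to the parent node, the bare name matches its child.
from_node <- xml_text(xml_find_all(detail, "AssetsDistriOrExpnssPaidDesc"))

# Relative to each child, the bare name looks one level deeper: nothing there.
from_kids <- xml_text(xml_find_all(kids, "AssetsDistriOrExpnssPaidDesc"))

# The self axis tests each node of the set itself.
from_self <- xml_text(xml_find_all(kids, "self::AssetsDistriOrExpnssPaidDesc"))

# Filtering on the element name avoids XPath altogether.
by_name <- xml_text(kids[xml_name(kids) == "AssetsDistriOrExpnssPaidDesc"])
```

Here from_node, from_self, and by_name should each hold "LAND", while from_kids is empty, matching the character(0) seen in the question.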

Perfect! Thank you! –

Great! Glad to help. By the way, there is a special way to say [thanks on Stackoverflow](https://meta.stackexchange.com/a/5235). – Parfait

This also works: 'xml_find_all(customer.01, ".//ID")', while this returns all cases: 'xml_find_all(customer.01, "//ID")'. But I don't see an advantage over the solution: 'xml_find_all(customer.01, "ID")' –
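
For what it is worth, the three forms in this comment only give different results once an ID appears deeper than one level. A minimal sketch, assuming a hypothetical ORDER element with its own ID (not part of the original data):

```r
library(xml2)

# ORDER and its nested ID are invented here purely to show the difference.
doc <- read_xml("<MEMBERS>
  <CUSTOMER><ID>178</ID><ORDER><ID>A1</ID></ORDER></CUSTOMER>
  <CUSTOMER><ID>934</ID></CUSTOMER>
</MEMBERS>")

customer.01 <- xml_children(doc)[[1]]

# "ID": direct children of the context node only.
child_only <- xml_text(xml_find_all(customer.01, "ID"))

# ".//ID": every ID at any depth below the context node.
descendants <- xml_text(xml_find_all(customer.01, ".//ID"))

# "//ID": every ID in the whole document, ignoring the context node.
whole_doc <- xml_text(xml_find_all(customer.01, "//ID"))
```

With this input, child_only is "178", descendants is "178" and "A1", and whole_doc adds "934". On the flat data in the question, "ID" and ".//ID" happen to coincide.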