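Isolating data from a single XML nodeset in xml2 in R
I want to iteratively isolate and manipulate node sets from an XML document, but I am getting strange behavior from the xml_find_all() function in the xml2 package in R. Can someone help me understand the scope of functions applied to a node set?
Here is an example: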
library(xml2)
library(dplyr)
doc <- read_xml("<MEMBERS>
<CUSTOMER>
<ID>178</ID>
<FIRST.NAME>Alvaro</FIRST.NAME>
<LAST.NAME>Juarez</LAST.NAME>
<ADDRESS>123 Park Ave</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
<CUSTOMER>
<ID>934</ID>
<FIRST.NAME>Janette</FIRST.NAME>
<LAST.NAME>Johnson</LAST.NAME>
<ADDRESS>456 Candy Ln</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
</MEMBERS>" )
doc %>% xml_find_all('//*') %>% xml_path()
# [1] "/MEMBERS" "/MEMBERS/CUSTOMER[1]"
# [3] "/MEMBERS/CUSTOMER[1]/ID" "/MEMBERS/CUSTOMER[1]/FIRST.NAME"
# [5] "/MEMBERS/CUSTOMER[1]/LAST.NAME" "/MEMBERS/CUSTOMER[1]/ADDRESS"
# [7] "/MEMBERS/CUSTOMER[1]/ZIP" "/MEMBERS/CUSTOMER[2]"
# [9] "/MEMBERS/CUSTOMER[2]/ID" "/MEMBERS/CUSTOMER[2]/FIRST.NAME"
#[11] "/MEMBERS/CUSTOMER[2]/LAST.NAME" "/MEMBERS/CUSTOMER[2]/ADDRESS"
#[13] "/MEMBERS/CUSTOMER[2]/ZIP"
The object customer.01 is a node set containing data from only that one customer.
kids <- xml_children(doc)
customer.01 <- kids[[1]]
customer.01
# {xml_node}
# <CUSTOMER>
# [1] <ID>178</ID>
# [2] <FIRST.NAME>Alvaro</FIRST.NAME>
# [3] <LAST.NAME>Juarez</LAST.NAME>
# [4] <ADDRESS>123 Park Ave</ADDRESS>
# [5] <ZIP>57701</ZIP>
Why does the function, applied to the customer.01 node set, also return the ID for customer.02?
xml_find_all(customer.01, "//MEMBERS/CUSTOMER/ID")
# {xml_nodeset (2)}
# [1] <ID>178</ID>
# [2] <ID>934</ID>
How do I return only the values from that node set?
~~~
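OK, so here is a small wrinkle in the solution below, again related to the scope of the xml_find_all() function. The documentation says it can be applied to a document, a node, or a node set. However, this case works: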
library(xml2)
url <- "https://s3.amazonaws.com/irs-form-990/201501279349300635_public.xml"
doc <- read_xml(url)
xml_ns_strip(doc)
nd <- xml_find_all(doc, "//LiquidationOfAssetsDetail|//LiquidationDetail")
nodei <- nd[[1]]
nodei
# {xml_node}
# <LiquidationOfAssetsDetail>
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text(xml_find_all(nodei, "AssetsDistriOrExpnssPaidDesc"))
# [1] "LAND"
But this one does not, when applied to a node set:
nodei <- xml_children(nd[[1]])
nodei
# {xml_nodeset (7)}
# [1] <AssetsDistriOrExpnssPaidDesc>LAND</AssetsDistriOrExpnssPaidDesc>
# [2] <DistributionDt>2014-11-04</DistributionDt>
# [3] <MethodOfFMVDeterminationTxt>SEE ATTACH</MethodOfFMVDeterminationTxt>
# [4] <EIN>abcdefghi</EIN>
# [5] <BusinessName>\n <BusinessNameLine1Txt>GREENSBURG PUBLIC LIBRARY</BusinessNameLine1Txt>\n</BusinessName>
# [6] <USAddress>\n <AddressLine1Txt>1110 E MAIN ST</AddressLine1Txt>\n <CityNm>GREENSBURG</CityNm>\n <StateAbbreviationCd>IN</StateAb ...
# [7] <IRCSectionTxt>501(C)(3)</IRCSectionTxt>
xml_text(xml_find_all(nodei, "AssetsDistriOrExpnssPaidDesc"))
# character(0)
I guess this is a matter of xml_find_all() being applied to every element of the node set, rather than a scope problem?
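That guess matches how relative XPath works: a bare element name is shorthand for the child:: axis, so when the context is a set of children, each child is searched for a child of that name and nothing matches. Here is a minimal sketch of the behavior, assuming a small stand-in document (the ROOT, DETAIL, DESC and DATE names are made up for illustration and are not from the IRS file):

```r
library(xml2)

# Hypothetical stand-in document, not the IRS filing from the question
doc <- read_xml("<ROOT>
  <DETAIL>
    <DESC>LAND</DESC>
    <DATE>2014-11-04</DATE>
  </DETAIL>
</ROOT>")

node <- xml_find_first(doc, "//DETAIL")   # a single xml_node
kids <- xml_children(node)                # an xml_nodeset holding DESC and DATE

# Relative to the parent node, "DESC" means child::DESC, so it matches
from_node <- xml_text(xml_find_all(node, "DESC"))

# Relative to each child, "DESC" looks for a DESC nested inside DESC or DATE,
# so the result is empty, as in the question
from_kids <- xml_text(xml_find_all(kids, "DESC"))

# The self:: axis tests the nodes of the set themselves
from_self <- xml_text(xml_find_all(kids, "self::DESC"))

# Alternative without XPath: filter the set by element name
by_name <- xml_text(kids[xml_name(kids) == "DESC"])
```

So either keep the parent node as the context, or use self:: (or a name filter) when you already hold the children.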
Perfect! Thank you!
Great! Glad to help. By the way, there is a special way to say [thanks on Stack Overflow](https://meta.stackexchange.com/a/5235). – Parfait
This also works: 'xml_find_all(customer.01, ".//ID")', whereas this returns all cases: 'xml_find_all(customer.01, "//ID")'. But I don't see its advantage over the solution: 'xml_find_all(customer.01, "ID")'
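For anyone comparing the three forms from the comment above, here is a minimal sketch on a cut-down copy of the MEMBERS document from the question (shortened here, so treat the data as illustrative). A path starting with // is absolute and starts at the document root regardless of the node passed in, while "ID" and ".//ID" are both relative to that node; they differ only in depth.

```r
library(xml2)

# Cut-down version of the MEMBERS document from the question
doc <- read_xml("<MEMBERS>
  <CUSTOMER><ID>178</ID><ZIP>57701</ZIP></CUSTOMER>
  <CUSTOMER><ID>934</ID><ZIP>57701</ZIP></CUSTOMER>
</MEMBERS>")

customer.01 <- xml_children(doc)[[1]]

# "//ID" is absolute: it searches from the document root,
# so it also finds the ID of the second customer
all_ids <- xml_text(xml_find_all(customer.01, "//ID"))

# "ID" is relative: only direct children of customer.01
child_ids <- xml_text(xml_find_all(customer.01, "ID"))

# ".//ID" is relative too, but matches descendants at any depth
desc_ids <- xml_text(xml_find_all(customer.01, ".//ID"))
```

With this flat structure "ID" and ".//ID" give the same result; ".//ID" only matters when the element may be nested deeper than one level below the context node.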