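I would be grateful if someone could tell me how to extract data from XML into R. Below is an example of one compound from my XML file; the real file contains several hundred such compounds. I know several similar questions have been posted, but so far I have not been able to adapt the previous answers to my needs. For example, I can extract data from the XML into a data frame using: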
library(XML)
doc <- xmlParse("isotope information.xml")
# one row per <isotope>, with its child elements as columns
xmlToDataFrame(
  getNodeSet(doc, "//isotope"),
  colClasses = c("character", "numeric")
)
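This extracts a long list of "mz" and "abundance" values, but these are of no use unless they are linked to the relevant compound, sample, and so on. The method also doesn't seem to work if I try to go further up the tree, which I think is partly because of the different types of information and/or the spaces in the names?
Any help is much appreciated. I'm new to R and had never heard of XPath until I started working with this file!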
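To make the goal concrete, here is a minimal sketch (not a tested solution for the full file) of linking each isotope to its compound and sample with the XML package. It assumes every `<isotope>` sits at the fixed depth compound > condition > sample > adduct > isotope, as in the example; `xml_text`, `isotope_row` and `mydata` are hypothetical names, and the XML is a trimmed copy of the sample with the closing `</compounds>` tag added so that it parses.

```r
library(XML)

# Trimmed copy of the sample file (two samples only), closed with </compounds>.
xml_text <- '<compounds>
  <compound identifier="24.24_355.2087m/z" retentionTime="24.2409">
    <condition name="Group A">
      <sample name="ACU_S1_D1_MSonly" normalizedAbundance="0.16176030585271">
        <adduct charge="2">
          <isotope><mz>355.131459235488</mz><abundance>0.115052197015018</abundance></isotope>
          <isotope><mz>355.704849713088</mz><abundance>0</abundance></isotope>
        </adduct>
      </sample>
    </condition>
    <condition name="Group B">
      <sample name="ACU_S2_D1_MSonly" normalizedAbundance="8.08281443709004">
        <adduct charge="2">
          <isotope><mz>355.217085869147</mz><abundance>6.34168970755279</abundance></isotope>
          <isotope><mz>355.720179758869</mz><abundance>1.01208656740541</abundance></isotope>
        </adduct>
      </sample>
    </condition>
  </compound>
</compounds>'
doc <- xmlParse(xml_text, asText = TRUE)

# Build one row per <isotope>, reading the context from its ancestor nodes.
# Assumption: the nesting depth is always the same as in the example.
isotope_row <- function(iso) {
  adduct    <- xmlParent(iso)
  sample    <- xmlParent(adduct)
  condition <- xmlParent(sample)
  compound  <- xmlParent(condition)
  data.frame(
    compound          = xmlGetAttr(compound, "identifier"),
    retention_time    = as.numeric(xmlGetAttr(compound, "retentionTime")),
    condition         = xmlGetAttr(condition, "name"),
    sample_name       = xmlGetAttr(sample, "name"),
    isotope_mz        = as.numeric(xmlValue(iso[["mz"]])),
    isotope_abundance = as.numeric(xmlValue(iso[["abundance"]])),
    stringsAsFactors  = FALSE
  )
}

rows   <- lapply(getNodeSet(doc, "//isotope"), isotope_row)
mydata <- do.call(rbind, rows)
str(mydata)
```

Building one small data frame per isotope is easy to follow but is likely to be slow on a large file, so this is meant only to show the linkage, not as something to run unchanged on the real data.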
<?xml version="1.0" encoding="utf-8"?>
<compounds>
<compound identifier="24.24_355.2087m/z" retentionTime="24.2409">
<statistics>
<anova>0.0013522641768629606</anova>
<maxFoldChange>18.444703223432118</maxFoldChange>
<mean lowest="Group A" highest="Group B" />
</statistics>
<condition name="Group A">
<sample name="ACU_S1_D1_MSonly" normalizedAbundance="0.16176030585271">
<adduct charge="2">
<isotope>
<mz>355.131459235488</mz>
<abundance>0.115052197015018</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S4_D1_MSonly" normalizedAbundance="0.648153833258576">
<adduct charge="2">
<isotope>
<mz>355.210174560547</mz>
<abundance>0.45734640955925</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S7_D1_MSonly" normalizedAbundance="0">
<adduct charge="2">
<isotope>
<mz>355.206065493636</mz>
<abundance>0</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S9_D1_MSonly" normalizedAbundance="0">
<adduct charge="2">
<isotope>
<mz>355.206065493636</mz>
<abundance>0</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S10_D1_MSonly" normalizedAbundance="1.40543741447065">
<adduct charge="2">
<isotope>
<mz>355.222929359468</mz>
<abundance>0.998472798001696</abundance>
</isotope>
<isotope>
<mz>355.785247802734</mz>
<abundance>0.00450361325390688</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S11_D1_MSonly" normalizedAbundance="0">
<adduct charge="2">
<isotope>
<mz>355.206065493636</mz>
<abundance>0</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S14_D1_MSonly" normalizedAbundance="0">
<adduct charge="2">
<isotope>
<mz>355.206065493636</mz>
<abundance>0</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S17_D1_MSonly" normalizedAbundance="0">
<adduct charge="2">
<isotope>
<mz>355.206065493636</mz>
<abundance>0</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
</condition>
<condition name="Group B">
<sample name="ACU_S2_D1_MSonly" normalizedAbundance="8.08281443709004">
<adduct charge="2">
<isotope>
<mz>355.217085869147</mz>
<abundance>6.34168970755279</abundance>
</isotope>
<isotope>
<mz>355.720179758869</mz>
<abundance>1.01208656740541</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S3_D1_MSonly" normalizedAbundance="1.74468788905785">
<adduct charge="2">
<isotope>
<mz>355.236865028724</mz>
<abundance>1.25719554540164</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S5_D1_MSonly" normalizedAbundance="1.20519908118674">
<adduct charge="2">
<isotope>
<mz>355.221413778655</mz>
<abundance>0.693123193025995</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S6_D1_MSonly" normalizedAbundance="11.8264838326202">
<adduct charge="2">
<isotope>
<mz>355.208446325351</mz>
<abundance>5.67846393951768</abundance>
</isotope>
<isotope>
<mz>355.712529790798</mz>
<abundance>0.718700468540192</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S12_D1_MSonly" normalizedAbundance="6.62039336582067">
<adduct charge="2">
<isotope>
<mz>355.195225774627</mz>
<abundance>4.80023810084345</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S13_D1_MSonly" normalizedAbundance="9.10340543014277">
<adduct charge="2">
<isotope>
<mz>355.231293658837</mz>
<abundance>8.75476514173928</abundance>
</isotope>
<isotope>
<mz>355.73683673041</mz>
<abundance>1.118534732035</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S15_D1_MSonly" normalizedAbundance="0">
<adduct charge="2">
<isotope>
<mz>355.206065493636</mz>
<abundance>0</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
<sample name="ACU_S16_D1_MSonly" normalizedAbundance="2.27851790546988">
<adduct charge="2">
<isotope>
<mz>355.242192813064</mz>
<abundance>1.25391817825056</abundance>
</isotope>
<isotope>
<mz>355.704849713088</mz>
<abundance>0</abundance>
</isotope>
</adduct>
</sample>
</condition>
</compound>
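UPDATE to the original post: Hi again, and many thanks for your initial help. I have tried to build on the answers using XML and xml2 to get the data frame I need, but I am still struggling, so I am adding more information...
I have worked out the structure of the XML document: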
# load necessary package(s)
library(XML)
# parse the XML file into an R object called xmlfile
# (useInternalNodes = TRUE is needed to get the internal classes shown below)
xmlfile = xmlTreeParse("QI isotope information.xml", useInternalNodes = TRUE)
# check that the xmlfile object is recognised as an XML class
class(xmlfile) # the output should be: "XMLInternalDocument" "XMLAbstractDocument"
# find the root of the xml file
xmltop = xmlRoot(xmlfile)
class(xmltop) # "XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) # "compounds"
xmlSize(xmltop) # 4278
# the root of the xmlfile is "compounds" and it has 4278 children
# to view the content of the first child use:
xmltop[[1]]
# this contains all of the information from a unique compound identifier:
# <compound identifier="106.16_603.4571m/z" retentionTime="106.16268333333333">
# <statistics>
# <anova>1.1102230246251565E-16</anova>
# <maxFoldChange>321.93091917042375</maxFoldChange>
# <mean lowest="D9" highest="D1"/>
# </statistics>
# <condition name="D1">
# <sample name="ACU_S1_D1_MSonly" normalizedAbundance="2016.23926856296">
# <adduct charge="1">
# <isotope>
# <mz>603.509454467435</mz>
# <abundance>1017.28655636311</abundance>
# </isotope>
# <isotope>
# <mz>604.51484984744</mz>
# <abundance>346.272257983685</abundance>
# </isotope>
# <isotope>
# <mz>605.519216627667</mz>
# <abundance>64.8701884746552</abundance>
# </isotope>
# </adduct>
# </sample>
# N.B. this list is repeated for each sample name, in this case n=64 samples
xmlSize(xmltop[[1]]) # gives the number of child nodes of the first compound, in this case n=5
xmlSApply(xmltop[[1]], xmlName) # gives the names of these 5 nodes
# statistics condition condition condition condition
# "statistics" "condition" "condition" "condition" "condition"
xmlSApply(xmltop[[1]], as.list)
xmltop[[1]][[1]] # takes you to the statistics output:
# <statistics>
# <anova>1.1102230246251565E-16</anova>
# <maxFoldChange>321.93091917042375</maxFoldChange>
# <mean lowest="D9" highest="D1"/>
# </statistics>
xmltop[[1]][[2]] # takes you to the "condition" level, i.e. condition name="D1"
xmltop[[1]][[2]][[1]] # takes you to the "sample" level, i.e. sample name="ACU_S1_D1_MSonly"
xmltop[[1]][[2]][[2]] # takes you to the "sample" level number 2, i.e. sample name="ACU_S2_D1_MSonly"
xmltop[[1]][[2]][[1]][[1]] # takes you to the "adduct" level, i.e. adduct charge="1"
xmltop[[1]][[2]][[1]][[1]][[1]] # takes you to the "isotope" level, which includes m/z and abundance
# incrementing the last index number takes you to each isotope for that compound
# for example:
xmltop[[1]][[2]][[1]][[1]][[1]][[1]] # <mz>603.509454467435</mz>
xmltop[[1]][[2]][[1]][[1]][[1]][[2]] # <abundance>1017.28655636311</abundance>
xmltop[[1]][[2]][[1]][[1]][[2]][[1]] # <mz>604.51484984744</mz>
xmltop[[1]][[2]][[1]][[1]][[2]][[2]] # <abundance>346.272257983685</abundance>
xmltop[[1]][[2]][[1]][[1]][[3]][[1]] # <mz>605.519216627667</mz>
xmltop[[1]][[2]][[1]][[1]][[3]][[2]] # <abundance>64.8701884746552</abundance>
xmltop[[1]][[2]][[1]][[1]][[4]][[1]] # NULL
xmltop[[1]][[2]][[1]][[1]][[4]][[2]] # NULL
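The manual index stepping above could probably be replaced by relative XPath queries run from a single compound node. The following is a sketch under assumptions: `xml_text` reproduces only the one sample printed in the listing above (wrapped in closing tags so that it parses), `one_compound` and `per_sample` are hypothetical names, and every sample is assumed to contain at least one isotope.

```r
library(XML)

# The single sample shown in the listing above, with closing tags added.
xml_text <- '<compounds>
  <compound identifier="106.16_603.4571m/z" retentionTime="106.16268333333333">
    <condition name="D1">
      <sample name="ACU_S1_D1_MSonly" normalizedAbundance="2016.23926856296">
        <adduct charge="1">
          <isotope><mz>603.509454467435</mz><abundance>1017.28655636311</abundance></isotope>
          <isotope><mz>604.51484984744</mz><abundance>346.272257983685</abundance></isotope>
          <isotope><mz>605.519216627667</mz><abundance>64.8701884746552</abundance></isotope>
        </adduct>
      </sample>
    </condition>
  </compound>
</compounds>'
xmltop   <- xmlRoot(xmlParse(xml_text, asText = TRUE))
compound <- xmltop[[1]]

# ".//" makes the path relative to the node passed in, so the number of
# isotopes per sample does not need to be known in advance.
samples <- getNodeSet(compound, ".//sample")
per_sample <- lapply(samples, function(s) {
  data.frame(
    sample_name       = xmlGetAttr(s, "name"),
    isotope_mz        = as.numeric(xpathSApply(s, ".//isotope/mz", xmlValue)),
    isotope_abundance = as.numeric(xpathSApply(s, ".//isotope/abundance", xmlValue)),
    stringsAsFactors  = FALSE
  )
})
one_compound <- do.call(rbind, per_sample)
one_compound
```

The same function body could then be applied to each child of `xmltop` in turn, which avoids hard-coding the 1 to 4 isotopes per compound.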
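I am not interested in the statistics section, but I would like to create a data frame for which the str() output would look something like: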
# > str(mydata) # returns a summary of the type/ format of each column
# 'data.frame': n obs. of n variables:
# $ compound : Factor w/ n levels
# $ retention_time :
# $ condition : Factor w/ 4 levels "D1","D3","D6","D9":
# $ sample_name : Factor w/ 16 levels "ACU_S1_D1","ACU_S2_D1...:
# $ isotope_mz : num
# $ isotope_abundance : num
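My ultimate aim is to be able to extract the abundance of each isotope_mz for each of the 64 samples. In fact, knowing the condition is not essential, because it can be determined from the sample_name.
N.B. The XML file I am working with is 150 MB and has > 4000 compounds x 64 samples; each compound has 1 to 4 isotopes, for which I need the mz and abundance. Besides the R approach requested here, I have also searched for and tried a large number of XML converters, but none of them has been able to decipher the structure of this XML file.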
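For illustration, the target data frame described above might be built in one vectorised pass with xml2, which may scale better than row-by-row construction. This is a sketch under assumptions, not a verified solution for the 150 MB file: `xml_text` is a trimmed copy of the sample with `</compounds>` added, the variable names are hypothetical, and the XPath `ancestor::` axis is used to look up each isotope's enclosing compound, condition and sample.

```r
library(xml2)

# Trimmed copy of the sample file (two samples only), closed with </compounds>.
xml_text <- '<compounds>
  <compound identifier="24.24_355.2087m/z" retentionTime="24.2409">
    <condition name="Group A">
      <sample name="ACU_S1_D1_MSonly" normalizedAbundance="0.16176030585271">
        <adduct charge="2">
          <isotope><mz>355.131459235488</mz><abundance>0.115052197015018</abundance></isotope>
          <isotope><mz>355.704849713088</mz><abundance>0</abundance></isotope>
        </adduct>
      </sample>
    </condition>
    <condition name="Group B">
      <sample name="ACU_S2_D1_MSonly" normalizedAbundance="8.08281443709004">
        <adduct charge="2">
          <isotope><mz>355.217085869147</mz><abundance>6.34168970755279</abundance></isotope>
          <isotope><mz>355.720179758869</mz><abundance>1.01208656740541</abundance></isotope>
        </adduct>
      </sample>
    </condition>
  </compound>
</compounds>'
doc <- read_xml(xml_text)

# All isotope nodes, then one matching ancestor node per isotope.
iso            <- xml_find_all(doc, "//isotope")
compound_node  <- xml_find_first(iso, "ancestor::compound")
condition_node <- xml_find_first(iso, "ancestor::condition")
sample_node    <- xml_find_first(iso, "ancestor::sample")

mydata <- data.frame(
  compound          = xml_attr(compound_node, "identifier"),
  retention_time    = as.numeric(xml_attr(compound_node, "retentionTime")),
  condition         = xml_attr(condition_node, "name"),
  sample_name       = xml_attr(sample_node, "name"),
  isotope_mz        = as.numeric(xml_text(xml_find_first(iso, "mz"))),
  isotope_abundance = as.numeric(xml_text(xml_find_first(iso, "abundance"))),
  stringsAsFactors  = TRUE  # factors, to match the str() layout above
)
str(mydata)
```

Rows for one sample could then be pulled out with, for example, `subset(mydata, sample_name == "ACU_S2_D1_MSonly")`.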
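Could you give an example of the kind of result you need? Could you also elaborate on what you mean by "doesn't seem to work" (what does it fail to do?) and "further up the tree", with an example? – LarsH
I would use `XML::xmlToList()` and then parse the list as you see fit. Also, you are missing `</compounds>` at the bottom of your example XML file; it won't load without it. – dayne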
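A rough sketch of what the `xmlToList()` route might look like, under assumptions: `xmlToList()` stores a node's attributes in an element named `.attrs`, repeated child elements share the same list name, and the nesting is the same as in the example. The helper `children()` and the names `lst` and `from_list` are hypothetical, and `xml_text` is a trimmed copy of the sample with `</compounds>` added.

```r
library(XML)

# Trimmed copy of the sample file (two samples only), closed with </compounds>.
xml_text <- '<compounds>
  <compound identifier="24.24_355.2087m/z" retentionTime="24.2409">
    <condition name="Group A">
      <sample name="ACU_S1_D1_MSonly" normalizedAbundance="0.16176030585271">
        <adduct charge="2">
          <isotope><mz>355.131459235488</mz><abundance>0.115052197015018</abundance></isotope>
          <isotope><mz>355.704849713088</mz><abundance>0</abundance></isotope>
        </adduct>
      </sample>
    </condition>
    <condition name="Group B">
      <sample name="ACU_S2_D1_MSonly" normalizedAbundance="8.08281443709004">
        <adduct charge="2">
          <isotope><mz>355.217085869147</mz><abundance>6.34168970755279</abundance></isotope>
          <isotope><mz>355.720179758869</mz><abundance>1.01208656740541</abundance></isotope>
        </adduct>
      </sample>
    </condition>
  </compound>
</compounds>'
lst <- xmlToList(xmlParse(xml_text, asText = TRUE))

# Hypothetical helper: keep only the list elements with a given element name
# (this also skips the ".attrs" entry that holds the node's attributes).
children <- function(x, name) x[names(x) == name]

rows <- list()
for (cmp in children(lst, "compound")) {
  for (cond in children(cmp, "condition")) {
    for (smp in children(cond, "sample")) {
      for (add in children(smp, "adduct")) {
        for (iso in children(add, "isotope")) {
          rows[[length(rows) + 1]] <- data.frame(
            compound          = cmp[[".attrs"]][["identifier"]],
            sample_name       = smp[[".attrs"]][["name"]],
            isotope_mz        = as.numeric(iso[["mz"]]),
            isotope_abundance = as.numeric(iso[["abundance"]]),
            stringsAsFactors  = FALSE
          )
        }
      }
    }
  }
}
from_list <- do.call(rbind, rows)
from_list
```

Converting the whole document to a nested list first keeps everything in plain R, at the cost of holding a second copy of the data in memory, which may matter for a file of this size.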
Thanks for your questions. I have added to the original post to include some of my progress so far and to better describe my goal. –