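I would be very grateful if someone could tell me how to extract data from XML into R. Below is an example of one compound from my XML file; the real file contains several hundred such compounds. I know several similar questions have been posted, but so far I have not been able to adapt the earlier answers to my needs. For example, I can extract data from the XML into a data frame using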

doc <- xmlParse("isotope information.xml") 
xmlToDataFrame(
    getNodeSet(doc, "//isotope"), 
    colClasses=c("character","numeric") 
)

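to extract very long lists of "mz" and "abundance" values, but these are of no use unless they are linked to the relevant compound, sample, and so on. This method also doesn't seem to work if I try to go further up the tree, which I think is partly because of the different types of information and/or the spaces in the names?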

Any help is much appreciated. I am new to R and had never heard of XPath until I started working with this file!

<?xml version="1.0" encoding="utf-8"?> 
<compounds> 
    <compound identifier="24.24_355.2087m/z" retentionTime="24.2409"> 
    <statistics> 
     <anova>0.0013522641768629606</anova> 
     <maxFoldChange>18.444703223432118</maxFoldChange> 
     <mean lowest="Group A" highest="Group B" /> 
    </statistics> 
    <condition name="Group A"> 
     <sample name="ACU_S1_D1_MSonly" normalizedAbundance="0.16176030585271"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.131459235488</mz> 
      <abundance>0.115052197015018</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S4_D1_MSonly" normalizedAbundance="0.648153833258576"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.210174560547</mz> 
      <abundance>0.45734640955925</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S7_D1_MSonly" normalizedAbundance="0"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.206065493636</mz> 
      <abundance>0</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S9_D1_MSonly" normalizedAbundance="0"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.206065493636</mz> 
      <abundance>0</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S10_D1_MSonly" normalizedAbundance="1.40543741447065"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.222929359468</mz> 
      <abundance>0.998472798001696</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.785247802734</mz> 
      <abundance>0.00450361325390688</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S11_D1_MSonly" normalizedAbundance="0"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.206065493636</mz> 
      <abundance>0</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S14_D1_MSonly" normalizedAbundance="0"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.206065493636</mz> 
      <abundance>0</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S17_D1_MSonly" normalizedAbundance="0"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.206065493636</mz> 
      <abundance>0</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
    </condition> 
    <condition name="Group B"> 
     <sample name="ACU_S2_D1_MSonly" normalizedAbundance="8.08281443709004"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.217085869147</mz> 
      <abundance>6.34168970755279</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.720179758869</mz> 
      <abundance>1.01208656740541</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S3_D1_MSonly" normalizedAbundance="1.74468788905785"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.236865028724</mz> 
      <abundance>1.25719554540164</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S5_D1_MSonly" normalizedAbundance="1.20519908118674"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.221413778655</mz> 
      <abundance>0.693123193025995</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S6_D1_MSonly" normalizedAbundance="11.8264838326202"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.208446325351</mz> 
      <abundance>5.67846393951768</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.712529790798</mz> 
      <abundance>0.718700468540192</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S12_D1_MSonly" normalizedAbundance="6.62039336582067"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.195225774627</mz> 
      <abundance>4.80023810084345</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S13_D1_MSonly" normalizedAbundance="9.10340543014277"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.231293658837</mz> 
      <abundance>8.75476514173928</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.73683673041</mz> 
      <abundance>1.118534732035</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S15_D1_MSonly" normalizedAbundance="0"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.206065493636</mz> 
      <abundance>0</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
     <sample name="ACU_S16_D1_MSonly" normalizedAbundance="2.27851790546988"> 
     <adduct charge="2"> 
      <isotope> 
      <mz>355.242192813064</mz> 
      <abundance>1.25391817825056</abundance> 
      </isotope> 
      <isotope> 
      <mz>355.704849713088</mz> 
      <abundance>0</abundance> 
      </isotope> 
     </adduct> 
     </sample> 
    </condition> 
    </compound>

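UPDATE from the original poster: Hi again, and many thanks for your initial help. I have tried to build on the answers using XML and xml2 to get the data frame I need, but I am still struggling, so I am adding more information...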

I have worked out the structure of the XML document:

# load necessary package(s) 
library(XML) 

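# parse the xml file into an R object called xmlfile 
# (xmlParse returns the internal-node classes shown in the comments below; 
# xmlTreeParse with its defaults returns an "XMLDocument" instead) 
xmlfile = xmlParse("QI isotope information.xml") 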


# check that the xmlfile object is recognised as an xml class 
class(xmlfile) # the output should be: "XMLInternalDocument" "XMLAbstractDocument" 

# find the root of the xml file 
xmltop = xmlRoot(xmlfile) 
class(xmltop) # "XMLInternalElementNode" "XMLInternalNode"  "XMLAbstractNode" 
xmlName(xmltop) # "compounds" 
xmlSize(xmltop) # 4278 

# the root of the xmlfile is "compounds" and it has 4278 children 
# to view the content of the first child use: 
xmltop[[1]] 

# this contains all of the information from a unique compound identifier: 
# <compound identifier="106.16_603.4571m/z" retentionTime="106.16268333333333"> 
# <statistics> 
# <anova>1.1102230246251565E-16</anova> 
# <maxFoldChange>321.93091917042375</maxFoldChange> 
# <mean lowest="D9" highest="D1"/> 
# </statistics> 
# <condition name="D1"> 
# <sample name="ACU_S1_D1_MSonly" normalizedAbundance="2016.23926856296"> 
#  <adduct charge="1"> 
#  <isotope> 
#   <mz>603.509454467435</mz> 
#   <abundance>1017.28655636311</abundance> 
#  </isotope> 
#  <isotope> 
#   <mz>604.51484984744</mz> 
#   <abundance>346.272257983685</abundance> 
#  </isotope> 
#  <isotope> 
#   <mz>605.519216627667</mz> 
#   <abundance>64.8701884746552</abundance> 
#  </isotope> 
#  </adduct> 
# </sample> 
# N.B. this list is repeated for each sample name, in this case n=64 samples 

xmlSize(xmltop[[1]]) # gives the number of child nodes of the first compound, in this case n=5 
xmlSApply(xmltop[[1]], xmlName) # gives the names of these 5 nodes 
# statistics condition condition condition condition 
# "statistics" "condition" "condition" "condition" "condition" 
xmlSApply(xmltop[[1]], as.list) 

xmltop[[1]][[1]] # takes you to the statistics output: 
# <statistics> 
# <anova>1.1102230246251565E-16</anova> 
# <maxFoldChange>321.93091917042375</maxFoldChange> 
# <mean lowest="D9" highest="D1"/> 
# </statistics> 

xmltop[[1]][[2]] # takes you to the "condition" level, i.e. condition name="D1" 

xmltop[[1]][[2]][[1]] # takes you to the "sample" level, i.e. sample name="ACU_S1_D1_MSonly" 

xmltop[[1]][[2]][[2]] # takes you to the "sample" level number 2, i.e. sample name="ACU_S2_D1_MSonly" 

xmltop[[1]][[2]][[1]][[1]] # takes you to the "charge" level, i.e. adduct charge="1" 

xmltop[[1]][[2]][[1]][[1]][[1]] # takes you to the "isotope" level, which includes m/z and abundance 

# incrementing the last index number takes you to each isotope for that compound 
# for example: 

xmltop[[1]][[2]][[1]][[1]][[1]][[1]] # <mz>603.509454467435</mz> 
xmltop[[1]][[2]][[1]][[1]][[1]][[2]] # <abundance>1017.28655636311</abundance> 
xmltop[[1]][[2]][[1]][[1]][[2]][[1]] # <mz>604.51484984744</mz> 
xmltop[[1]][[2]][[1]][[1]][[2]][[2]] # <abundance>346.272257983685</abundance> 
xmltop[[1]][[2]][[1]][[1]][[3]][[1]] # <mz>605.519216627667</mz> 
xmltop[[1]][[2]][[1]][[1]][[3]][[2]] # <abundance>64.8701884746552</abundance> 
xmltop[[1]][[2]][[1]][[1]][[4]][[1]] # NULL 
xmltop[[1]][[2]][[1]][[1]][[4]][[2]] # NULL

I am not interested in the statistics section, but I would like to create a data frame whose str() output would look something like:

# > str(mydata) # returns a summary of the type/ format of each column 
# 'data.frame': n obs. of n variables: 
# $ compound : Factor w/ n levels 
# $ retention_time : 
# $ condition : Factor w/ 4 levels "D1","D3","D6","D9": 
# $ sample_name : Factor w/ 16 levels "ACU_S1_D1","ACU_S2_D1...: 
# $ isotope_mz : num 
# $ isotope_abundance : num

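My final aim is to be able to extract the abundance of each isotope_mz for each of the 64 samples. In fact, knowing the condition is not important, because it can be determined from sample_name.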

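N.B. The XML file I am working with is 150 MB and has more than 4000 compounds x 64 samples; each compound has 1 to 4 isotopes, for which I need the mz and abundance. Besides the R approach requested here, I have also searched for and tried a large number of XML converters, but none of them was able to decipher the structure of this XML file.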

Source

2016-07-25 Jatin Burniston

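Could you give an example of the kind of result you need? Could you also elaborate on what you mean by "doesn't seem to work" (what does it fail to do?) and "further up the tree", with an example? – LarsH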

I would use `XML::xmlToList()` and then parse the list as you see fit. Also, your example XML file is missing `</compounds>` at the bottom; it won't load without it. – dayne

Thanks for your questions. I have added to the original post to include some of my progress so far and to better describe my goal. –

Something like this should work:

library(XML) 
library(data.table) 

mylist <- xmlToList("isotope information.xml") 
# repeat the single example compound three times to mimic a file with several compounds 
mylist <- c(mylist, mylist, mylist) 

xtract <- function(x) { 
    data.table(compound_id = mylist[x]$compound$.attrs["identifier"], 
      sample_id = mylist[x]$compound$condition$sample$.attrs["name"], 
      mz = mylist[x]$compound$condition$sample$adduct$isotope[1], 
      abundance = mylist[x]$compound$condition$sample$adduct$isotope[2]) 
} 

rbindlist(lapply(seq_along(mylist), xtract)) 
#   compound_id  sample_id    mz   abundance 
# 1: 24.24_355.2087m/z ACU_S1_D1_MSonly 355.131459235488 0.115052197015018 
# 2: 24.24_355.2087m/z ACU_S1_D1_MSonly 355.131459235488 0.115052197015018 
# 3: 24.24_355.2087m/z ACU_S1_D1_MSonly 355.131459235488 0.115052197015018

Source

2016-07-26 01:23:59 dayne
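
The `xtract()` function above reads only the first condition, first sample and first isotope of each compound. The following is a minimal sketch of one way to reach every isotope. It swaps the list walk for XPath in the XML package: it selects all `isotope` nodes and climbs to their parents for the sample and compound attributes. The inline `xml_text` is shortened illustrative data with the same nesting as the question's file, and the helper name `up()` is hypothetical.

```r
library(XML)

# Shortened illustrative data with the same nesting as the question's file.
xml_text <- '<compounds>
  <compound identifier="24.24_355.2087m/z" retentionTime="24.2409">
    <condition name="Group A">
      <sample name="ACU_S1_D1_MSonly" normalizedAbundance="0.16">
        <adduct charge="2">
          <isotope><mz>355.13</mz><abundance>0.115</abundance></isotope>
          <isotope><mz>355.70</mz><abundance>0</abundance></isotope>
        </adduct>
      </sample>
    </condition>
    <condition name="Group B">
      <sample name="ACU_S2_D1_MSonly" normalizedAbundance="8.08">
        <adduct charge="2">
          <isotope><mz>355.22</mz><abundance>6.34</abundance></isotope>
        </adduct>
      </sample>
    </condition>
  </compound>
</compounds>'

doc <- xmlParse(xml_text, asText = TRUE)
isotopes <- getNodeSet(doc, "//isotope")

# Hypothetical helper: climb `levels` parents up from a node.
# isotope -> adduct (1) -> sample (2) -> condition (3) -> compound (4)
up <- function(node, levels) {
  for (i in seq_len(levels)) node <- xmlParent(node)
  node
}

# One row per isotope, each carrying its compound, condition and sample.
mydata <- data.frame(
  compound          = sapply(isotopes, function(n) xmlGetAttr(up(n, 4), "identifier")),
  retention_time    = as.numeric(sapply(isotopes, function(n) xmlGetAttr(up(n, 4), "retentionTime"))),
  condition         = sapply(isotopes, function(n) xmlGetAttr(up(n, 3), "name")),
  sample_name       = sapply(isotopes, function(n) xmlGetAttr(up(n, 2), "name")),
  isotope_mz        = as.numeric(sapply(isotopes, function(n) xmlValue(n[["mz"]]))),
  isotope_abundance = as.numeric(sapply(isotopes, function(n) xmlValue(n[["abundance"]]))),
  stringsAsFactors  = FALSE
)

str(mydata)
```

Each column is built in one pass and the data frame is assembled once, which avoids growing it row by row. On a 150 MB file the per-node R calls may still be slow; that has not been measured here.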

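Personally I prefer xml2, so here is an answer using it. I am sure it could be improved, but it will give you a list whose length equals the number of compounds; each element of the list holds the compound identifier and a data.frame with mz and abundance columns.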

library(xml2) 
x = read_xml(conn) # given in question 
#html_structure(x) # If you want to look at the structure 

output = list() 
# Initialize list and collect all compounds first 
a = xml_attrs(xml_find_all(x, "//compound")) 
# Iterate over compounds - I'm sure this could be done in an lapply... 
for(i in seq_along(a)){ 
    y = xml_child(x, i) 
    # Get the child to simplify the xpath to collect all in this one node 
    # Add a new element to the output list 
    output[[i]] = list(
    a[[i]][1], # Extract identifier (assumed you didn't want the retention time) and then a df of mz and abundance 
    # The leading "." keeps each search inside compound y; a bare "//" searches the whole document 
    data.frame(mz = xml_double(xml_find_all(y, ".//isotope/mz")), abundance = xml_double(xml_find_all(y, ".//isotope/abundance"))) 
       ) 
}

Output:

> output 
[[1]] 
[[1]][[1]] 
     identifier 
"24.24_355.2087m/z" 

[[1]][[2]] 
     mz abundance 
1 355.1315 0.115052197 
2 355.7048 0.000000000 
... 
31 355.2422 1.253918178 
32 355.7048 0.000000000

Source

2016-07-26 01:48:23 vincentmajor
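
The list above keeps mz and abundance per compound but drops the sample names, which the question needs. The following is a sketch of one possible xml2 variant. It is not the answer author's code. It assumes `xml_find_first()` applied to a node set returns one match per input node, as the xml2 documentation describes, and uses the XPath `ancestor::` axis to look up each isotope's sample and compound. The inline `xml_text` is shortened illustrative data, and the `isotope_index` and `isotope_id` columns are names introduced here for illustration.

```r
library(xml2)

# Shortened illustrative data with the same nesting as the question's file.
xml_text <- '<compounds>
  <compound identifier="24.24_355.2087m/z" retentionTime="24.2409">
    <condition name="Group A">
      <sample name="ACU_S1_D1_MSonly" normalizedAbundance="0.16">
        <adduct charge="2">
          <isotope><mz>355.13</mz><abundance>0.115</abundance></isotope>
          <isotope><mz>355.70</mz><abundance>0</abundance></isotope>
        </adduct>
      </sample>
    </condition>
    <condition name="Group B">
      <sample name="ACU_S2_D1_MSonly" normalizedAbundance="8.08">
        <adduct charge="2">
          <isotope><mz>355.22</mz><abundance>6.34</abundance></isotope>
        </adduct>
      </sample>
    </condition>
  </compound>
</compounds>'

x <- read_xml(xml_text)
isotopes <- xml_find_all(x, "//isotope")

# For every isotope, look up its enclosing sample and compound (one per isotope).
sample_nodes   <- xml_find_first(isotopes, "ancestor::sample")
compound_nodes <- xml_find_first(isotopes, "ancestor::compound")

mydata <- data.frame(
  compound          = xml_attr(compound_nodes, "identifier"),
  retention_time    = as.numeric(xml_attr(compound_nodes, "retentionTime")),
  sample_name       = xml_attr(sample_nodes, "name"),
  isotope_mz        = xml_double(xml_find_first(isotopes, "./mz")),
  isotope_abundance = xml_double(xml_find_first(isotopes, "./abundance")),
  stringsAsFactors  = FALSE
)

# Number the isotopes within each compound/sample pair (1st, 2nd, ...),
# because the measured mz differs slightly from sample to sample.
mydata$isotope_index <- ave(seq_len(nrow(mydata)),
                            mydata$compound, mydata$sample_name,
                            FUN = seq_along)
mydata$isotope_id <- paste(mydata$compound, mydata$isotope_index, sep = "_iso")

# Wide view: one row per compound isotope, one column per sample.
wide <- tapply(mydata$isotope_abundance,
               list(mydata$isotope_id, mydata$sample_name),
               sum)
print(wide)
```

Cells in `wide` are `NA` where a sample has no isotope at that position. Indexing isotopes by position instead of by mz is a design choice of this sketch and may not suit every analysis.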
