2016-11-24 94 views
4

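Converting data from XML to an R data frame

I am trying to convert an XML file into a data frame, but the resulting format seems to be off. I have looked at different tutorials, and although I had moderate success getting the information I need by using for loops to navigate the parsed file, I have been told that this solution is not efficient.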

The code I tried for this is:

require(XML) 
parsed<-xmlParse("SEWL.xml") 
xmlToDataFrame(parsed) 

But it gives an error: Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", : duplicate subscripts for columns

This other code works, but the format is not what I need:

require(XML) 
require(plyr) 
pldf<-ldply(xmlToList("SEWL.xml"),data.frame) 

The resulting data frame is as follows:

  .id    X..i.. text .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1 
1 technician    "John" <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
2 location    "CO" <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
3  temp    <NA> 21.3 celsius  <NA>   <NA>   <NA>    <NA>    <NA> 
4  runtype   "routine" <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
5  sample    <NA> <NA> 2323 "LL18179"  "2016/08"   0.3248   baseline   5.4389 
6  sample    <NA> <NA> 2323 "LL18179"  "2016/08"   0.3248   baseline   5.4389 
7  sample    <NA> <NA> 8979237 "AA09453"  "2016/03"   0.0117   baseline   5.6012 
8  sample    <NA> <NA> 8979237 "AA09453"  "2016/03"   0.0117   baseline   5.6012 
9  .attrs 2015_07_31_11_33_22 <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
10  .attrs   20150731 <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
11  .attrs    113322 <NA> <NA>  <NA>   <NA>   <NA>    <NA>    <NA> 
    test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1 
1    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
2    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
3    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
4    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
5     std   6.5980    data 1.2131  "OK"  laslum "ATR150607"   "2017/05" 
6     std   6.5980    data 1.2131  "OK"   3 "ATR150607"   "2017/05" 
7     std   1.1431    data 0.2041  "FAIL"  absat  <NA>    <NA> 
8     std   1.1431    data 0.2041  "FAIL"   2  <NA>    <NA> 
9    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
10    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
11    <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  <NA>    <NA> 
    test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5 
1    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
2    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
3    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
4    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
5   0.0673   baseline   4.9721    std   10.3851    data 
6   0.0673   baseline   4.9721    std   10.3851    data 
7    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
8    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
9    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
10    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
11    <NA>    <NA>    <NA>    <NA>    <NA>    <NA> 
    test.calc.1 test.result.1 test..attrs.1 
1   <NA>   <NA>   <NA> 
2   <NA>   <NA>   <NA> 
3   <NA>   <NA>   <NA> 
4   <NA>   <NA>   <NA> 
5  2.0886  "Warning"   atr 
6  2.0886  "Warning"    1 
7   <NA>   <NA>   <NA> 
8   <NA>   <NA>   <NA> 
9   <NA>   <NA>   <NA> 
10  <NA>   <NA>   <NA> 
11  <NA>   <NA>   <NA> 

Here is a sample of the XML file I am using:

<?xml version="1.0" encoding="UTF-8"?> 
<experiment name="abc123" date="20150731" time="113322"> 
    <technician>"John"</technician> 
    <location>"CO"</location> 
    <temp scale="celsius">21.3</temp> 
    <runtype>"routine"</runtype> 
    <sample id="2323"> 
     <test name="laslum" order="3"> 
      <code>"LL18179"</code> 
      <validuntil>"2016/08"</validuntil> 
      <meas name="baseline">0.3248</meas> 
      <meas name="std">5.4389</meas> 
      <meas name="data">6.5980</meas> 
      <calc>1.2131</calc> 
      <result>"OK"</result> 
     </test> 
     <test name="atr" order="1"> 
      <code>"ATR150607"</code> 
      <validuntil>"2017/05"</validuntil> 
      <meas name="baseline">0.0673</meas> 
      <meas name="std">4.9721</meas> 
      <meas name="data">10.3851</meas> 
      <calc>2.0886</calc> 
      <result>"Warning"</result> 
     </test> 
    </sample> 
    <sample id="8979237"> 
     <test name="absat" order="2"> 
      <code>"AA09453"</code> 
      <validuntil>"2016/03"</validuntil> 
      <meas name="baseline">0.0117</meas> 
      <meas name="std">5.6012</meas> 
      <meas name="data">1.1431</meas> 
      <calc>0.2041</calc> 
      <result>"FAIL"</result> 
     </test> 
    </sample> 
</experiment> 

And this is the data frame I would like to get:

experiment technician location temp runtype sample test order  code validuntil baseline std data calc result  date time 
1  abc123  John  CO 21.3 routine 2323 laslum  3 LL18179 2016/08 0.3248 5.4389 6.5980 1.2131  OK 20150731 113322 
2  abc123  John  CO 21.3 routine 2323 atr  1 ATR150607 2017/05 0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322 
3  abc123  John  CO 21.3 routine 8979237 absat  2 AA09453 2016/03 0.0117 5.6012 1.1431 0.2041 FAIL 20150731 113322 

I don't need exactly the same format, just something close enough that I can transform it into the example.

+0

There is also the 'xml2' package, which may be worth a look. – lmo
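For readers who want to follow up on the xml2 suggestion, here is a minimal sketch of what that could look like. It is not taken from the thread: the input string is a shortened stand-in assumed to mirror the structure of SEWL.xml, and the names `xml_input` and `res` are made up for illustration. It relies on `xml_find_first()` returning one result per input node, so every column has one entry per test.

```r
library(xml2)

# Shortened stand-in for SEWL.xml (assumption: same nesting as the real file)
xml_input <- '<experiment name="abc123" date="20150731" time="113322">
  <technician>"John"</technician>
  <sample id="2323">
    <test name="laslum" order="3"><code>"LL18179"</code><meas name="baseline">0.3248</meas></test>
    <test name="atr" order="1"><code>"ATR150607"</code><meas name="baseline">0.0673</meas></test>
  </sample>
  <sample id="8979237">
    <test name="absat" order="2"><code>"AA09453"</code><meas name="baseline">0.0117</meas></test>
  </sample>
</experiment>'

doc   <- read_xml(xml_input)
tests <- xml_find_all(doc, "//test")   # one node per output row

res <- data.frame(
  # experiment-level values are scalars and get recycled across rows
  experiment = xml_attr(doc, "name"),
  technician = xml_text(xml_find_first(doc, "/experiment/technician")),
  # look up each test's parent sample; result has the same length as tests
  sample     = xml_attr(xml_find_first(tests, "parent::sample"), "id"),
  test       = xml_attr(tests, "name"),
  order      = xml_attr(tests, "order"),
  code       = xml_text(xml_find_first(tests, "code")),
  baseline   = xml_text(xml_find_first(tests, "meas[@name='baseline']")),
  stringsAsFactors = FALSE
)
print(res)
```

The remaining columns (std, data, calc, result, date, time) would presumably follow the same pattern.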

Answers

6

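We present two ways of parsing the XML. The first approach (a triple iteration over experiment/sample/test) may run faster, but the second (a single loop over the test nodes, with each test node climbing the tree to fetch its ancestors) has simpler code.

1) Using Lines from Note 2, we implement a triple xpathApply/xpathSApply iteration over the experiment/sample/test nodes. e, s and t denote the current experiment, sample and test node respectively.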

library(XML) 
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE) 

do.call("rbind", xpathApply(doc, "//experiment", function(e) { 
    data.frame(experiment = xmlAttrs(e)[["name"]], 
     technician = xmlValue(e[["technician"]]), 
     location = xmlValue(e[["location"]]), 
     temp = xmlValue(e[["temp"]]), 
     runtype = xmlValue(e[["runtype"]]), 
     t(do.call(cbind, xpathApply(e, "sample", function(s) { 
      sample <- xmlAttrs(s)[["id"]] 
      xpathSApply(s, "test", function(t) { 
        c(sample = sample, 
         test = xmlAttrs(t)[["name"]], 
         order = xmlAttrs(t)[["order"]], 
         code = xmlValue(t[["code"]]), 
         validuntil = xmlValue(t[["validuntil"]]), 
         baseline = xmlValue(t["meas"][[1]]), 
         std = xmlValue(t["meas"][[2]]), 
         data = xmlValue(t["meas"][[3]]), 
         calc = xmlValue(t[["calc"]]), 
         result = xmlValue(t[["result"]]) 
      )})}))), 
     date = xmlAttrs(e)[["date"]], 
     time = xmlAttrs(e)[["time"]] 
)})) 

which gives:

experiment technician location temp runtype sample test order 
1  abc123  "John"  "CO" 21.3 "routine" 2323 laslum  3 
2  abc123  "John"  "CO" 21.3 "routine" 2323 atr  1 
3  abc123  "John"  "CO" 21.3 "routine" 8979237 absat  2 
     code validuntil baseline std data calc result  date 
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131  "OK" 20150731 
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731 
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731 
    time 
1 113322 
2 113322 
3 113322 
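The code above picks the three meas values by position (`t["meas"][[1]]` and so on), which assumes they always appear in the order baseline, std, data. If that is not guaranteed, one option is to select them by their name attribute instead. The sketch below is only an illustration: `meas_value` is a hypothetical helper, not part of the XML package, and the snippet deliberately lists the meas elements out of order.

```r
library(XML)

# A single <test> node with the <meas> children deliberately out of order
xml_snippet <- '<test name="laslum" order="3">
  <meas name="std">5.4389</meas>
  <meas name="baseline">0.3248</meas>
  <meas name="data">6.5980</meas>
</test>'
doc <- xmlParse(xml_snippet, asText = TRUE)
test_node <- xmlRoot(doc)

# Hypothetical helper: text of the <meas> child whose name attribute is nm,
# or NA if the node has no such child
meas_value <- function(node, nm) {
  hits <- xpathSApply(node, sprintf("meas[@name='%s']", nm), xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

baseline <- meas_value(test_node, "baseline")
std      <- meas_value(test_node, "std")
missing  <- meas_value(test_node, "nosuchname")
```

In either approach, `baseline = xmlValue(t["meas"][[1]])` could then be written as `baseline = meas_value(t, "baseline")`.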

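2) Here is another approach in which we loop only over the test nodes and then reach up to the parent and grandparent to get the corresponding sample and experiment information.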

library(XML) 
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE) 

do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node 
     s <- xmlParent(t) # s is sample node 
     e <- xmlParent(s) # e is experiment node 
     data.frame(experiment = xmlAttrs(e)[["name"]], 
      technician = xmlValue(e[["technician"]]), 
      location = xmlValue(e[["location"]]), 
      temp = xmlValue(e[["temp"]]), 
      runtype = xmlValue(e[["runtype"]]), 
      sample = xmlAttrs(s)[["id"]], 
      test = xmlAttrs(t)[["name"]], 
      order = xmlAttrs(t)[["order"]], 
      code = xmlValue(t[["code"]]), 
      validuntil = xmlValue(t[["validuntil"]]), 
      baseline = xmlValue(t["meas"][[1]]), 
      std = xmlValue(t["meas"][[2]]), 
      data = xmlValue(t["meas"][[3]]), 
      calc = xmlValue(t[["calc"]]), 
      result = xmlValue(t[["result"]]), 
      date = xmlAttrs(e)[["date"]], 
      time = xmlAttrs(e)[["time"]] 
     ) 
})) 

which gives:

experiment technician location temp runtype sample test order 
1  abc123  "John"  "CO" 21.3 "routine" 2323 laslum  3 
2  abc123  "John"  "CO" 21.3 "routine" 2323 atr  1 
3  abc123  "John"  "CO" 21.3 "routine" 8979237 absat  2 
     code validuntil baseline std data calc result  date 
1 "LL18179" "2016/08" 0.3248 5.4389 6.5980 1.2131  "OK" 20150731 
2 "ATR150607" "2017/05" 0.0673 4.9721 10.3851 2.0886 "Warning" 20150731 
3 "AA09453" "2016/03" 0.0117 5.6012 1.1431 0.2041 "FAIL" 20150731 
    time 
1 113322 
2 113322 
3 113322 
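Both results still carry the literal double quotes that are part of the element text in the file (for example `<technician>"John"</technician>`), and every column is text. To get closer to the table requested in the question, the quotes can be stripped and the column types guessed afterwards. A minimal sketch, using a small hand-made data frame `res` as a stand-in for the output of either approach:

```r
# Hypothetical stand-in for the data frame returned by approach 1 or 2
res <- data.frame(technician = c('"John"', '"John"'),
                  sample     = c("2323", "8979237"),
                  baseline   = c("0.3248", "0.0117"),
                  result     = c('"OK"', '"FAIL"'),
                  stringsAsFactors = FALSE)

# Remove the literal double quotes from every column
res[] <- lapply(res, function(col) gsub('"', "", as.character(col), fixed = TRUE))

# Let R guess the column types; as.is = TRUE keeps text as character, not factor
res[] <- lapply(res, type.convert, as.is = TRUE)

str(res)
```

Assigning into `res[]` keeps the object a data frame while replacing its columns.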

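Note 1:

By the way, if you read the input XML file, SEWL.xml, into Excel, it does a reasonable job of turning it into tabular form, although further processing is needed to convert it exactly into the table shown in the question.

Note 2:

The input Lines as an R object is: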

Lines <- '<?xml version="1.0" encoding="UTF-8"?> 
<experiment name="abc123" date="20150731" time="113322"> 
    <technician>"John"</technician> 
    <location>"CO"</location> 
    <temp scale="celsius">21.3</temp> 
    <runtype>"routine"</runtype> 
    <sample id="2323"> 
     <test name="laslum" order="3"> 
      <code>"LL18179"</code> 
      <validuntil>"2016/08"</validuntil> 
      <meas name="baseline">0.3248</meas> 
      <meas name="std">5.4389</meas> 
      <meas name="data">6.5980</meas> 
      <calc>1.2131</calc> 
      <result>"OK"</result> 
     </test> 
     <test name="atr" order="1"> 
      <code>"ATR150607"</code> 
      <validuntil>"2017/05"</validuntil> 
      <meas name="baseline">0.0673</meas> 
      <meas name="std">4.9721</meas> 
      <meas name="data">10.3851</meas> 
      <calc>2.0886</calc> 
      <result>"Warning"</result> 
     </test> 
    </sample> 
    <sample id="8979237"> 
     <test name="absat" order="2"> 
      <code>"AA09453"</code> 
      <validuntil>"2016/03"</validuntil> 
      <meas name="baseline">0.0117</meas> 
      <meas name="std">5.6012</meas> 
      <meas name="data">1.1431</meas> 
      <calc>0.2041</calc> 
      <result>"FAIL"</result> 
     </test> 
    </sample> 
</experiment>' 
+0

This seems to be heading in the right direction. How do I replace the Lines object with a call to the actual XML file? – Variax

+0

Remove 'asText = TRUE' and use the file name in place of 'Lines'. We used string input to keep the demonstration self-contained for display on SO. –

+0

That did the trick. Thanks a lot – Variax
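To make the file-based variant from the comments concrete, here is a minimal sketch. It writes a tiny stand-in file to a temporary location so that it runs anywhere; the temporary path and its contents are assumptions for illustration, and in practice the file name would be "SEWL.xml".

```r
library(XML)

# Tiny stand-in for SEWL.xml, written to a temporary file (assumption)
path <- tempfile(fileext = ".xml")
writeLines('<experiment name="abc123"><sample id="2323"><test name="laslum" order="3"/></sample></experiment>',
           path)

# Without asText = TRUE the first argument is treated as a file name
doc <- xmlTreeParse(path, useInternalNodes = TRUE)

# Quick check that the document was read: list the test names
test_names <- xpathSApply(doc, "//test", xmlGetAttr, "name")
print(test_names)
```

The rest of either approach stays unchanged, since it only works on `doc`.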