Loading XML into an R data frame with parent node attributes

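I have an XML file (a TEI-encoded play) that I want to process into a data.frame in R, where each row of the data.frame contains one line of the play, the line number, the speaker of that line, the scene number, and the scene type. The body of the XML file looks like this (but longer):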

<text> 
<body> 
<div1 type="scene" n="1"> 
    <sp who="fau"> 
     <l n="30">Settle thy studies, Faustus, and begin</l> 
     <l n="31">To sound the depth of that thou wilt profess;</l> 
     <l n="32">Having commenced, be a divine in show,</l> 
    </sp> 
    <sp who="eang"> 
     <l n="105">Go forward, Faustus, in that famous art,</l> 
    </sp> 
</div1> 
<div1 type="scene" n="2"> 
    <sp who="sch1"> 
     <l n="NA">I wonder what's become of Faustus, that was wont to make our schools ring with sic probo.</l> 
    </sp> 
    <sp who="sch2"> 
     <l n="NA">That shall we know, for see here comes his boy.</l> 
    </sp> 
    <sp who="sch1"> 
     <l n="NA">How now sirrah, where's thy master?</l> 
    </sp> 
    <sp who="wag"> 
     <l n="NA">God in heaven knows.</l> 
    </sp> 
</div1> 
</body> 
</text>

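This question seems similar to the questions asked here and here, but the structure of my XML file is slightly different, so neither of them gave me a workable solution. I have managed to get this far: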

library(XML) 
doc <- xmlTreeParse("data/faustus_sample.xml", useInternalNodes=TRUE) 

bodyToDF <- function(x){ 
    scenenum <- xmlGetAttr(x, "n") 
    scenetype <- xmlGetAttr(x, "type") 
    attributes <- sapply(xmlChildren(x, omitNodeTypes = "XMLInternalTextNode"), xmlAttrs) 
    linecontent <- sapply(xmlChildren(x), xmlValue) 
    data.frame(scenenum = scenenum, scenetype = scenetype, attributes = attributes, linecontent = linecontent, stringsAsFactors = FALSE) 
} 

res <- xpathApply(doc, '//div1', bodyToDF) 
temp.df <- do.call(rbind, res)

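This returns a data.frame with the scene number, scene type, and speaker intact, but I can't work out how to break it down into individual lines (and get the associated line numbers).

I tried importing the file as a list (via xmlToList), but that gave me an unwieldy list of lists of lists, and it also caused a lot of different errors when I tried to use for loops to access the different elements (terrible idea, I know!).

Ideally, I'm looking for a solution that will work on the full file in all its messiness, and that also applies to other similarly structured XML files.

I'm just getting started with R and am completely at a loss. Any assistance you can provide would be greatly appreciated.

Thanks for your help!

Edit: A copy of the full XML file is available here.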

Source

2015-03-03 galenc

Add an extra xpathApply for the sp elements:

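bodyToDF <- function(x){ 
    # Attributes of the enclosing <div1> (the scene) 
    scenenum <- xmlGetAttr(x, "n") 
    scenetype <- xmlGetAttr(x, "type") 
    # One data.frame per <sp> (speech), with one row per <l> (line) 
    sp <- xpathApply(x, 'sp', function(sp) { 
        who <- xmlGetAttr(sp, "who") 
        # Some <sp> elements in the full file have no "who" attribute 
        if(is.null(who)) 
            who <- NA 
        line_num <- xpathSApply(sp, 'l', function(l) xmlGetAttr(l, "n")) 
        linecontent <- xpathSApply(sp, 'l', xmlValue) 
        data.frame(scenenum, scenetype, who, line_num, linecontent, stringsAsFactors = FALSE) 
    }) 
    # Stack the per-speech data.frames into one data.frame for the scene 
    do.call(rbind, sp) 
} 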

res <- xpathApply(doc, '//div1', bodyToDF) 
temp.df <- do.call(rbind, res)

The first 4 columns:

# > temp.df[,1:4] 
# scenenum scenetype who line_num 
# 1  1  scene fau  30 
# 2  1  scene fau  31 
# 3  1  scene fau  32 
# 4  1  scene eang  105 
# 5  2  scene sch1  NA 
# 6  2  scene sch2  NA 
# 7  2  scene sch1  NA 
# 8  2  scene wag  NA
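
A variation on the same idea, offered only as a sketch: start from the l nodes and walk up to their parents with xmlParent, so every row picks up the attributes of its sp and div1. It assumes that every l is a direct child of an sp, which in turn is a direct child of a div1, as in the sample (the full file may nest differently). The inline xml_text sample and the helper name attr_or_na are made up for illustration; the second sp deliberately has no who attribute.

```r
library(XML)

# Hypothetical inline sample mirroring the question's structure.
# The second <sp> has no "who" attribute on purpose.
xml_text <- '<text><body>
<div1 type="scene" n="1">
  <sp who="fau">
    <l n="30">Settle thy studies, Faustus, and begin</l>
    <l n="31">To sound the depth of that thou wilt profess;</l>
  </sp>
  <sp>
    <l n="105">Go forward, Faustus, in that famous art,</l>
  </sp>
</div1>
</body></text>'

doc <- xmlParse(xml_text, asText = TRUE)

# One node per line of the play (assumes the div1 > sp > l nesting)
l_nodes <- getNodeSet(doc, "//div1/sp/l")

# Illustrative helper: return NA instead of NULL when an attribute is absent
attr_or_na <- function(node, name) {
    xmlGetAttr(node, name, default = NA_character_)
}

lines.df <- data.frame(
    # The grandparent of each <l> is its <div1>
    scenenum    = sapply(l_nodes, function(l) attr_or_na(xmlParent(xmlParent(l)), "n")),
    scenetype   = sapply(l_nodes, function(l) attr_or_na(xmlParent(xmlParent(l)), "type")),
    # The parent of each <l> is its <sp>
    who         = sapply(l_nodes, function(l) attr_or_na(xmlParent(l), "who")),
    line_num    = sapply(l_nodes, function(l) attr_or_na(l, "n")),
    linecontent = sapply(l_nodes, xmlValue),
    stringsAsFactors = FALSE
)

print(lines.df[, 1:4])
```

Because each column is built from the same set of l nodes, the columns always have the same length, so a missing attribute becomes NA rather than an error in data.frame.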

Source

2015-03-03 08:55:49 bergant

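It turns out this solution works perfectly on the sample XML, but breaks on the full document. As far as I can tell, the two are formatted identically. Running the line 'res <- xpathApply(doc, '//div1', bodyToDF)' I get the error 'Error in data.frame(scenenum = xmlGetAttr(x, "n"), scenetype = xmlGetAttr(x, : arguments imply differing number of rows: 0, 1' – galenc 2015-03-05 04:09:30

In the full file there is a line without the 'who' attribute. The answer is updated to handle just this case with 'is.null(who)'. – bergant 2015-03-05 08:03:17

Yes, just caught that. Still such an R noob, but I'll get the hang of this. Thank you so much for your help! – galenc 2015-03-05 08:17:44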
