2010-08-04
3

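Converting an uneven hierarchical list to a data frame

I don't think this has been asked before, but is there a way to combine the information in a list with multiple levels and an uneven structure into a data frame in "long" format?

Specifically: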

library(XML) 
library(plyr) 
xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5.xml" 
xml.parse <- xmlInternalTreeParse(xml.inning) 
xml.list <- xmlToList(xml.parse) 
## $top$atbat 
## $top$atbat$pitch 
##    des    id   type    x    y 
##   "Ball"   "310"    "B"   "70.39"  "125.20" 

The structure is as follows:

> llply(xml.list, function(x) llply(x, function(x) table(names(x)))) 
$top
$top$atbat
.attrs  pitch 
     1      4 
$top$atbat
.attrs  pitch 
     1      4 
$top$atbat
.attrs  pitch 
     1      5 
$bottom
$bottom$action
     b    des  event      o  pitch player      s 
     1      1      1      1      1      1      1 
$bottom$atbat
.attrs  pitch 
     1      5 
$bottom$atbat
.attrs  pitch 
     1      5 
$bottom$atbat
.attrs  pitch runner 
     1      5      1 
$bottom$atbat
.attrs  pitch runner 
     1      7      1 
$.attrs 
$.attrs$num 
character(0) 
$.attrs$away_team 
character(0) 
$.attrs$ 

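What I want is a data frame built from the named vectors in the pitch category, along with the levels each one came from (top, bottom, atbat). Because the number of columns differs between levels, I would need to ignore the levels that do not fit into a data.frame. Something like this: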

   first second third    des     x
1    top  atbat pitch   Ball 70.29
2    top  atbat pitch Strike 69.24
3 bottom  atbat pitch    Out 67.22

Is there an elegant way of doing this? Thanks!


Related question: http://stackoverflow.com/questions/2067098/how-to-transform-xml-data-into-a-data-frame – apeescape 2010-08-05 19:00:31

Answers

5

I don't know about elegant, but this works. Those more familiar with plyr can probably provide a more general solution.

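cleanFun <- function(x) { 
    # `[[` matches only the FIRST element named "atbat" in this half-inning
    a <- x[["atbat"]] 
    # stack that at-bat's pitch vectors into a character matrix, one row per pitch
    b <- do.call(rbind, a[names(a) == "pitch"]) 
    as.data.frame(b) 
} 
# ldply() adds the half-inning name ("top"/"bottom") as the .id column
ldply(xml.list[c("top","bottom")], cleanFun)[,1:5] 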
     .id             des  id type      x
1    top            Ball 310    B  70.39
2    top   Called Strike 311    S 118.45
3    top   Called Strike 312    S  86.70
4    top In play, out(s) 313    X  79.83
5 bottom            Ball 335    B  15.45
6 bottom   Called Strike 336    S  77.25
7 bottom Swinging Strike 337    S  99.57
8 bottom            Ball 338    B 106.44
9 bottom In play, out(s) 339    X 134.76
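
Because `x[["atbat"]]` picks up only the first at-bat of each half-inning, the output above stops after the first batter on each side. The following is a rough sketch, not a tested solution for the real feed, of how every at-bat could be visited while also recording the three levels the question asked for. It uses base R only. The `toy` list and the function name `pitches_long` are made up for illustration, and the sketch assumes every pitch vector carries the same names (otherwise `rbind()` would fail and something like `rbind.fill()` would be needed).

```r
# Toy stand-in with the same shape as xml.list (all values are invented)
toy <- list(
  top = list(
    atbat = list(pitch  = c(des = "Ball",   id = "1", type = "B"),
                 pitch  = c(des = "Strike", id = "2", type = "S"),
                 .attrs = c(num = "1", des = "Lines out.")),
    atbat = list(pitch  = c(des = "Ball",   id = "3", type = "B"),
                 .attrs = c(num = "2", des = "Singles."))
  ),
  bottom = list(
    action = c(b = "0", s = "0", o = "1", des = "Coaching visit."),
    atbat  = list(pitch  = c(des = "Out", id = "4", type = "X"),
                  runner = c(id = "9", start = "", end = "1B"),
                  .attrs = c(num = "3", des = "Grounds out."))
  ),
  .attrs = c(num = "5")
)

pitches_long <- function(lst, halves = c("top", "bottom")) {
  rows <- list()
  for (half in halves) {
    events <- lst[[half]]
    # visit every element named "atbat", not just the first one
    for (i in which(names(events) == "atbat")) {
      ab <- events[[i]]
      # keep only the pitch vectors; .attrs and runner are skipped
      for (p in ab[names(ab) == "pitch"]) {
        rows[[length(rows) + 1]] <- data.frame(
          first = half, second = "atbat", third = "pitch",
          t(p), stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)
}

res <- pitches_long(toy)
print(res)
```

All columns come back as character, so numeric fields such as `x` would still need `as.numeric()` afterwards.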
1

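The .id feature of ldply() is nice, but it seems the .id columns collide once you call ldply() a second time.

Here is a fairly general function using rbind.fill():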

aho <- ldply(llply(xml.list[[1]], function(x) ldply(x, function(x) rbind.fill(data.frame(t(x)))))) 
> aho[1:5,1:4] 
     .id                                                     des   id type
1  pitch                                                    Ball  310    B
2  pitch                                           Called Strike  311    S
3  pitch                                           Called Strike  312    S
4  pitch                                         In play, out(s)  313    X
5 .attrs Alexei Ramirez lines out to second baseman Ian Kinsler. <NA> <NA>

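The .id from the second ldply() is lost because the data frame already has an .id column. We can work around this by giving the first .id a different name, but that does not feel very consistent.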

aho2 <- ldply(llply(xml.list[[1]], function(x) { 
    out <- ldply(x, function(x) rbind.fill(data.frame(t(x)))) 
    names(out)[1] <- ".id2" 
    out 
})) 
> aho2[1:5,1:4] 
    .id   .id2                                                     des   id
1 atbat  pitch                                                    Ball  310
2 atbat  pitch                                           Called Strike  311
3 atbat  pitch                                           Called Strike  312
4 atbat  pitch                                         In play, out(s)  313
5 atbat .attrs Alexei Ramirez lines out to second baseman Ian Kinsler. <NA>
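
One way to sidestep the .id naming issue altogether might be to walk the list recursively and carry the path of names along, rather than nesting ldply() calls. The sketch below is only an illustration in base R: `collect_leaves`, `leaves_to_df` and the `toy` list are hypothetical names and data, and it assumes every leaf is a named atomic vector (empty or NULL nodes are not handled). Missing fields are filled with NA, which is roughly what rbind.fill() does.

```r
# Toy list with uneven structure (all values are invented)
toy <- list(
  top    = list(atbat = list(pitch  = c(des = "Ball", id = "1"),
                             .attrs = c(num = "1", des = "Lines out."))),
  bottom = list(action = c(des = "Coaching visit.", o = "1"))
)

# Recursively gather every non-list leaf together with the names leading to it
collect_leaves <- function(node, path = character(0)) {
  if (!is.list(node)) {
    return(list(list(path = path, values = node)))
  }
  out <- list()
  for (i in seq_along(node)) {
    out <- c(out, collect_leaves(node[[i]], c(path, names(node)[i])))
  }
  out
}

# Bind the leaves into one data frame, padding absent fields with NA
leaves_to_df <- function(leaves) {
  cols <- unique(unlist(lapply(leaves, function(l) names(l$values))))
  rows <- lapply(leaves, function(l) {
    # indexing by a name the vector lacks yields NA
    vals <- setNames(as.list(l$values[cols]), cols)
    data.frame(path = paste(l$path, collapse = "/"), vals,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

df <- leaves_to_df(collect_leaves(toy))
print(df)
```

Filtering on the `path` column (for example, keeping rows whose path ends in "pitch") would then drop the levels that do not belong in the final table, and `strsplit(df$path, "/")` could split the path back into separate level columns.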