2014-10-03 71 views
8

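Converting a nested list to a data frame

The goal is to convert a nested list that sometimes contains missing records into a data frame. An example of the structure when a record is missing: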

str(mylist) 

List of 3 
$ :List of 7 
    ..$ Hit : chr "True" 
    ..$ Project: chr "Blue" 
    ..$ Year : chr "2011" 
    ..$ Rating : chr "4" 
    ..$ Launch : chr "26 Jan 2012" 
    ..$ ID  : chr "19" 
    ..$ Dept : chr "1, 2, 4" 
$ :List of 2 
    ..$ Hit : chr "False" 
    ..$ Error: chr "Record not found" 
$ :List of 7 
    ..$ Hit : chr "True" 
    ..$ Project: chr "Green" 
    ..$ Year : chr "2004" 
    ..$ Rating : chr "8" 
    ..$ Launch : chr "29 Feb 2004" 
    ..$ ID  : chr "183" 
    ..$ Dept : chr "6, 8" 

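When there are no missing records, the list can be converted to a data frame with data.frame(do.call(rbind.data.frame, mylist)). When a record is missing, however, the columns no longer match. I know there are functions for merging data frames with non-matching columns, but I have not found one that can be applied to a list. The ideal result would keep record 2, with NA for all of its variables. Any help would be appreciated.

Edited to add dput(mylist):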

list(structure(list(Hit = "True", Project = "Blue", Year = "2011", 
Rating = "4", Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"), .Names = c("Hit", 
"Project", "Year", "Rating", "Launch", "ID", "Dept")), structure(list(
Hit = "False", Error = "Record not found"), .Names = c("Hit", 
"Error")), structure(list(Hit = "True", Project = "Green", Year = "2004", 
Rating = "8", Launch = "29 Feb 2004", ID = "183", Dept = "6, 8"), .Names = c("Hit", 
"Project", "Year", "Rating", "Launch", "ID", "Dept"))) 
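
To make the mismatch concrete, here is a minimal base R sketch that lists which fields each record lacks. The names `all_fields`, `n_fields`, and `missing_fields` are chosen here for illustration only and are not part of the original question.

```r
# Same three records as in the dput() output above
mylist <- list(
  list(Hit = "True", Project = "Blue", Year = "2011", Rating = "4",
       Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"),
  list(Hit = "False", Error = "Record not found"),
  list(Hit = "True", Project = "Green", Year = "2004", Rating = "8",
       Launch = "29 Feb 2004", ID = "183", Dept = "6, 8")
)

# Union of all field names, in order of first appearance
all_fields <- unique(unlist(lapply(mylist, names)))

# Number of fields per record; the short record stands out
n_fields <- lengths(mylist)

# For each record, the fields it does not have
missing_fields <- lapply(mylist, function(rec) setdiff(all_fields, names(rec)))

print(n_fields)
print(missing_fields)
```

Record 2 lacks every field except Hit and Error, while the complete records lack only Error, which is why a plain rbind over the list cannot line the columns up.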

Answers

17

You can also use rbindlist from the data.table package (v1.9.3 or later):

library(data.table) 

rbindlist(mylist, fill=TRUE) 

##      Hit Project Year Rating      Launch  ID    Dept            Error
## 1:  True    Blue 2011      4 26 Jan 2012  19 1, 2, 4               NA
## 2: False      NA   NA     NA          NA  NA      NA Record not found
## 3:  True   Green 2004      8 29 Feb 2004 183    6, 8               NA
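
For readers without data.table, the effect of fill=TRUE can be approximated in base R by padding every record to the union of field names before binding. This is only a sketch of the idea, not how rbindlist is implemented; `pad_record` and `all_fields` are hypothetical names introduced here.

```r
# Same three records as in the question
mylist <- list(
  list(Hit = "True", Project = "Blue", Year = "2011", Rating = "4",
       Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"),
  list(Hit = "False", Error = "Record not found"),
  list(Hit = "True", Project = "Green", Year = "2004", Rating = "8",
       Launch = "29 Feb 2004", ID = "183", Dept = "6, 8")
)

# Union of all field names, in order of first appearance
all_fields <- unique(unlist(lapply(mylist, names)))

# Hypothetical helper: turn one record into a one-row data frame,
# using NA for every field the record does not have
pad_record <- function(rec, fields) {
  vals <- lapply(fields, function(f) {
    if (is.null(rec[[f]])) NA_character_ else rec[[f]]
  })
  as.data.frame(setNames(vals, fields), stringsAsFactors = FALSE)
}

# All one-row data frames now share the same columns, so rbind works
df <- do.call(rbind, lapply(mylist, pad_record, fields = all_fields))
print(df)
```

This is likely much slower than rbindlist on long lists, since it builds one data frame per record.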
+1

[Version 1.9.4 is now on CRAN](http://cran.r-project.org/web/packages/data.table/index.html) (though the remaining binaries may take a day to become available). – Arun 2014-10-03 11:27:18

7

You can create a list of data.frames:

dfs <- lapply(mylist, data.frame, stringsAsFactors = FALSE) 

Then use one of:

library(plyr) 
rbind.fill(dfs) 

or, faster:

library(dplyr) 
rbind_all(dfs) 

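With dplyr::rbind_all, I was surprised that it chooses "" rather than NA for missing data. If you remove stringsAsFactors = FALSE, you get NA, but at the cost of warnings... so suppressWarnings(rbind_all(lapply(mylist, data.frame))) would be an ugly but fast solution.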

+2

'rbind_all()' is deprecated. Please use 'bind_rows()' instead. – psychonomics 2017-01-30 16:25:08
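
Following that comment, a minimal sketch of the same approach with bind_rows(), assuming a dplyr version that provides it is installed. The requireNamespace() guard is only there so the snippet does nothing when dplyr is missing.

```r
# Same three records as in the question
mylist <- list(
  list(Hit = "True", Project = "Blue", Year = "2011", Rating = "4",
       Launch = "26 Jan 2012", ID = "19", Dept = "1, 2, 4"),
  list(Hit = "False", Error = "Record not found"),
  list(Hit = "True", Project = "Green", Year = "2004", Rating = "8",
       Launch = "29 Feb 2004", ID = "183", Dept = "6, 8")
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # One single-row data frame per record, keeping strings as character
  dfs <- lapply(mylist, data.frame, stringsAsFactors = FALSE)

  # bind_rows() matches columns by name and fills absent ones with NA
  res <- dplyr::bind_rows(dfs)
  print(res)
}
```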

5

I just developed a solution for this question that also applies here, so I'll provide it here as well:

tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; }; 
mkcsv <- function(v) paste0(collapse=',',v); 
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key))); 

extractLevelColumns <- function(
    nodes, ## current level node selection 
    ..., ## additional arguments to data.frame() 
    keyList=list(), ## current key path under main list 
    sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns 
    mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars 
) { 
    cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList))); 
    if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list 
    tlList <- lapply(nodes,tl); 
    typeList <- do.call(c,lapply(tlList,`[[`,'type')); 
    if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList))); 
    type <- typeList[1L]; 
    if (type == 'namedlist') { ## hash; recurse 
     allKeys <- unique(do.call(c,lapply(nodes,names))); 
     ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname))); 
    } else if (type == 'list') { ## array; recurse 
     lenList <- do.call(c,lapply(tlList,`[[`,'len')); 
     maxLen <- max(lenList,na.rm=T); 
     allIndexes <- seq_len(maxLen); 
     ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to translate out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes 
    } else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column 
     lenList <- do.call(c,lapply(tlList,`[[`,'len')); 
     maxLen <- max(lenList,na.rm=T); 
     if (is.null(sep)) { 
      ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen))); 
     } else { 
      ## keep original type if maxLen is 1, IOW don't stringify 
      ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen))); 
     }; ## end if 
    } else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList))); 
    if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists 
    ret; 
}; ## end extractLevelColumns() 
## simple interface function 
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...)); 

Demo:

## define data 
mylist <- list(structure(list(Hit='True',Project='Blue',Year='2011',Rating='4',Launch='26 Jan 2012',ID='19',Dept='1, 2, 4'),.Names=c('Hit','Project','Year','Rating','Launch','ID','Dept')),structure(list(Hit='False',Error='Record not found'),.Names=c('Hit','Error')),structure(list(Hit='True',Project='Green',Year='2004',Rating='8',Launch='29 Feb 2004',ID='183',Dept='6, 8'),.Names=c('Hit','Project','Year','Rating','Launch','ID','Dept'))); 

## run it 
df <- flattenList(mylist); 
## extractLevelColumns(): /
## extractLevelColumns(): /Hit
## extractLevelColumns(): /Project
## extractLevelColumns(): /Year
## extractLevelColumns(): /Rating
## extractLevelColumns(): /Launch
## extractLevelColumns(): /ID
## extractLevelColumns(): /Dept
## extractLevelColumns(): /Error

df; 
##     Hit Project Year Rating      Launch   ID    Dept            Error
## 1  True    Blue 2011      4 26 Jan 2012   19 1, 2, 4             <NA>
## 2 False    <NA> <NA>   <NA>        <NA> <NA>    <NA> Record not found
## 3  True   Green 2004      8 29 Feb 2004  183    6, 8             <NA>

As of data.table 1.9.6, my function is more powerful than data.table::rbindlist(), because it can handle any number of nesting levels and different vector lengths across branches. In the linked question, my function correctly flattens the OP's list to a data.frame, but data.table::rbindlist() fails with "Error in rbindlist(jsonRList, fill = T) : Column 4 of item 16 is length 2, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table".
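
To illustrate the "different vector lengths" case in isolation, here is a single-level base R sketch that joins multi-element fields with a separator before binding. It is not the flattenList() function above and does not recurse into nested lists; the data in `records` and the helper `collapse_fields` are made up for this example.

```r
# Hypothetical records: Dept has a different length in each one
records <- list(
  list(ID = "19",  Dept = c("1", "2", "4")),
  list(ID = "183", Dept = c("6", "8")),
  list(ID = "7")
)

# Hypothetical helper: join every field into a single string
collapse_fields <- function(rec, sep = ", ") {
  lapply(rec, function(v) paste(v, collapse = sep))
}

# Union of field names across all records
fields <- unique(unlist(lapply(records, names)))

# Build one padded single-row data frame per record
rows <- lapply(records, function(rec) {
  rec <- collapse_fields(rec)
  vals <- lapply(fields, function(f) {
    if (is.null(rec[[f]])) NA_character_ else rec[[f]]
  })
  as.data.frame(setNames(vals, fields), stringsAsFactors = FALSE)
})

flat <- do.call(rbind, rows)
print(flat)
```

Collapsing to a string loses the original element types, which is roughly the trade-off the sep argument of extractLevelColumns() makes when it is not NULL.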

+0

Wow, I finally found a solution for flattening the kind of list I was facing. Thank you. – jcarlos 2016-09-07 18:13:29