Is there an efficient way to get data.table to emulate ldply from plyr when empty slices are combined?

Suppose I have a list of files that I want to combine into a single data.table. My basic approach to this problem is to do something like this:

files <- dir(...) # The list of files to be combined 

read.data <- function(loadfile) { 
    data.dt <- data.table(read.csv(loadfile)); 
} 

data.dt <- data.table(file = files)[, read.data(file), by = file]

The problem with this approach shows up when you get empty data.tables (caused by empty files that contain only a header row):

Error in `[.data.table`(data.table(file = files), , read.data(file), : 
columns of j don't evaluate to consistent types for each group

Is there a way to get data.table to seamlessly concatenate empty or NULL values correctly? That way you could do something like

if(dim(data.dt)[1] == 0) { 
    data.dt <- NULL 
}

and that should solve most of the problems I am running into.

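Edit: I should point out that I have already implemented this logic using plyr routines. ldply() works perfectly, but unfortunately it is very slow and memory-intensive once you try to pass it more than a handful of files.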
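The idea of discarding empty tables before combining can also be sketched outside of a grouped `[.data.table` call. The following is only a sketch, not something from the original post: it assumes a data.table version that provides `rbindlist()` with the `idcol` argument, and `combine.files` is a hypothetical helper name. It reads each file into a list, drops the zero-row tables, and binds the rest in one pass.

```r
library(data.table)

# Hypothetical helper (not from the original post): read each file,
# drop empty results, then bind everything once.
combine.files <- function(files) {
    # Read every file into a list of data.tables, named by file path
    tables <- lapply(files, function(f) {
        data.table(read.csv(f, stringsAsFactors = FALSE))
    })
    names(tables) <- files

    # Drop tables with zero rows, e.g. files containing only a header row
    tables <- Filter(function(dt) nrow(dt) > 0, tables)

    # Bind in a single pass; idcol records the originating file name
    rbindlist(tables, idcol = "file")
}
```

Called as `combine.files(files)`, this should return one data.table with a `file` column, with header-only files simply absent from the result. Files that are completely empty (no header at all) would still make `read.csv` fail and would need separate handling.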

Source: 2011-09-08, kaybenleroll

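This isn't somewhere I would expect the plyr overhead to have much impact. Most of the time will be taken up by 'read.csv' and the final merge. How many files are you loading? How does the speed of 'ldply' compare with 'llply'? You could also try setting 'stringsAsFactors = F'; working out the factor levels correctly can cause a surprisingly large slowdown. – hadley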

This is a new bug in data.table. I have filed a report so that it won't be forgotten.

A simpler example is:

DT = data.table(a=1:3,b=1:9) 
DT 
     a b 
[1,] 1 1 
[2,] 2 2 
[3,] 3 3 
[4,] 1 4 
[5,] 2 5 
[6,] 3 6 
[7,] 1 7 
[8,] 2 8 
[9,] 3 9 
DT[,if (a==2) NULL else sum(b),by=a] 
Error in `[.data.table`(DT, , if (a == 2) NULL else sum(b), by = a) : 
    columns of j don't evaluate to consistent types for each group

The following error is correct:

DT[,if (a==2) 42 else sum(b),by=a] 
Error in `[.data.table`(DT, , if (a == 2) 42 else sum(b), by = a) : 
    columns of j don't evaluate to consistent types for each group

and is fixed by using an integer literal:

DT[,if (a==2) 42L else sum(b),by=a] 
    a V1 
[1,] 1 12 
[2,] 2 42 
[3,] 3 18

but I can't think of a workaround for the NULL case until the bug is fixed.
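For the toy example, one way to sidestep the issue is to exclude the unwanted group in `i` so that `j` never has to return NULL. This is only a sketch and assumes the groups to skip can be identified before grouping, which may not hold for the file-reading case in the question. It also records the types involved in the `42` versus `42L` case. The table is built with an explicit `rep()` rather than relying on recycling.

```r
library(data.table)

# Same data as the example above, with column a repeated explicitly
DT <- data.table(a = rep(1:3, 3), b = 1:9)

# sum() of an integer column is integer, while the literal 42 is double;
# 42L is integer, so it matches the type returned for the other groups.
types <- c(typeof(sum(DT$b)), typeof(42), typeof(42L))

# Sketch of a workaround: filter out group a == 2 in i, then aggregate by a.
res <- DT[a != 2, sum(b), by = a]
```

Here `res` holds only the groups `a == 1` and `a == 3`, which is what returning NULL for `a == 2` was meant to achieve.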

Source: 2011-09-08 21:49:36
