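It seems that selecting one or more columns from a data.table with [.data.table results in a copy of the underlying vector(s). I am talking about very simple column selection, by name: there are no expressions to compute in j and no rows to subset in i. Even more strangely, column subsetting in a data.frame does not appear to make any copies. I am using data.table version 1.10.4. A simple example with details and benchmarks is provided below. Why does selecting columns from a data.table produce a copy? My questions are:
- Am I doing something wrong?
- Is this a bug, or is this the intended behavior?
- If this is intended, what is the best approach to subset a data.table by columns and avoid the extra copy?
The intended use case involves large datasets, so avoiding the extra copy is a must (especially since base R seems to support this already).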
library(data.table)
set.seed(12345)
cpp_dt <- data.table(a = runif(1e6), b = rnorm(1e6), c = runif(1e6))
cols=c("a","c")
## naive/data.frame style of column selection
## leads to a copy of the column vectors in cols
subset_cols_1=function(dt,cols){
return(dt[,cols,with=FALSE])
}
## alternative syntax, still results in a copy
subset_cols_2=function(dt,cols){
return(dt[,..cols])
}
## work-around that uses data.frame column selection,
## appears to avoid the copy
subset_cols_3=function(dt,cols){
setDF(dt)
subset=dt[,cols]
setDT(subset)
setDT(dt)
return(subset)
}
## another approach that makes a "shallow" copy of the data.table
## then NULLs the not needed columns by reference
## appears to also avoid the copy
subset_cols_4=function(dt,cols){
subset=dt[TRUE]
other_cols=setdiff(names(subset),cols)
set(subset,j=other_cols,value=NULL)
return(subset)
}
subset_1=subset_cols_1(cpp_dt,cols)
subset_2=subset_cols_2(cpp_dt,cols)
subset_3=subset_cols_3(cpp_dt,cols)
subset_4=subset_cols_4(cpp_dt,cols)
Now let's look at the memory allocation and compare it with the original data.
.Internal(inspect(cpp_dt)) # original data, keep an eye on 1st and 3rd vector
# @7fe8ba278800 19 VECSXP g1c7 [OBJ,MARK,NAM(2),ATT] (len=3, tl=1027)
# @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @10f1a3000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) -0.947317,-0.636669,0.167872,-0.206986,0.411445,...
# @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Using the [.data.table method for column subsetting:
.Internal(inspect(subset_1)) # looks like data.table is making a copy
# @7fe8b9f3b800 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
# @114cb0000 14 REALSXP g0c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @1121ca000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Another syntax version that still uses [.data.table and still makes a copy:
.Internal(inspect(subset_2)) # same, still copy
# @7fe8b6402600 19 VECSXP g0c7 [OBJ,NAM(1),ATT] (len=2, tl=1026)
# @115452000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @1100e7000 14 REALSXP g0c7 [NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Using the sequence of setDF, followed by [.data.frame and setDT. Look, the vectors a and c are no longer copied: their addresses match those in the original data. It looks like the base R method is more efficient, with a smaller memory footprint?
.Internal(inspect(subset_3)) # "[.data.frame" is not making a copy!!
# @7fe8b633f400 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1026)
# @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
Another approach is to make a shallow copy of the data.table, then NULL all the extra columns by reference in the new data.table. Again, no copy is made.
.Internal(inspect(subset_4)) # 4th approach seems to also avoid the copy
# @7fe8b924d800 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=1027)
# @10e2ce000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.720904,0.875773,0.760982,0.886125,0.456481,...
# @10f945000 14 REALSXP g1c7 [MARK,NAM(2)] (len=1000000, tl=0) 0.717611,0.95416,0.191546,0.48525,0.539878,...
# ATTRIB: [removed]
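The address comparison done by eye on the inspect() output above can also be scripted. Below is a minimal sketch, assuming data.table's exported address() function; shares_columns is a hypothetical helper name introduced here for illustration, not part of data.table. It returns TRUE for each column whose vector is shared between the original and the subset, and FALSE where a copy was made.

```r
library(data.table)

# Hypothetical helper (not part of data.table): for each column name,
# report whether `original` and `subset` point at the same underlying vector.
shares_columns <- function(original, subset, cols) {
  vapply(cols, function(col) {
    address(original[[col]]) == address(subset[[col]])
  }, logical(1))
}

# Small illustrative table; the sizes here are arbitrary.
dt <- data.table(a = runif(1e3), b = rnorm(1e3), c = runif(1e3))
cols <- c("a", "c")

# Column selection through `[.data.table`
shares_columns(dt, dt[, cols, with = FALSE], cols)

# Shallow copy via dt[TRUE], then drop the other columns by reference
# (mirrors subset_cols_4 from the question)
shallow <- dt[TRUE]
set(shallow, j = setdiff(names(shallow), cols), value = NULL)
shares_columns(dt, shallow, cols)
```

The result may differ between data.table versions, so it is worth running on the installed version rather than assuming the outcome reported for 1.10.4.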
Now let's look at benchmarks for these four approaches. It looks like "[.data.frame" (subset_cols_3) is a clear winner.
library(microbenchmark)
microbenchmark({subset_cols_1(cpp_dt,cols)},
{subset_cols_2(cpp_dt,cols)},
{subset_cols_3(cpp_dt,cols)},
{subset_cols_4(cpp_dt,cols)},
times=100)
# Unit: microseconds
# expr min lq mean median uq max neval
# { subset_cols_1(cpp_dt, cols) } 4772.092 5128.7395 8956.7398 7149.447 10189.397 53117.358 100
# { subset_cols_2(cpp_dt, cols) } 4705.383 5107.1690 8977.1816 6680.666 9206.164 53523.191 100
# { subset_cols_3(cpp_dt, cols) } 148.659 177.9595 285.4926 250.620 283.414 4422.968 100
# { subset_cols_4(cpp_dt, cols) } 193.912 241.9010 531.8308 336.467 384.844 20061.864 100
Maybe just waiting for an update here: https://stackoverflow.com/a/26481429/ The 'shallow' function is not exported yet, but it might help with this. – Frank