Edit: added benchmarks with a larger data set.
Compute the population of each district using the function tapply:
districtdata$population <-
  tapply(towndata$population, towndata$district_ID, sum)[districtdata$district_ID]
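To make the lookup step concrete, here is a minimal sketch on toy data. The town names and numbers are made up for illustration. It looks the sums up by name via as.character(), which is an assumption on my part about what you want when the IDs are not simply 1..n in sorted order.

```r
# Toy data (hypothetical values, for illustration only)
towndata <- data.frame(
  town        = c("A", "B", "C", "D", "E"),
  district_ID = c(2, 1, 2, 3, 1),
  population  = c(100, 250, 50, 400, 150)
)
districtdata <- data.frame(district_ID = c(3, 1, 2))

# tapply() returns one sum per district, named by district ID
pop_by_district <- tapply(towndata$population, towndata$district_ID, sum)

# Look the sums up by name so the result does not rely on the IDs
# being 1..n; as.vector() drops the names before storing the column
districtdata$population <- as.vector(
  pop_by_district[as.character(districtdata$district_ID)]
)

print(districtdata)
# district 3 -> 400, district 1 -> 400, district 2 -> 150
```

Indexing with the raw integer IDs, as in the one-liner above, gives the same result whenever the IDs happen to be 1..n.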
Some benchmarks, just for fun:
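library(data.table)
library(microbenchmark)

# Base R: sum town populations per district with tapply()
fn1 <- function(districts, towns)
{
  districts$population <-
    tapply(towns$population, towns$district_ID, sum)[districts$district_ID]
  districts
}

# Roland's data.table approach: aggregate by key, then merge
fn2 <- function(districts, towns)
{
  districts <- data.table(districts, key = "district_ID")
  towns <- data.table(towns, key = "district_ID")
  temp <- towns[, list(district_pop = sum(population)), by = district_ID]
  merge(districts, temp)
}

set.seed(42)
districts <- data.frame(district_ID = 1:300, whatever = rnorm(300))

# 300 districts with 300 towns each; every column must have the same
# number of rows or data.frame() fails (the row count is an assumption)
n_towns <- 300 * 300
towns <- data.frame(town = seq_len(n_towns),
                    district_ID = rep(1:300, each = 300),
                    population = rpois(n_towns, sample(c(1e3, 1e4, 1e5))))

microbenchmark(fn1(districts, towns), fn2(districts, towns))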
Unit: milliseconds
                  expr       min        lq    median        uq       max neval
 fn1(districts, towns) 215.29266 231.47103 243.72353 265.28280 355.43895   100
 fn2(districts, towns)  20.03636  27.51046  36.11116  58.56448  88.70766   100
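+1 All that work and you still haven't received anything? – statquant 2013-03-16 18:47:52
How can I generalize this to sum all columns in 'towns', rather than only one (population, above), indexed by 'district_ID'? – 2013-10-12 10:09:16
'temp <- towns[, lapply(.SD, sum), by = district_ID]', possibly also using '.SDcols'. Read the documentation. – Roland 2013-10-12 10:43:33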
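For anyone landing here with the same question, a minimal sketch of what Roland's comment describes. The 'households' column and all values are hypothetical, added only so there is more than one column to sum.

```r
library(data.table)

# Toy table; 'households' is a hypothetical second numeric column
towns <- data.table(
  district_ID = c(1, 1, 2, 2, 3),
  population  = c(100, 200, 300, 400, 500),
  households  = c(10, 20, 30, 40, 50)
)

# .SD holds every column except the grouping column, so this sums
# all remaining columns within each district
all_sums <- towns[, lapply(.SD, sum), by = district_ID]

# .SDcols restricts which columns end up in .SD
pop_only <- towns[, lapply(.SD, sum), by = district_ID,
                  .SDcols = "population"]

print(all_sums)
print(pop_only)
```

If the table also has non-numeric columns (town names, say), sum() will fail on them, so listing the numeric columns in .SDcols is probably the safer choice.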