2013-04-21

How to avoid an optimization warning in data.table

I have the following code:

> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 
> dt 
    a b c d 
1: 3 1 11 21 
2: 3 2 12 22 
3: 3 3 13 23 
4: 3 4 14 24 
5: 3 5 15 25 
6: 4 6 16 26 
7: 4 7 17 27 
8: 4 8 18 28 
9: 4 9 19 29 
10: 4 10 20 30 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d 
1: 3 15 65 115 
2: 4 40 90 140 
> dt[,c(count=.N,lapply(.SD,sum)),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))' 
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future. 
done dogroups in 0 secs 
    a count b c d 
1: 3  5 15 65 115 
2: 4  5 40 90 140 

How do I avoid the scary "very inefficient" warning?

I could add the count column beforehand:

> dt$count <- 1 
> dt 
    a b c d count 
1: 3 1 11 21  1 
2: 3 2 12 22  1 
3: 3 3 13 23  1 
4: 3 4 14 24  1 
5: 3 5 15 25  1 
6: 4 6 16 26  1 
7: 4 7 17 27  1 
8: 4 8 18 28  1 
9: 4 9 19 29  1 
10: 4 10 20 30  1 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d count 
1: 3 15 65 115  5 
2: 4 40 90 140  5 

but this does not look very elegant...


Do you want to "suppress" the warning, or to do things efficiently? – Arun 2013-04-21 15:27:15


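I never said "suppress". I said "avoid", which means I want to do the right thing and make my code work correctly and efficiently, so that no warning is needed. – sds 2013-04-21 16:23:56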


Well, clearly I wasn't sure whether you wanted to "avoid" "seeing" the warning or to "avoid" "having" it. – Arun 2013-04-21 16:37:52

Answers


One way I can think of is to assign count by reference:

dt.out <- dt[, lapply(.SD,sum), by = a] 
dt.out[, count := dt[, .N, by=a][, N]] 
# alternatively: count := table(dt$a) 

# a b c d count 
# 1: 3 15 65 115  5 
# 2: 4 40 90 140  5 
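
The assignment above relies on both grouped results listing the groups in the same order. As a minimal sketch of a variant that matches on a explicitly, assuming a data.table version that supports the on= argument, an update join can attach the counts by reference (the name counts is introduced here only for illustration):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Step 1: per-group sums; j is a plain lapply(.SD, sum), which data.table can optimize
dt.out <- dt[, lapply(.SD, sum), by = a]

# Step 2: per-group row counts, computed separately (hypothetical helper table)
counts <- dt[, .N, by = a]

# Update join: match rows on 'a' and add the count column by reference
dt.out[counts, count := i.N, on = "a"]

print(dt.out)
```

Matching on the grouping column keeps the counts aligned even if the two intermediate results were ever ordered differently.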

Edit 1: I still think this is just a message, not a warning. But if you still want to avoid it, just do:

dt.out[, count := as.numeric(dt[, .N, by=a][, N])] 

Edit 2: Very interesting. Doing the equivalent with the multiple-assignment form of := does not produce that message.

dt.out[, `:=`(count = dt[, .N, by=a][, N])] 
# Detected that j uses these columns: a 
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0 
# Detected that j uses these columns: <none> 
# Optimization is on but j left unchanged as '.N' 
# Starting dogroups ... done dogroups in 0 secs 
# Detected that j uses these columns: N 
# Assigning to all 2 rows 
# Direct plonk of unnamed RHS, no copy. 

This produces a warning: "RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS." – sds 2013-04-21 16:41:12


How do you say this is a warning? It does not mention anything about inefficiency... it is just a message. Anyway, I have made an edit so that you do not get this message. – Arun 2013-04-21 17:14:22


I think you might find 'dt[, .N, by = a][["N"]]' more efficient, because simply extracting a column that way avoids the overhead of calling '[.data.table'. – mnel 2013-04-21 23:48:28
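
A small sketch of the two extraction forms mentioned in this comment, showing that they return the same vector; the variable names are illustrative only, and no timing claim is made here:

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

counts <- dt[, .N, by = a]

# Extraction through data.table's own subsetting method
n_via_j <- counts[, N]

# Plain list-style extraction, which skips the '[.data.table' call
n_via_list <- counts[["N"]]

print(identical(n_via_j, n_via_list))
```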


This solution removes the message about named elements. But you have to put the names back afterwards.

require(data.table) 
options(datatable.verbose = TRUE) 

dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 

dt[, c(.N, unname(lapply(.SD, sum))), by = "a"] 

Output

> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))' 
Starting dogroups ... done dogroups in 0.001 secs 
    a V1 V2 V3 V4 
1: 3 5 15 65 115 
2: 4 5 40 90 140 

Nice (better) alternative. If you put '.N' at the end, setting the names afterwards becomes a little simpler: setnames(dt.out, c(names(dt), "N")). – Arun 2013-04-21 17:45:19
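
Putting the answer and this comment together, a minimal sketch of the unnamed-j approach with '.N' placed last and the names restored afterwards (the result name dt.out is taken from the comment, not from the answer's code):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# j returns an unnamed list per group: the sums of b, c, d, followed by the group size
dt.out <- dt[, c(unname(lapply(.SD, sum)), .N), by = "a"]

# Restore the column names in one call: the original names, plus "N" for the count
setnames(dt.out, c(names(dt), "N"))

print(dt.out)
```

Because the count comes last, the original column names can be reused as they are and only one extra name has to be appended.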


*Significantly* slower: 'Starting dogroups ... done dogroups in 0.277 secs' vs 'Starting dogroups ... done dogroups in 2.929 secs' – sds 2013-04-21 17:53:25


@sds, it is not clear which two solutions you are comparing. – djhurio 2013-04-21 18:01:31
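
To make a comparison like this unambiguous, the two answers can be timed side by side on the same data. The following is only a sketch: the table big, its size, and the number of groups are hypothetical choices for illustration, and the timings will depend on the machine and the data.table version.

```r
library(data.table)

set.seed(1)
n <- 1e5  # hypothetical size, chosen only for illustration
big <- data.table(a = sample(1000L, n, replace = TRUE),
                  b = runif(n), c = runif(n), d = runif(n), key = "a")

# Approach 1: aggregate first, then attach the group sizes by reference
t1 <- system.time({
  res1 <- big[, lapply(.SD, sum), by = a]
  res1[, count := big[, .N, by = a][["N"]]]
})

# Approach 2: unnamed list in j, names restored afterwards
t2 <- system.time({
  res2 <- big[, c(unname(lapply(.SD, sum)), .N), by = a]
  setnames(res2, c(names(big), "count"))
})

# Elapsed seconds for each approach
print(c(two_step = t1[["elapsed"]], unnamed_j = t2[["elapsed"]]))
```

Both approaches should yield the same table up to floating-point tolerance, so any difference in elapsed time reflects the method rather than the result.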