How to avoid the optimization warning in data.table

I have the following code:

> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 
> dt 
    a b c d 
1: 3 1 11 21 
2: 3 2 12 22 
3: 3 3 13 23 
4: 3 4 14 24 
5: 3 5 15 25 
6: 4 6 16 26 
7: 4 7 17 27 
8: 4 8 18 28 
9: 4 9 19 29 
10: 4 10 20 30 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d 
1: 3 15 65 115 
2: 4 40 90 140 
> dt[,c(count=.N,lapply(.SD,sum)),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))' 
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future. 
done dogroups in 0 secs 
    a count b c d 
1: 3  5 15 65 115 
2: 4  5 40 90 140

How can I avoid the scary "very inefficient" warning?

I could add a count column before aggregating:

> dt$count <- 1 
> dt 
    a b c d count 
1: 3 1 11 21  1 
2: 3 2 12 22  1 
3: 3 3 13 23  1 
4: 3 4 14 24  1 
5: 3 5 15 25  1 
6: 4 6 16 26  1 
7: 4 7 17 27  1 
8: 4 8 18 28  1 
9: 4 9 19 29  1 
10: 4 10 20 30  1 
> dt[,lapply(.SD,sum),by="a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))' 
Starting dogroups ... done dogroups in 0 secs 
    a b c d count 
1: 3 15 65 115  5 
2: 4 40 90 140  5

But this does not look very elegant...

Source

2013-04-21 sds

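Do you want to "suppress" the warning, or to do things efficiently? – Arun 2013-04-21 15:27:15

I never said "suppress". I said "avoid", which means I want to do the right thing and make my code work correctly and efficiently, so that no warning is needed. – sds 2013-04-21 16:23:56

To be clear, I wasn't sure whether you wanted to "avoid" "seeing" the warning or "avoid" "having" the warning. – Arun 2013-04-21 16:37:52

One way I can think of is to assign count by reference: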

dt.out <- dt[, lapply(.SD,sum), by = a] 
dt.out[, count := dt[, .N, by=a][, N]] 
# alternatively: count := table(dt$a) 

# a b c d count 
# 1: 3 15 65 115  5 
# 2: 4 40 90 140  5

Edit 1: I still think this is only a message, not a warning. But if you still want to avoid it, just do:

dt.out[, count := as.numeric(dt[, .N, by=a][, N])]

Edit 2: Very interesting. Doing the equivalent with the multiple-assignment form of := does not produce the same message.

dt.out[, `:=`(count = dt[, .N, by=a][, N])] 
# Detected that j uses these columns: a 
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0 
# Detected that j uses these columns: <none> 
# Optimization is on but j left unchanged as '.N' 
# Starting dogroups ... done dogroups in 0 secs 
# Detected that j uses these columns: N 
# Assigning to all 2 rows 
# Direct plonk of unnamed RHS, no copy.
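
The two-step approach above can be put together as a self-contained sketch. It assumes that both grouped results list the groups in the same order, which holds here because both are grouped by the same column of the same table. The name counts is only illustrative, and the vector is pulled out with [["N"]], a plain list extraction.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Step 1: sum every non-grouping column per group.
dt.out <- dt[, lapply(.SD, sum), by = a]

# Step 2: per-group row counts as a plain integer vector
# (assumed to be in the same group order as dt.out).
counts <- dt[, .N, by = a][["N"]]

# Step 3: add the counts to the aggregated table by reference.
dt.out[, count := counts]

print(dt.out)
```

Keeping j as a bare lapply(.SD, sum) is what lets the grouping step avoid building a named list for every group.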

Source

2013-04-21 15:23:44 Arun

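This produces the warning "RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS." – sds 2013-04-21 16:41:12

How do you say this is a warning? It doesn't mention anything about inefficiency... it's just a message. Anyway, I've made an edit so that you don't get this message. – Arun 2013-04-21 17:14:22

I think you may find 'dt[, .N, by = a][["N"]]' more efficient, because it avoids the overhead of calling '[.data.table' when you are simply subsetting. – mnel 2013-04-21 23:48:28

This solution removes the message about named elements. But you have to put the names back afterwards.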

require(data.table) 
options(datatable.verbose = TRUE) 

dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a") 

dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

Output

> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"] 
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0 
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))' 
Starting dogroups ... done dogroups in 0.001 secs 
    a V1 V2 V3 V4 
1: 3 5 15 65 115 
2: 4 5 40 90 140
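
Putting the names back could look like the following minimal sketch. It assumes the result columns come out in the order group column, count, then the summed columns, as in the output above; the name count is chosen here for illustration.

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Unnamed j: the result columns get default names V1, V2, ...
dt.out <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

# Restore meaningful names by reference: the group column, the count,
# then the summed columns in their original order.
setnames(dt.out, c("a", "count", setdiff(names(dt), "a")))

print(dt.out)
```

setnames() renames in place, so no copy of dt.out is made.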

Source

2013-04-21 17:33:01 djhurio

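Nice (better) option. Setting the names afterwards becomes easier if '.N' goes at the end, using setnames(dt.out, c(names(dt), "N")) (slightly simpler). – Arun 2013-04-21 17:45:19

*Significantly* slower: 'Starting dogroups ... done dogroups in 0.277 secs' vs 'Starting dogroups ... done dogroups in 2.929 secs' – sds 2013-04-21 17:53:25

@sds, it is not clear which two solutions you are comparing. – djhurio 2013-04-21 18:01:31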
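
One way to make such a comparison explicit is to time each variant on the same data. The sketch below is only an assumption about how that could be set up: the table big, its size, and the labels named, unnamed and twostep are invented for illustration, and timings will vary with machine and data.table version.

```r
library(data.table)

set.seed(1)
# Hypothetical test data: 1e5 rows spread over up to 1000 groups.
big <- data.table(a = sample(1000L, 1e5, replace = TRUE),
                  b = runif(1e5), c = runif(1e5))

# Variant 1: named j, as in the question.
t_named <- system.time(
  r1 <- big[, c(count = .N, lapply(.SD, sum)), by = a]
)

# Variant 2: unnamed j, names restored afterwards.
t_unnamed <- system.time({
  r2 <- big[, c(.N, unname(lapply(.SD, sum))), by = a]
  setnames(r2, c("a", "count", "b", "c"))
})

# Variant 3: aggregate first, then add the counts by reference.
t_twostep <- system.time({
  r3 <- big[, lapply(.SD, sum), by = a]
  r3[, count := big[, .N, by = a][["N"]]]
})

# Elapsed seconds per variant.
print(c(named = t_named[["elapsed"]],
        unnamed = t_unnamed[["elapsed"]],
        twostep = t_twostep[["elapsed"]]))
```

Reporting which variants were timed, and on what data, makes numbers such as the two dogroups timings quoted above reproducible.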
