R 'aggregate' runs out of memory

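I have a microblog dataset (600 MB, 5,038,720 observations), and I am trying to find out how many tweets each user posts in a given hour, where tweets with the same mid are counted once. This is what the dataset looks like: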

head(mydata) 

     uid    mid year month date hour min sec 
1738914174 3342412291119279 2011  8 3 21 4 12 
1738914174 3342413045470746 2011  8 3 21 7 12 
1738914174 3342823219232783 2011  8 5 0 17 5 
1738914174 3343095924467484 2011  8 5 18 20 43 
1738914174 3343131303394795 2011  8 5 20 41 18 
1738914174 3343386263030889 2011  8 6 13 34 25

Here is my code:

count <- function(x) { 
length(unique(na.omit(x))) 
} 
attach(mydata) 
hourPost <- aggregate(mid, by=list(uid, hour), FUN=count)

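It hangs for about half an hour, and I found that it had used all of the physical memory (24 GB) and started using virtual memory. Any idea why such a small task consumes so much time and memory, and how I should improve it? Thanks in advance!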

2013-07-23 leoce

Use the data.table package:

mydata <- read.table(text="  uid    mid year month date hour min sec 
1738914174 3342412291119279 2011  8 3 21 4 12 
1738914174 3342413045470746 2011  8 3 21 7 12 
1738914174 3342823219232783 2011  8 5 0 17 5 
1738914174 3343095924467484 2011  8 5 18 20 43 
1738914174 3343131303394795 2011  8 5 20 41 18 
1738914174 3343386263030889 2011  8 6 13 34 25", 
header=TRUE, colClasses = c(rep("character",2),rep("numeric",6)), 
stringsAsFactors = FALSE) 

library(data.table) 
DT <- data.table(mydata) 
DT[, length(unique(na.omit(mid))), by=list(uid,hour)]

aggregate coerces the grouping variables to factors, which might be what is eating your memory (I assume uid has many levels).
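Before running this on all 5 million rows, it may be worth checking on a small sample that both approaches return the same counts. The following is a minimal sketch using the six sample rows from the question; the result column name N and the variable names base_res and dt_res are only illustrative.

```r
library(data.table)

# The six sample rows from the question (ids kept as character, as in the answer).
mydata <- data.frame(
  uid  = rep("1738914174", 6),
  mid  = c("3342412291119279", "3342413045470746", "3342823219232783",
           "3343095924467484", "3343131303394795", "3343386263030889"),
  hour = c(21, 21, 0, 18, 20, 13),
  stringsAsFactors = FALSE
)

# Same helper as in the question: number of distinct non-missing values.
count <- function(x) length(unique(na.omit(x)))

# Base R version, passing the columns explicitly instead of using attach().
base_res <- aggregate(mydata$mid,
                      by = list(uid = mydata$uid, hour = mydata$hour),
                      FUN = count)

# data.table version, with the result column named N (illustrative name).
dt_res <- as.data.table(mydata)[, .(N = count(mid)), by = .(uid, hour)]

# Sort both results by hour so the rows line up, then compare the counts.
base_counts <- base_res$x[order(base_res$hour)]
dt_counts   <- dt_res[order(hour), N]
identical(base_counts, dt_counts)
```

If the comparison holds on a sample, the data.table call can be run on the full data with more confidence.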

There may be room for further optimization, but you haven't provided a representative test case.
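One possible direction, sketched here on made-up data rather than on a representative test case: setDT() converts a data.frame to a data.table in place instead of copying it, and uniqueN() counts distinct values directly. This assumes a data.table version in which uniqueN() accepts na.rm; the toy values and the column name n_tweets are hypothetical.

```r
library(data.table)

# Hypothetical toy data: one duplicated mid and one missing mid.
toy <- data.frame(
  uid  = c("a", "a", "a", "b", "b"),
  mid  = c("m1", "m1", "m2", NA, "m3"),
  hour = c(21, 21, 21, 0, 0),
  stringsAsFactors = FALSE
)

# Convert by reference; no second copy of the data is made.
setDT(toy)

# Count distinct non-missing mid values per (uid, hour) group.
res <- toy[, .(n_tweets = uniqueN(mid, na.rm = TRUE)), by = .(uid, hour)]
res
```

Whether this is noticeably faster than length(unique(na.omit(mid))) on the real data would need to be measured.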

2013-07-23 08:45:12 Roland

Thank you very much! I just did as you suggested, and it took less than a minute to return the result! – leoce

Behold the wonder that is data.table. – Roland
