2013-04-11 69 views
2

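Subset by a function for each level of a vector and return a new data frame (in R)

I have a simple data frame with two vectors, "Speed" and "ID". It looks like this: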

mydata 
ID  Speed 
1 1 6.031847 
2 1 7.050654 
3 1 7.769475 
4 1 8.838968 
5 1 9.956571 
6 1 11.146864 
7 1 11.967616 
8 1 13.078422 
9 1 14.214301 
10 1 14.974159 
11 2 16.048627 
12 2 17.070484 
.. . ......... 

I want to make a subset of the data frame containing the top 20% of the Speed values:

subset0.20<-subset(mydata, Speed > quantile(Speed, prob = 1 - 20/100, na.rm=T)) 

But I don't want this computed over the whole dataset, because that gives me an unequal number of values for each ID.

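So the top 20% of values has to be computed for each ID separately, and the results for each ID then combined into a new data frame. That data frame would contain 8 rows (20% of my original dataset, which has 40 rows).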

So I did some nail biting, pulled out some hair, and tried a "for loop" such as:

for(i in 1:length(ID)){ 
    subset0.80<-subset(mydata[i], GForce > quantile(Speed, prob = 1 - 20/100, na.rm=T)) 
    } 

and something with apply like:

apply(mydata$Speed, 1 ,function(x) (subset(x > quantile(Speed, prob = 1 - 20/100, na.rm=T)))) 

But I am just not experienced enough with R to get it to work. Can anyone help me, and explain what I am doing wrong?

dput(mydata) 
structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 
4, 4, 4, 4, 4, 4), Speed = c(6.03184705225504, 7.05065401832249, 
7.76947483668907, 8.83896842017956, 9.95657139135043, 11.1468640558647, 
11.9676155772803, 13.0784218506988, 14.2143010441769, 14.9741594881612, 
16.0486271520862, 17.0704843261466, 17.9324808839116, 19.1169673939822, 
20.0528330256269, 20.9320440815571, 22.0379467007031, 22.962355355126, 
24.0764744246649, 25.1182530133201, 26.0456043859692, 26.9528777031822, 
27.9414746553538, 29.129640434174, 29.9443040639644, 30.9226103003052, 
31.9932286699133, 32.9925644101585, 33.9930708538141, 35.0124438238874, 
35.9215486087666, 36.9015465999988, 38.1044534443389, 39.0368063088987, 
40.272189714015, 40.8993100278334, 41.9790311160737, 43.1027190745506, 
43.8575622361406, 45.0499599122387)), .Names = c("ID", "Speed" 
), row.names = c(NA, -40L), class = "data.frame") 

Answers

4

Using by, you can call the subset function for each ID. Then you can bind the results with do.call and rbind to convert the list to a data.frame.

You can do something like this:

do.call(rbind,by(mydata,mydata$ID,FUN= function(x) 
     subset(x, Speed > quantile(Speed, prob = 1 - 20/100, na.rm=T)))) 

    ID Speed 
1.9 1 14.21430 
1.10 1 14.97416 
2.19 2 24.07647 
2.20 2 25.11825 
3.29 3 33.99307 
3.30 3 35.01244 
4.39 4 43.85756 
4.40 4 45.04996 
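
For what it's worth, the loop in the question seems to fail for a few reasons: `mydata[i]` selects column i rather than the rows of one ID, `ID` and `GForce` are not objects that exist outside the data frame, and the result is overwritten on every iteration. A minimal sketch of a working loop follows. The `toy` data frame and the names `pieces`, `cutoff` and `top20` are made up for illustration and are not from the question.

```r
# Hypothetical toy data standing in for mydata: 2 IDs, 5 rows each
toy <- data.frame(ID = rep(1:2, each = 5),
                  Speed = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50))

pieces <- list()
for (id in unique(toy$ID)) {
  rows <- toy[toy$ID == id, ]                        # rows of this ID only
  cutoff <- quantile(rows$Speed, probs = 0.8, na.rm = TRUE)
  # which() drops comparisons that are NA instead of returning NA rows
  pieces[[as.character(id)]] <- rows[which(rows$Speed > cutoff), ]
}

# Stack the per-ID pieces into one data frame
top20 <- do.call(rbind, pieces)
top20
```

This is essentially what `by` does for you in one call: split, apply the function to each piece, and hand back a list to `rbind`.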
+0

I've seen it mentioned somewhere that `split` + `lapply` can usually be shortened by using `by`. +1 – A5C1D2H2I1M1N2O1R2T1 2013-04-11 09:49:23

+0

+1, haha. Almost the same as mine :) – 2013-04-11 09:51:33

+0

@agstudy: Not meant in chronological order. :) – 2013-04-11 09:59:24

4

There are several ways to do this (many, which can get confusing). Here is one that uses ave:

GetMe <- with(mydata, 
       ave(Speed, ID, FUN = function(x) 
       x > quantile(x, prob = 1 - 20/100, na.rm = TRUE))) 

mydata[GetMe == 1, ] 
# ID Speed 
# 9 1 14.21430 
# 10 1 14.97416 
# 19 2 24.07647 
# 20 2 25.11825 
# 29 3 33.99307 
# 30 3 35.01244 
# 39 4 43.85756 
# 40 4 45.04996 
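
A small sketch of why the comparison above is `GetMe == 1` rather than a plain logical index: `ave` writes the per-group results back into a copy of its first argument, so with a numeric `Speed` the TRUE/FALSE values come back as 1/0. The `toy` data frame and the name `flag` are made up for illustration.

```r
# Hypothetical toy data standing in for mydata: 2 IDs, 5 rows each
toy <- data.frame(ID = rep(1:2, each = 5),
                  Speed = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50))

# Per-ID test "is this value above the group's 80th percentile?"
flag <- ave(toy$Speed, toy$ID,
            FUN = function(x) x > quantile(x, probs = 0.8, na.rm = TRUE))

is.numeric(flag)   # the logical results were coerced to numeric 0/1
toy[flag == 1, ]   # keep the flagged rows, with original row names
```

One advantage of this approach is that the original row order and row names are kept, since nothing is split apart and stacked again.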

The data.table package is also good for this:

library(data.table) 
DT <- data.table(mydata)  # must be a data.table for the by= syntax below to work
DT[, list(Speed = Speed[Speed > quantile(Speed, prob = 1 - 20/100, na.rm = TRUE)]), by = "ID"] 
# ID Speed 
# 1: 1 14.21430 
# 2: 1 14.97416 
# 3: 2 24.07647 
# 4: 2 25.11825 
# 5: 3 33.99307 
# 6: 3 35.01244 
# 7: 4 43.85756 
# 8: 4 45.04996 
+0

+1 nice use of ave – 2013-04-11 09:51:16

2

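One way is to split your data by ID, then use lapply on the resulting list of data frames to find the top 20% of each by quantile. Finally, use do.call with rbind to bind the results together.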

result <- do.call(rbind, lapply(split(mydata, mydata$ID), function(X) { 
    subset(X, Speed > quantile(Speed, prob = 1 - 20/100, na.rm = T)) 
})) 

result 
##  ID Speed 
## 1.9 1 14.21430 
## 1.10 1 14.97416 
## 2.19 2 24.07647 
## 2.20 2 25.11825 
## 3.29 3 33.99307 
## 3.30 3 35.01244 
## 4.39 4 43.85756 
## 4.40 4 45.04996 
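
To see the intermediate steps, here is a small sketch on made-up data. The `toy` data frame and the name `pieces` are hypothetical. It also shows one way to drop the compound row names (such as 1.9) that `rbind` builds from the list names, if you would rather have plain sequential ones.

```r
# Hypothetical toy data standing in for mydata: 2 IDs, 5 rows each
toy <- data.frame(ID = rep(1:2, each = 5),
                  Speed = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50))

# Step 1: split gives a list with one data frame per ID, named by ID
pieces <- split(toy, toy$ID)
names(pieces)

# Steps 2 and 3: subset each piece, then stack the pieces
result <- do.call(rbind, lapply(pieces, function(d) {
  subset(d, Speed > quantile(Speed, probs = 0.8, na.rm = TRUE))
}))

# Optional: reset the row names to 1, 2, ...
rownames(result) <- NULL
result
```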
2

Try this:

library(plyr) 

> ddply(mydata, .(ID), function(x) subset(x, Speed > quantile(Speed, prob = 1 - 20/100, na.rm=T))) 
    ID Speed 
1 1 14.21430 
2 1 14.97416 
3 2 24.07647 
4 2 25.11825 
5 3 33.99307 
6 3 35.01244 
7 4 43.85756 
8 4 45.04996 

@SimonO101

Try using melt from reshape2:

library(reshape2)

res <- aggregate(Speed ~ ID , data = mydata , function(x) { y <- rev(seq(length(x) , by = -1 ,length.out =(length(x)/5))) ; cbind(x[y[1]],x[y[2]]) }) 

> melt(res, id.vars="ID") 
    ID variable value 
1 1 Speed 14.21430 
2 2 Speed 24.07647 
3 3 Speed 33.99307 
4 4 Speed 43.85756 
5 1 Speed 14.97416 
6 2 Speed 25.11825 
7 3 Speed 35.01244 
8 4 Speed 45.04996 

After that, you may want to drop the second column :-).

1

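What the heck. Here is a one-line solution using aggregate in base R. You get a slightly different data format, with one row per ID and each speed value in its own column: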

aggregate(Speed ~ ID , data = mydata , function(x) { y <- rev(seq(length(x) , by = -1 ,length.out =(length(x)/5))) ; cbind(x[y[1]],x[y[2]]) }) 

    ID Speed.1 Speed.2 
#1 1 14.21430 14.97416 
#2 2 24.07647 25.11825 
#3 3 33.99307 35.01244 
#4 4 43.85756 45.04996 
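
One thing that may be worth knowing about this format: when the function passed to `aggregate` returns more than one value, the result column is a single matrix column that only prints as Speed.1 and Speed.2. A small sketch on made-up data follows. The `toy` data frame is hypothetical, and for simplicity the function here just takes the two largest values per ID rather than reproducing the index arithmetic above.

```r
# Hypothetical toy data standing in for mydata: 2 IDs, 5 rows each
toy <- data.frame(ID = rep(1:2, each = 5),
                  Speed = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50))

# Illustration only: two largest Speed values per ID
res <- aggregate(Speed ~ ID, data = toy,
                 FUN = function(x) tail(sort(x), 2))

ncol(res)             # 2 columns: ID and a matrix column Speed
is.matrix(res$Speed)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, res)
names(flat)
```

Flattening first can make later steps such as reshaping or merging less surprising.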
+0

Hi, take a look at my edit. You can use `melt` – Michele 2013-04-11 10:28:46

+1

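@Michele Thanks, top notch! Yes, I should have done that to keep the data in the form the OP wanted. Anyway, I had already +1'd your answer before the edit .-) – 2013-04-11 10:30:09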