2015-07-21 58 views
77

Select first and last row from grouped data

Using dplyr, how do I select the top and bottom observations/rows of grouped data in a single statement?

Data & example

Given a data frame:

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
       stopId=c("a","b","c","a","b","c","a","b","c"), 
       stopSequence=c(1,2,3,3,1,4,3,1,2)) 

I can get the top and bottom observations from each group using slice, but only with two separate statements:

firstStop <- df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    slice(1) %>% 
    ungroup 

lastStop <- df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    slice(n()) %>% 
    ungroup 

Can I combine these two statements into one that selects both the top and bottom observations?

Answers

126

There is probably a faster way:

df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    filter(row_number()==1 | row_number()==n()) 
+37

`rownumber() %in% c(1, n())` would avoid the need to run the vector scan twice – MichaelChirico

+5

@MichaelChirico I suspect you omitted an `_`? That is, `filter(row_number() %in% c(1, n()))` –
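The suggestion from the comments above can be written out in full as follows. This is a minimal sketch using the sample data from the question; the result name `both_ends` is only illustrative and not part of the original thread.

```r
library(dplyr)

# Same sample data as in the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Keep rows whose within-group position is either 1 or the group size,
# using a single %in% test instead of two separate comparisons
both_ends <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  filter(row_number() %in% c(1, n())) %>%
  ungroup()

print(both_ends)
```

Note that `filter` keeps the rows in the order produced by `arrange`, so the output is sorted by stopSequence rather than by id.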

6

Something like:

library(dplyr) 

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
       stopId=c("a","b","c","a","b","c","a","b","c"), 
       stopSequence=c(1,2,3,3,1,4,3,1,2)) 

first_last <- function(x) { 
    bind_rows(slice(x, 1), slice(x, n())) 
} 

df %>% 
    group_by(id) %>% 
    arrange(stopSequence) %>% 
    do(first_last(.)) %>% 
    ungroup 

## Source: local data frame [6 x 3] 
## 
## id stopId stopSequence 
## 1 1  a   1 
## 2 1  c   3 
## 3 2  b   1 
## 4 2  c   4 
## 5 3  b   1 
## 6 3  a   3 

With do you can perform pretty much any number of operations on the group, but @jeremycg's answer is far more appropriate for this particular task.
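As an illustration of a per-group operation that goes slightly beyond what a bare `slice` returns, the sketch below labels each selected row as "first" or "last". The helper `first_last_labelled` and the `position` column are hypothetical names introduced for this example, and it assumes every group has at least two rows.

```r
library(dplyr)

# Same sample data as in the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Hypothetical helper: take the first and last row of one group
# and tag each with its position
first_last_labelled <- function(x) {
  out <- x[c(1, nrow(x)), ]
  out$position <- c("first", "last")
  out
}

labelled <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  do(first_last_labelled(.)) %>%
  ungroup()

print(labelled)
```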

+1

Hadn't considered writing a function - certainly a way to handle something more complex. – tospig

+1

This seems overcomplicated compared to just using `slice`, as in `df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))` – Frank

+3

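I don't disagree (and I pointed to jeremycg's as the better answer in the post), but having a `do` example here may help others when `slice` won't work (e.g. for more complex operations on a group). Also, you should post your comment as an answer (it's the best one). – hrbrmstr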

13

Not dplyr, but it's much more direct using data.table:

library(data.table) 
setDT(df) 
df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ] 
# id stopId stopSequence 
# 1: 1  a   1 
# 2: 1  c   3 
# 3: 2  b   1 
# 4: 2  c   4 
# 5: 3  b   1 
# 6: 3  a   3 

A more detailed explanation:

# 1) get row numbers of first/last observations from each group 
# * basically, we sort the table by id/stopSequence, then, 
#  grouping by id, name the row numbers of the first/last 
#  observations for each id; since this operation produces 
#  a data.table 
# * .I is data.table shorthand for the row number 
# * here, to be maximally explicit, I've named the variable V1 
#  as row_num to give other readers of my code a clearer 
#  understanding of what operation is producing what variable 
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id] 
idx = first_last$row_num 

# 2) extract rows by number 
df[idx] 

Be sure to check out the Getting Started wiki to get the data.table basics covered.

+1

Or `df[ df[order(stopSequence), .I[c(1,.N)], keyby=id]$V1 ]`. Seeing `id` appear twice looks odd to me. – Frank

+0

You can set the keys in the `setDT` call, so the `order` call isn't needed here. –

+1

@ArtemKlevtsov - you may not always want to set the keys, though. – SymbolixAU
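The keyed variant discussed in these comments might look like the sketch below. It assumes you are happy for the table to be physically re-sorted by the key, which, as noted above, is not always desirable. The result name `ends` is illustrative only.

```r
library(data.table)

# Same sample data as in the question
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Convert by reference and set the key in one call;
# setting the key sorts the table by id, then stopSequence
setDT(df, key = c("id", "stopSequence"))

# With the rows already ordered, take the first and last row of each group
ends <- df[, .SD[c(1L, .N)], by = id]

print(ends)
```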

66

Just for completeness: you can pass slice a vector of indices:

df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n())) 

which gives

id stopId stopSequence 
1 1  a   1 
2 1  c   3 
3 2  b   1 
4 2  c   4 
5 3  b   1 
6 3  a   3

4
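One edge case worth keeping in mind with an index vector: for a group with only one row, `c(1, n())` is `c(1, 1)`, so that row can come back twice. A possible guard, sketched here on hypothetical data (the `small` data frame is not from the original question), is to wrap the indices in `unique()`:

```r
library(dplyr)

# Hypothetical data: id 2 has only a single row
small <- data.frame(id = c(1, 1, 2),
                    stopSequence = c(2, 1, 5))

# unique() collapses c(1, 1) to 1 for single-row groups,
# so each row is selected at most once
ends <- small %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%
  ungroup()

print(ends)
```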

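I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too: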

Base package:

df <- df[with(df, order(id, stopSequence, stopId)), ] 
merge(df[!duplicated(df$id), ], 
     df[!duplicated(df$id, fromLast = TRUE), ], 
     all = TRUE) 

data.table:

df <- setDT(df) 
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id] 

sqldf:

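library(sqldf) 
# Smallest stopSequence per id (names chosen to avoid masking base::min and base::max)
min_stop <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence 
     FROM df GROUP BY id 
     ORDER BY id, StopSequence, stopId") 
# Largest stopSequence per id
max_stop <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence 
     FROM df GROUP BY id 
     ORDER BY id, StopSequence, stopId") 
# Stack both results; UNION also drops duplicate rows
sqldf("SELECT * FROM min_stop 
     UNION 
     SELECT * FROM max_stop") 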

In one query:

sqldf("SELECT * 
     FROM (SELECT id, stopId, min(stopSequence) AS StopSequence 
       FROM df GROUP BY id 
       ORDER BY id, StopSequence, stopId) 
     UNION 
     SELECT * 
     FROM (SELECT id, stopId, max(stopSequence) AS StopSequence 
       FROM df GROUP BY id 
       ORDER BY id, StopSequence, stopId)") 

Output:

id stopId StopSequence 
1 1  a   1 
2 1  c   3 
3 2  b   1 
4 2  c   4 
5 3  a   3 
6 3  b   1