2017-04-19

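Remove rows between values in a column

I have a very large data frame and I want to remove, by id, the rows that fall between occurrences of a value in one column, but only when they lie strictly between those occurrences, not at the start or end of the group. In this example I want to delete the rows with or = 'base' that lie between rows with or = 'plan'.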

id <- c(1,1,1,1,1,1,2,2,2,2,2,2) 
fd <- c(101,102,103,104,105,106,101,102,103,104,105,106) 
rem <- c(100,120,120,140, 140, 150, 200,220,220,250, 300, 310) 
or <- c("base", "base", "plan", "base", "plan", "base", "plan", "base", 
"plan", "base", "plan", "base") 
df <- data.frame(id, fd, rem, or) 

The expected result, with the 'base' rows between 'plan' rows removed:

id1 <- c(rep(1,5), rep(2,4)) 
fd1 <- c(101,102,103,105,106, 101,103,105,106) 
or1 <- c("base", "base", "plan", "plan", "base", "plan", "plan", "plan", "base") 

df1 <- data.frame(id1,fd1,or1) 
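
What if you have several instances of 'base'/'plan' for some IDs? – akrun

I want to remove every row between 'plan' rows within the same id. For example, for id 1 I want to keep the first two 'base' rows and the last one (just before id 2 starts). – AngeG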

Answers

4

Two possible solutions:

1) Using base R:

idx <- ave(df$or, df$id, FUN = function(x) x=='base' & c('base',head(x,-1))=='plan' & c(tail(x,-1),'base')=='plan')=='FALSE' 
df[idx,] 

which gives:

   id  fd rem   or
1   1 101 100 base
2   1 102 120 base
3   1 103 120 plan
5   1 105 140 plan
6   1 106 150 base
7   2 101 200 plan
9   2 103 220 plan
11  2 105 300 plan
12  2 106 310 base
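
To see why the condition in solution 1 only flags interior rows, it can help to look at the shifted vectors for a single id. The following is a minimal sketch using just the `or` values of id 1 from the example; the names `x`, `prev`, `nxt`, and `drop` are illustrative and do not appear in the answer.

```r
# `or` values for id 1 in the example data
x <- c("base", "base", "plan", "base", "plan", "base")

# Previous and next value for each row. Padding with 'base' means the
# first and last rows can never have 'plan' on both sides.
prev <- c("base", head(x, -1))
nxt  <- c(tail(x, -1), "base")

# A row is dropped only if it is 'base' with 'plan' directly before and after
drop <- x == "base" & prev == "plan" & nxt == "plan"

# Side-by-side view of the pieces
print(data.frame(x, prev, nxt, drop))

# Position (within the id) of the row that would be removed
which(drop)
```

Because the test only looks one row back and one row ahead, it catches a single 'base' row between two 'plan' rows, but not a run of two or more.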

2) Using the data.table package:

library(data.table) 
setDT(df) 

idx <- df[, .I[!(or=='base' & shift(or, fill = 'base')=='plan' & shift(or, fill = 'base', type = 'lead')=='plan')], id]$V1 
df[idx] 

which gives:

   id  fd rem   or
1:  1 101 100 base
2:  1 102 120 base
3:  1 103 120 plan
4:  1 105 140 plan
5:  1 106 150 base
6:  2 101 200 plan
7:  2 103 220 plan
8:  2 105 300 plan
9:  2 106 310 base

Or in one go:

library(data.table) 
setDT(df)[df[, .I[!(or=='base' & shift(or, fill = 'base')=='plan' & shift(or, fill = 'base', type = 'lead')=='plan')], id]$V1] 

In response to the comment: you can use the rle function to detect more than one 'base' row between 'plan' rows, as follows (in base R):

# create new example dataset 
df2 <- df[c(1:3,4,4,5:7,8,8,9:12),] 

# the new example dataset: 
> df2 
    id  fd rem   or
1    1 101 100 base
2    1 102 120 base
3    1 103 120 plan
4    1 104 140 base
4.1  1 104 140 base
5    1 105 140 plan
6    1 106 150 base
7    2 101 200 plan
8    2 102 220 base
8.1  2 102 220 base
9    2 103 220 plan
10   2 104 250 base
11   2 105 300 plan
12   2 106 310 base

# define function 
f <- function(x) { 
    rl <- rle(x) 
    rl$values <- !(rl$values == 'base' & c('base',head(rl$values,-1))=='plan' & c(tail(rl$values,-1),'base')=='plan') 
    inverse.rle(rl) 
} 

# apply the function to each id-group and create an index 
idx2 <- as.logical(ave(df2$or, df2$id, FUN = f)) 

# finally subset your data with the logical-index 
df2[idx2,] 

which gives:

> df2[idx2,] 
   id  fd rem   or
1   1 101 100 base
2   1 102 120 base
3   1 103 120 plan
5   1 105 140 plan
6   1 106 150 base
7   2 101 200 plan
9   2 103 220 plan
11  2 105 300 plan
12  2 106 310 base
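
The rle approach works on runs of equal values rather than on individual rows. As a rough illustration, here is what the intermediate objects look like for the `or` values of id 1 in df2; the variable names `x`, `rl`, `prev`, `nxt`, `keep_run`, and `keep` are illustrative only.

```r
# `or` values for id 1 in df2: two consecutive 'base' rows between 'plan' rows
x <- c("base", "base", "plan", "base", "base", "plan", "base")

# Collapse consecutive duplicates into runs
rl <- rle(x)
rl$values    # one entry per run
rl$lengths   # how many rows each run covers

# Apply the "base between two plans" test to runs instead of rows
prev <- c("base", head(rl$values, -1))
nxt  <- c(tail(rl$values, -1), "base")
keep_run <- !(rl$values == "base" & prev == "plan" & nxt == "plan")

# Expand the per-run result back to one value per row
rl$values <- keep_run
keep <- inverse.rle(rl)
keep
```

Each run of 'base' rows is treated as one unit, so a run of any length between two 'plan' runs is dropped as a whole.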

Another option in base R (inspired by @Frank's data.table suggestion in the comments):

f2 <- function(x) { 
    i <- seq_along(x) 
    w <- which(x == 'plan') 
    b <- which(x == 'base') 
    ib <- b[b > head(w,1) & b < tail(w,1)] 
    !(i %in% ib) 
} 

idx3 <- unlist(by(df2$or, df2$id, f2)) 
df2[idx3,] 
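
As a quick sanity check, the rle-based function and the position-based function can be compared on the `or` column of df2. This is only a sketch: it rebuilds the two relevant columns by hand, copies the logic of `f` and `f2`, and uses `split()` with `lapply()` rather than the `ave()` and `by()` calls from the answer. The names `keep_rle` and `keep_pos` are made up for the illustration.

```r
# `id` and `or` columns of df2 from the example
id <- rep(c(1, 2), each = 7)
or <- c("base", "base", "plan", "base", "base", "plan", "base",
        "plan", "base", "base", "plan", "base", "plan", "base")

# rle-based version: drop whole runs of 'base' enclosed by 'plan' runs
f <- function(x) {
    rl <- rle(x)
    rl$values <- !(rl$values == 'base' &
                   c('base', head(rl$values, -1)) == 'plan' &
                   c(tail(rl$values, -1), 'base') == 'plan')
    inverse.rle(rl)
}

# position-based version: drop 'base' rows between the first and last 'plan'
f2 <- function(x) {
    i <- seq_along(x)
    w <- which(x == 'plan')
    b <- which(x == 'base')
    ib <- b[b > head(w, 1) & b < tail(w, 1)]
    !(i %in% ib)
}

# split() preserves the row order here because the data are already
# grouped by id
keep_rle <- unlist(lapply(split(or, id), f),  use.names = FALSE)
keep_pos <- unlist(lapply(split(or, id), f2), use.names = FALSE)

identical(keep_rle, keep_pos)   # do both approaches agree?
which(!keep_rle)                # rows that would be removed
```

If the rows were not already grouped by id, the `split()`/`unlist()` step would reorder the result, so an index built with `ave()` as in the answer would be the safer choice.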

With data.table you can follow @Frank's suggestion:

setDT(df2) 
df2[, keep := {isp = or == "plan"; wp = which(isp); r = 1:.N; isp | r < first(wp) | r > last(wp)}, by = id 
    ][!!keep] 

Data used:

df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), 
        fd = c(101, 102, 103, 104, 105, 106, 101, 102, 103, 104, 105, 106), 
        rem = c(100, 120, 120, 140, 140, 150, 200, 220, 220, 250, 300, 310), 
        or = c("base", "base", "plan", "base", "plan", "base", "plan", "base", "plan", "base", "plan", "base")), 
       .Names = c("id", "fd", "rem", "or"), row.names = c(NA, -12L), class = "data.frame") 
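
Any idea how to modify the code to remove the rows when, after a 'plan', I have two or more rows with 'base' and then 'plan' again? Thanks – AngeG

@AngeG See the update, HTH – Jaap

Instead of finding which rows to drop, you could identify the ones to keep (all 'plan' rows, plus everything before the first 'plan' or after the last 'plan'), like `df[, keep := {isp = or == "plan"; wp = which(isp); r = 1:.N; isp | r < first(wp) | r > last(wp)}, by = id]` or something similar. – Frank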