2016-09-15 51 views
1

Removing rows of a data frame in R based on matches across multiple rows

I have a data frame that looks like this:

content            ChatPosition 
This is a start line         START 
This is a middle line         MIDDLE 
This is a middle line         MIDDLE 
This is the last line         END 
This is a start line with a subsequent middle or end START 
This is another start line without a middle or an end START 
This is a start line         START 
This is a middle line         MIDDLE 
This is the last line         END 

content <- c("This is a start line" , "This is a middle line" , "This is a  middle line" ,"This is the last line" , 
     "This is a start line with a subsequent middle or end" , "This is  another start line without a middle or an end" , 
     "This is a start line" , "This is a middle line" , "This is the last line") 
ChatPosition <- c("START" , "MIDDLE" , "MIDDLE" , "END" , "START" ,"START" , "START" ,"MIDDLE" , "END") 
df <- data.frame(content, ChatPosition) 

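I want to delete a row that contains a START, but only if the next row does not contain MIDDLE or END in the ChatPosition column, so that the result looks like this: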

content            ChatPosition 
This is a start line         START 
This is a middle line         MIDDLE 
This is a middle line         MIDDLE 
This is the last line         END 
This is a start line         START 
This is a middle line         MIDDLE 
This is the last line         END 

# Print every START row that is immediately followed by another START.
# Stop at nrow(df) - 1 so that df$ChatPosition[jjj + 1] never runs past the last row.
for (jjj in seq_len(nrow(df) - 1)) {
    if (df$ChatPosition[jjj] == "START" && df$ChatPosition[jjj + 1] == "START") {
        print(df$content[jjj])
    }
}

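With the code above I can print the two rows I want to delete. What is the most elegant way to actually delete them?

Is there a proper way to do this, or a library that makes this kind of thing easier? And would that approach also work for nested cases?

Regards, Jonathan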

Answers

2

This should work for you.

df[!(as.character(df$ChatPosition) == "START" & 
    c(tail(as.character(df$ChatPosition), -1), "END") == "START"), ] 

        content ChatPosition 
1  This is a start line  START 
2  This is a middle line  MIDDLE 
3 This is a  middle line  MIDDLE 
4  This is the last line   END 
7  This is a start line  START 
8  This is a middle line  MIDDLE 
9  This is the last line   END 

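The first argument to [] is a logical vector that tells R which rows to keep. I use tail(x, -1), which drops the first element, to line each value of df$ChatPosition up with the next observation for the comparison. Note that because df$ChatPosition is a factor variable, it has to be converted to character in the second line so that "END" can be concatenated at the final position.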
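To see the shift in isolation, here is a minimal sketch on a plain character vector. The vector `pos` and the names `next_pos` and `keep` are made up for illustration and are not part of the answer.

```r
# Illustrative data only: a short sequence of chat positions.
pos <- c("START", "MIDDLE", "END", "START", "START", "MIDDLE", "END")

# Shift by one: element i of next_pos is element i + 1 of pos.
# "END" pads the last slot so both vectors have the same length.
next_pos <- c(tail(pos, -1), "END")

# Keep everything except a START that is directly followed by another START.
keep <- !(pos == "START" & next_pos == "START")

pos[keep]
```

Here only the fourth element is a START followed by another START, so it is the only one dropped.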

+0

Thanks Imo, that is a very nice and elegant (very R) solution. I have tried it and it gives the desired result. Many thanks.

3

Use grep. You can compare the speed of this solution with the loop on your real data set.

start_indices = grep("START", ChatPosition)
end_indices = grep("END", ChatPosition)

# For each END, take the closest START that precedes it
match_indices = sapply(end_indices, function(x) tail(start_indices[(start_indices - x) < 0], 1))
match_indices
# [1] 1 7
# START rows that are never matched to an END are the ones to delete
del_indices = setdiff(start_indices, match_indices)
del_indices
# [1] 5 6
DF_subset = df[-del_indices, ]
DF_subset
#         content ChatPosition
# 1  This is a start line  START
# 2  This is a middle line  MIDDLE
# 3 This is a  middle line  MIDDLE
# 4  This is the last line   END
# 7  This is a start line  START
# 8  This is a middle line  MIDDLE
# 9  This is the last line   END
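One thing to watch with negative indexing: if there is nothing to delete, `df[-integer(0), ]` returns zero rows rather than the full data frame. The sketch below wraps the same index-matching idea in a function with a guard for that case. The helper name `drop_unmatched_starts`, its `col` argument, and the `chat` example data are hypothetical, and it uses exact comparison with `which()` instead of `grep()`, assuming the column holds exactly "START" and "END".

```r
# Hypothetical helper, not from the answer: drop every START row that is not
# the closest START before some END.
drop_unmatched_starts <- function(d, col = "ChatPosition") {
  pos <- as.character(d[[col]])          # also works if the column is a factor
  start_idx <- which(pos == "START")
  end_idx <- which(pos == "END")

  # For each END, the closest preceding START (NA if there is none).
  matched <- vapply(end_idx, function(e) {
    prev <- start_idx[start_idx < e]
    if (length(prev) > 0) prev[length(prev)] else NA_integer_
  }, integer(1))

  unmatched <- setdiff(start_idx, matched)

  # Guard: d[-integer(0), ] would select zero rows, so only index when needed.
  if (length(unmatched) > 0) d[-unmatched, , drop = FALSE] else d
}

# Illustrative data only.
chat <- data.frame(
  content = c("a", "b", "c", "d"),
  ChatPosition = c("START", "START", "MIDDLE", "END"),
  stringsAsFactors = FALSE
)

result <- drop_unmatched_starts(chat)       # drops row 1
unchanged <- drop_unmatched_starts(result)  # nothing left to drop
```

Using `vapply()` with a fixed return type also avoids the list that `sapply()` would produce if some END had no START before it.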
+0

Thanks Osssan, this is also a useful solution using grep. Cheers.

1

Another option:

library(dplyr) 
filter(df, !(ChatPosition == "START" & lead(ChatPosition) == "START")) 

which gives:

#      content ChatPosition 
#1  This is a start line  START 
#2  This is a middle line  MIDDLE 
#3 This is a  middle line  MIDDLE 
#4  This is the last line   END 
#5  This is a start line  START 
#6  This is a middle line  MIDDLE 
#7  This is the last line   END
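An edge case worth checking with any of these approaches is a START on the very last row. By default dplyr's `lead()` pads the end with NA, and `filter()` keeps only rows where the condition is TRUE, so such a row would be dropped. Padding with "END", as in the base R answer, would keep it. The base R sketch below mimics the one-step lead to show where the NA comes from. The vector `pos` and the other names are made up for illustration.

```r
# Illustrative data only: the sequence ends on an unfinished START.
pos <- c("START", "MIDDLE", "END", "START")

# Same shape as a one-step lead: shift by one and pad the end with NA.
next_pos <- c(pos[-1], NA)

# The last element compares TRUE & NA, which is NA rather than TRUE or FALSE.
keep <- !(pos == "START" & next_pos == "START")

# Decide explicitly what a trailing START should mean.
keep_trailing <- is.na(keep) | keep    # treat "no next row" as acceptable
drop_trailing <- !is.na(keep) & keep   # treat "no next row" as unfinished

pos[keep_trailing]
pos[drop_trailing]
```

Which of the two is right depends on whether a chat that has only just started should count as valid in your data.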