2017-06-22 60 views
3

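How to match different combinations of strings in R

I have a data frame df of words separated by +, but I don't want the order to matter when I run my analysis. For example, I have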

df <- as.data.frame(
     c(("Yellow + Blue + Green"), 
     ("Blue + Yellow + Green"), 
     ("Green + Yellow + Blue"))) 

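At the moment these are treated as three unique responses, but I want them to be considered the same. I have tried brute-force approaches such as ifelse statements, but they don't scale to large datasets.

Is there a way to rearrange the terms so that they match, or something like a reverse combn function that can recognize that they are the same combination in a different order?

Thanks!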

Answers

6
#DATA 
df <- data.frame(cols = 
       c(("Yellow + Blue + Green"), 
        ("Blue + Yellow + Green"), 
        ("Green + Yellow + Blue"), 
        ("Green + Yellow + Red")), stringsAsFactors = FALSE) 

#Split, sort, and then paste together 
df$group = sapply(df$cols, function(a) 
    paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", ")) 
df 
#                   cols               group
#1 Yellow + Blue + Green Blue, Green, Yellow
#2 Blue + Yellow + Green Blue, Green, Yellow
#3 Green + Yellow + Blue Blue, Green, Yellow
#4  Green + Yellow + Red  Green, Red, Yellow

#Or you can convert to factors too (and back to numeric, if you like) 
df$group2 = as.numeric(as.factor(sapply(df$cols, function(a) 
     paste(sort(unlist(strsplit(a, " \\+ "))), collapse = ", ")))) 
df 
#                   cols               group group2
#1 Yellow + Blue + Green Blue, Green, Yellow      1
#2 Blue + Yellow + Green Blue, Green, Yellow      1
#3 Green + Yellow + Blue Blue, Green, Yellow      1
#4  Green + Yellow + Red  Green, Red, Yellow      2
+1

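Thanks d.b! Works wonderfully. One thing I should have been more specific about is that I still want the result in a + b + c format, but that is easily fixed by changing the 'collapse' argument. – Ablum89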
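The tweak mentioned in this comment can be sketched roughly as follows. `normalize_combo` is a hypothetical helper name, not something from the answer; it assumes every response uses " + " as the separator and simply reuses that separator when pasting the sorted terms back together.

```r
# Hypothetical helper: rewrite each response with its terms in sorted order,
# keeping the original " + " separator (assumed to be used consistently).
normalize_combo <- function(x, sep = " + ") {
  parts <- strsplit(x, sep, fixed = TRUE)  # fixed = TRUE: "+" is matched literally
  vapply(parts,
         function(p) paste(sort(trimws(p)), collapse = sep),
         character(1))
}

cols <- c("Yellow + Blue + Green",
          "Blue + Yellow + Green",
          "Green + Yellow + Blue",
          "Green + Yellow + Red")

normalized <- normalize_combo(cols)
normalized
# [1] "Blue + Green + Yellow" "Blue + Green + Yellow" "Blue + Green + Yellow"
# [4] "Green + Red + Yellow"

# Tally how often each distinct combination occurs
table(normalized)
```

Using `fixed = TRUE` avoids having to escape the `+`, and `vapply` guarantees one string per input row, which makes the result safe to assign back as a data frame column.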

0

I'd like to offer my take on this, since it isn't clear what output format you want:

I used the packages stringr and iterators, with the df created by d.b.

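library(stringr)    # str_extract_all(), boundary()
library(iterators)  # iter()

search <- c("Yellow", "Green", "Blue") 
# Extract the individual words from each response
L <- str_extract_all(df$cols, boundary("word")) 
# TRUE where a response contains every search term, in any order
sapply(iter(L), function(x) all(search %in% x)) 
# [1] TRUE TRUE TRUE FALSE 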
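One thing to keep in mind with the `%in%` check above is that it tests whether a response contains all the search terms, so a response with extra terms would also match. If an exact match on the set of terms is wanted, base R's `setequal` is one option. The sketch below uses made-up example responses (not the df from the answer) and assumes " + " is the separator.

```r
search <- c("Yellow", "Green", "Blue")

# Illustrative responses, chosen so the two tests disagree on the second one
cols <- c("Yellow + Blue + Green",
          "Blue + Yellow + Green + Red",
          "Green + Yellow + Red")

terms <- strsplit(cols, " + ", fixed = TRUE)

# Superset test: TRUE whenever every search term appears in the response
contains_all <- vapply(terms, function(x) all(search %in% x), logical(1))

# Exact test: TRUE only when the response has the same set of terms, in any order
same_set <- vapply(terms, function(x) setequal(x, search), logical(1))

contains_all
# [1]  TRUE  TRUE FALSE
same_set
# [1]  TRUE FALSE FALSE
```

Which of the two is appropriate depends on whether responses with additional terms should count as a match.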