2015-10-18 76 views
5

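Remove vector elements that are a substring of another

Is there a better way to achieve this? I want to remove from this vector every string that is a substring of another element.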

words = c("please can you", 
    "please can", 
    "can you", 
    "how did you", 
    "did you", 
    "have you") 
> words 
[1] "please can you" "please can"  "can you"  "how did you" "did you"  "have you" 

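library(data.table) 
library(stringr) 
# Build every (word1, word2) pair 
dt = setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE)) 
# Flag pairs where word2 occurs in word1 (str_detect treats word2 as a regex) 
dt[, found := str_detect(word1, word2)] 
# Drop every word that was found inside a different word 
setdiff(words, dt[found == TRUE & word1 != word2, word2]) 
[1] "please can you" "how did you" "have you" 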

This works, but it seems like overkill, and I would love to know a more elegant way to do it.

+3

'CJ' is a much faster 'expand.grid' from 'data.table'. – jenesaisquoi

+0

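Just wanted to follow up with some numbers for anyone reading this. 'CJ' is **much faster**. I used '12431' rows averaging '15.69' words per row, for a total set of '195065' words. 'system.time(dt <- setDT(expand.grid(word1 = words, word2 = words, stringsAsFactors = FALSE)))' gave user/system/elapsed times of '8.414 3.387 13.854', while 'system.time(dt1 <- CJ(words, words, unique = TRUE))' gave '0.932 0.365 1.320'. An order of magnitude difference. –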

+0

Awesome, thanks for the benchmark. –
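For readers who want to try the 'CJ' suggestion from the comments, here is a minimal sketch of the question's approach with 'expand.grid' swapped for 'CJ'. The column names 'word1' and 'word2' follow the question. Wrapping the pattern in 'fixed()' is an added assumption, so that the words are matched literally rather than as regular expressions.

```r
library(data.table)
library(stringr)

words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")

# CJ() builds the cross join of all (word1, word2) pairs as a data.table,
# so no setDT(expand.grid(...)) step is needed
dt <- CJ(word1 = words, word2 = words, unique = TRUE)

# Assumption: fixed() makes str_detect match word2 literally, not as a regex
dt[, found := str_detect(word1, fixed(word2))]

# Keep the words that never occur inside a different word
result <- setdiff(words, dt[found == TRUE & word1 != word2, word2])
print(result)
```

'CJ' sorts its output by default, but 'setdiff' keeps the order of 'words', so the result comes back in the original order.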

Answers

6

Search for each component of words within words, keeping those that occur exactly once:

words[colSums(sapply(words, grepl, words, fixed = TRUE)) == 1] 

giving:

[1] "please can you" "how did you" "have you" 
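To see why the one-liner works, it may help to split it into steps. This is only an illustrative breakdown; the intermediate names 'm', 'hits' and 'keep' are not from the answer.

```r
words <- c("please can you", "please can", "can you",
           "how did you", "did you", "have you")

# m[i, j] is TRUE when words[j] occurs inside words[i];
# fixed = TRUE matches the text literally instead of as a regex
m <- sapply(words, grepl, words, fixed = TRUE)

# Column j counts how many elements contain words[j]. Every word contains
# itself, so a count of 1 means no other element contains it.
hits <- colSums(m)

keep <- words[hits == 1]
print(keep)
```

One caveat: if 'words' contains duplicates, each copy matches the other, so the count is at least 2 and every copy is dropped. Calling 'unique(words)' first avoids that.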
+0

This is really great, thank you very much! –
