2017-06-04

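I have the following data frame with columns X and Y, and I want to match words or phrases that are the same or closely similar.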

   X                              Y
1  SAN DIEGO                      FOND DU LAC
2  THE RIO GRANDE                 RIO GRANDE
3  RIO GRANDE                     RIO GRANDE
4  WEST TENNESSEE                 TENNESSEE
5  EP De SAN JOAQUIN              De SAN JOAQUIN
6  SOUTHERN VIRGINIA              VIRGINIA
7  SOUTHERN VIRGINIA              SOUTHWESTERN VIRGINIA
8  EN COLOMBIA                    COLOMBIA
9  THE EP De NORTHERN CALIFORNIA  De NORTHERN CALIFORNIA
10 FLORIDA                        NEW JERSY

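I want to keep only the rows that do not match, which are rows 1 and 10. Rows 2-9 are exact or close matches and are fine as they are. My expected data frame is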

   X          Y
1  SAN DIEGO  FOND DU LAC
10 FLORIDA    NEW JERSY

Answer


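In R, we can split the strings in each column on whitespace, check whether the words of X and Y have any intersect, get the lengths of the resulting list, and subset the rows of the dataset where that length is 0.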

df1[!lengths(Map(intersect, strsplit(df1$X, "\\s+"), strsplit(df1$Y, "\\s+"))),] 
#   X   Y 
#1 SAN DIEGO FOND DU LAC 
#10 FLORIDA NEW JERSY 
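
The one-liner above can be unpacked into named steps, which makes it easier to see what each call contributes. This is only a sketch: the question does not show how df1 was created, so the data.frame() call below is an assumption (in particular stringsAsFactors = FALSE, since strsplit needs character columns rather than factors), and the intermediate names words_x, words_y, shared and no_overlap are made up for illustration.

```r
# Rebuild the example data; the construction is assumed, not taken from the question.
df1 <- data.frame(
  X = c("SAN DIEGO", "THE RIO GRANDE", "RIO GRANDE", "WEST TENNESSEE",
        "EP De SAN JOAQUIN", "SOUTHERN VIRGINIA", "SOUTHERN VIRGINIA",
        "EN COLOMBIA", "THE EP De NORTHERN CALIFORNIA", "FLORIDA"),
  Y = c("FOND DU LAC", "RIO GRANDE", "RIO GRANDE", "TENNESSEE",
        "De SAN JOAQUIN", "VIRGINIA", "SOUTHWESTERN VIRGINIA",
        "COLOMBIA", "De NORTHERN CALIFORNIA", "NEW JERSY"),
  stringsAsFactors = FALSE  # keep X and Y as character vectors for strsplit
)

# Split every string into words on runs of whitespace (one list element per row).
words_x <- strsplit(df1$X, "\\s+")
words_y <- strsplit(df1$Y, "\\s+")

# For each row, the words that occur in both X and Y.
shared <- Map(intersect, words_x, words_y)

# A row counts as unmatched when it has no shared word at all.
no_overlap <- lengths(shared) == 0

res <- df1[no_overlap, ]
print(res)
```

Note that intersect compares whole words exactly and is case-sensitive, so a pair that differs only by a misspelling or by letter case would be reported as unmatched.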

Instead of splitting each column separately, we can also loop over the columns and do the split

df1[!lengths(do.call(Map, c(intersect, unname(lapply(df1, strsplit, split="\\s+"))))),] 
#  X   Y 
#1 SAN DIEGO FOND DU LAC 
#10 FLORIDA NEW JERSY 

Or another option is stringdist, here keeping the two rows with the largest q-gram distance between X and Y

library(stringdist) 
i1 <- with(df1, stringdist(X, Y, method = "qgram")) 
df1[i1 %in% tail(sort(i1), 2),] 
#   X   Y 
#1 SAN DIEGO FOND DU LAC 
#10 FLORIDA NEW JERSY
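
If stringdist is not available, a similar ranking can be sketched with base R's adist, which computes a Levenshtein-type edit distance. This swaps in a different distance from the q-gram one used above, so the scores will not be the same. The data.frame() call is again an assumed reconstruction of df1, and the names edit_dist, score and ranked are made up for illustration. Dividing by the longer string's length is one possible way to put rows of different lengths on a common 0 to 1 scale; any cutoff would have to be picked by looking at the scores.

```r
# Rebuild the example data; the construction is assumed, not taken from the question.
df1 <- data.frame(
  X = c("SAN DIEGO", "THE RIO GRANDE", "RIO GRANDE", "WEST TENNESSEE",
        "EP De SAN JOAQUIN", "SOUTHERN VIRGINIA", "SOUTHERN VIRGINIA",
        "EN COLOMBIA", "THE EP De NORTHERN CALIFORNIA", "FLORIDA"),
  Y = c("FOND DU LAC", "RIO GRANDE", "RIO GRANDE", "TENNESSEE",
        "De SAN JOAQUIN", "VIRGINIA", "SOUTHWESTERN VIRGINIA",
        "COLOMBIA", "De NORTHERN CALIFORNIA", "NEW JERSY"),
  stringsAsFactors = FALSE
)

# Row-wise edit distance between X and Y (adist is in utils, attached by default).
edit_dist <- vapply(
  seq_len(nrow(df1)),
  function(i) adist(df1$X[i], df1$Y[i])[1, 1],
  numeric(1)
)

# Scale by the longer of the two strings: 0 means identical, 1 means nothing in common.
score <- edit_dist / pmax(nchar(df1$X), nchar(df1$Y))

# Show the rows from most to least different, with their scores alongside.
ranked <- cbind(df1, score = round(score, 2))[order(-score), ]
print(ranked)
```

Compared with hard-coding the top two rows, inspecting the ranked scores avoids having to know in advance how many rows are mismatched.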