Closest match of sentences between two data frames in R

I have two data frames. The first is stored in an object named b:

structure(list(CONTENT = c("@myntra beautiful teamä»ç where is the winners list?", 
"The best ever Puma wishlist for Workout freaks, Head over to @myntra https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good", 
"I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!", 
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym. https://t.co/VeRy4G3c7X https://t.co/fOpBRWCdSh", 
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym. https://t.co/VeRy4G3c7X.....", 
"@DrDrupad @myntra #myPUMAcollection superb :)", "Super exclusive collection @myntra #myPUMAcollection https://t.co/Qm9dZzJdms", 
"@myntra gave my best Love playing wid u Hope to win #myPUMAcollection", 
"Check out PUMA Unisex Black Running Performance Gloves on Myntra! https://t.co/YD6IcvuG98 @myntra #myPUMAcollection", 
"@myntra i have been mailing my issue daily since past week.All i get in reply is an auto generated assurance mail. 1st time pissed wd myntra" 
), score = c(7.129, 7.08, 6.676, 5.572, 5.572, 5.535, 5.424, 
5.205, 4.464, 4.245)), .Names = c("CONTENT", "score"), row.names = c(25L, 
103L, 95L, 66L, 90L, 75L, 107L, 32L, 184L, 2L), class = "data.frame")

The second data frame is stored in an object named c:

structure(list(CONTENT = c("The best ever for workout over to myntra like if you find it good", 
"i finalised buy a top myntra and found the at in feel like i so in life" 
)), .Names = "CONTENT", row.names = c(103L, 95L), class = "data.frame")

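For each sentence in the second data frame (c), I want to find the closest match in the first data frame (b) and return the corresponding score from the first data frame (b).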

For example, the sentence The best ever for workout over to myntra like if you find it good closely matches the second sentence of the first data frame, so I should get back the score 7.080.

I tried some code from Stack Overflow with a few tweaks:

library(stringr)    # provides str_split()
library(data.table) 
cp <- str_split(c$CONTENT, " ")   # split each sentence of c into words
nn <- lengths(cp) ## Or, for < R-3.2.0, `nn <- sapply(cp, length)` 
dt <- data.table(grp=rep(seq_along(nn), times=nn), X = unlist(cp), key="grp") 
dt[,Score:=b$score[pmatch(X,b$CONTENT)]] 
dt[!is.na(Score), list(avgScore=sum(Score)), by="grp"]

This returns a value for only one of the sentences in data frame c. Can anyone help?

2016-03-05 LeArNr

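Are you committed to this 'str_split' / 'pmatch' approach for determining the best match for a given phrase? There are proper fuzzy matching algorithms for cases like this, and they would likely produce better results. – nrussell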

@nrussell Not really... could you let me know what kind of fuzzy matching algorithm could be used here? – LeArNr

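Here is one approach using stringsim from the stringdist package. There are several methods (algorithms) to choose from; I used the Jaro distance metric to compute similarity because it seemed to produce reasonable results for your data. That said, my experience with this subject is casual at best, so you may want to spend some time reading about and experimenting with the various algorithms stringdist offers.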
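To get a feel for how the choice of method changes the result, a small sketch like the following compares a few of the algorithms on a single phrase pair. The two strings are adapted from the question's data, and the set of methods is only an illustrative selection, not a recommendation.

```r
library(stringdist)

# Two phrases adapted from the question's data (illustrative only)
query     <- "the best ever for workout over to myntra"
candidate <- "The best ever Puma wishlist for Workout freaks, Head over to @myntra"

# A few of the algorithms accepted by stringsim(); see help("stringdist-metrics")
methods <- c("osa", "lv", "lcs", "cosine", "jaccard", "jw")

# stringsim() returns a similarity in [0, 1], where 1 means identical strings
sims <- vapply(
  methods,
  function(m) stringsim(query, candidate, method = m),
  numeric(1)
)

# Higher means more similar; values are not directly comparable across methods
print(round(sims, 3))

# method = "jw" with p = 0 (the default) is plain Jaro;
# p > 0 adds the Winkler prefix bonus
stringsim(query, candidate, method = "jw", p = 0.1)
```

Running this on a handful of real phrase pairs is a quick way to see which method ranks the intended match highest before committing to one.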

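To reduce clutter, I use this wrapper function, which returns the index of the most similar element (the one with the highest similarity value) for a given string,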

library(stringdist) 
library(data.table) 

best_match <- function(x, y, method = "jw", ...) { 
    which.max(stringsim(x, y, method, ...)) 
}

and a data.table of the strings to match, with a dummy index added for by-row operations:

Dt <- data.table(
    MatchPhrase = df_c$CONTENT, 
    Idx = 1:nrow(df_c) 
)

Using best_match, add a column holding the index of the best match (and drop the dummy Idx column afterwards),

Dt[, MatchIdx := best_match(df_b$CONTENT, MatchPhrase), 
    by = "Idx"][,Idx := NULL]

and extract the corresponding elements of df_b (I renamed your data from b and c to df_b and df_c, respectively):

Dt[, .(Score = df_b$score[MatchIdx], 
     BestMatch = df_b$CONTENT[MatchIdx]), 
    by = "MatchPhrase"] 
#                MatchPhrase Score 
#1:  The best ever for workout over to myntra like if you find it good 7.080 
#2: i finalised buy a top myntra and found the at in feel like i so in life 6.676 

#                                  BestMatch 
#1: The best ever Puma wishlist for Workout freaks, Head over to @myntra https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good 
#2:   I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!
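Because which.max always returns some index, even a poor match yields a score. As a possible extension, not part of the approach above, the sketch below also keeps the similarity value so that weak matches can be discarded. It uses base R only, on toy data standing in for df_b and df_c. The helper match_one and the cutoff min_similarity are hypothetical names chosen for illustration, and 0.8 is an arbitrary value that would need tuning on real data.

```r
library(stringdist)

# Toy data standing in for df_b and df_c (hypothetical, not the question's data)
df_b <- data.frame(
  CONTENT = c("apple pie recipe", "quarterly sales report", "weather forecast today"),
  score   = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)
df_c <- data.frame(
  CONTENT = c("apple pie recip", "wether forecast today"),
  stringsAsFactors = FALSE
)

# For one phrase, return the best-matching row of `candidates` and its similarity
match_one <- function(phrase, candidates, method = "jw", ...) {
  sims <- stringsim(phrase, candidates$CONTENT, method = method, ...)
  idx  <- which.max(sims)
  data.frame(
    MatchPhrase = phrase,
    BestMatch   = candidates$CONTENT[idx],
    Score       = candidates$score[idx],
    Similarity  = sims[idx],
    stringsAsFactors = FALSE
  )
}

# One row per phrase in df_c
result <- do.call(rbind, lapply(df_c$CONTENT, match_one, candidates = df_b))

# Hypothetical cutoff: treat anything below it as "no good match"
min_similarity <- 0.8
result$Score[result$Similarity < min_similarity] <- NA
print(result)
```

Inspecting the Similarity column on the real data is probably the easiest way to pick a sensible cutoff.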

2016-03-05 16:22:04 nrussell

Thanks a lot for the detailed explanation, nrussell.... going to explore this now – LeArNr

Thanks a lot, nrussell... It works perfectly with the sample set. I will look further into applying this to my actual data set. Thanks again. – LeArNr

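@nrussell ....I did read up on Jaro distance.... and found it very interesting.... thanks for introducing me to fuzzy matching algorithms.... I never knew about them...... they would have been very helpful to me earlier. – LeArNr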
