Closest match of sentences between two data frames in R
I have two data frames. The first is saved in an object named b:
structure(list(CONTENT = c("@myntra beautiful teamä»ç where is the winners list?",
"The best ever Puma wishlist for Workout freaks, Head over to @myntra https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good",
"I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!",
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym. https://t.co/VeRy4G3c7X https://t.co/fOpBRWCdSh",
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym. https://t.co/VeRy4G3c7X.....",
"@DrDrupad @myntra #myPUMAcollection superb :)", "Super exclusive collection @myntra #myPUMAcollection https://t.co/Qm9dZzJdms",
"@myntra gave my best Love playing wid u Hope to win #myPUMAcollection",
"Check out PUMA Unisex Black Running Performance Gloves on Myntra! https://t.co/YD6IcvuG98 @myntra #myPUMAcollection",
"@myntra i have been mailing my issue daily since past week.All i get in reply is an auto generated assurance mail. 1st time pissed wd myntra"
), score = c(7.129, 7.08, 6.676, 5.572, 5.572, 5.535, 5.424,
5.205, 4.464, 4.245)), .Names = c("CONTENT", "score"), row.names = c(25L,
103L, 95L, 66L, 90L, 75L, 107L, 32L, 184L, 2L), class = "data.frame")
The second data frame is saved in an object named c:
structure(list(CONTENT = c("The best ever for workout over to myntra like if you find it good",
"i finalised buy a top myntra and found the at in feel like i so in life"
)), .Names = "CONTENT", row.names = c(103L, 95L), class = "data.frame")
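For each sentence in the second data frame (c), I want to find the closest match in the first data frame (b) and return the corresponding score from b.
For example, the sentence "The best ever for workout over to myntra like if you find it good" closely matches the second sentence in b, so I should get back the score 7.080.
I tried adapting some code from Stack Overflow: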
library(stringr)     # provides str_split()
library(data.table)
cp <- str_split(c$CONTENT, " ")   # split each sentence in c into words
nn <- lengths(cp) ## Or, for < R-3.2.0, `nn <- sapply(cp, length)`
dt <- data.table(grp=rep(seq_along(nn), times=nn), X = unlist(cp), key="grp")
dt[,Score:=b$score[pmatch(X,b$CONTENT)]]   # look up a score for each word via partial matching
dt[!is.na(Score), list(avgScore=sum(Score)), by="grp"]
This returns a value for only one of the sentences in c. Can anyone help?
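A likely reason for the single result, shown as a small sketch: `pmatch` only succeeds when a word is an exact match or a unique prefix of a whole string in the table, and it is case-sensitive. The `tweets` and `words` vectors below are shortened, hypothetical stand-ins for b$CONTENT and the split words, not the real data.

```r
# Hypothetical, shortened stand-ins for b$CONTENT.
tweets <- c("The best ever Puma wishlist",
            "Check out #myPUMAcollection on @Myntra",
            "Check out PUMA Unisex Black Running Performance Gloves")

# Single words, as produced by splitting a sentence on spaces.
words <- c("The", "best", "Check")

# "The" is a unique prefix of the first tweet, so it matches.
# "best" is not a prefix of any tweet, so it gives NA.
# "Check" is a prefix of two tweets, so it is ambiguous and gives NA.
hits <- pmatch(words, tweets)
print(hits)
```

So a sentence only picks up a score when one of its words happens to be a unique prefix of a tweet, which is not the same thing as the sentence being close to that tweet.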
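Are you committed to this `str_split` / `pmatch` approach for determining the best match for a given phrase? There are proper fuzzy matching algorithms for cases like this, and they would likely give better results. – nrussell
@nrussell Not really... if you could let me know what kind of fuzzy matching algorithm could be used – LeArNr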
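One simple option, sketched here in base R, is word-overlap matching: score each candidate tweet by the share of the query sentence's words that it contains, and take the candidate with the highest share. This is only one possible similarity measure and not necessarily what the commenter had in mind. The helper names (`tokens`, `overlap`, `closest_index`) and the trimmed data frames `b_small` and `c_small` are made up for this illustration.

```r
# Trimmed copies of the question's data: three rows of b, both rows of c.
b_small <- data.frame(
  CONTENT = c(
    "The best ever Puma wishlist for Workout freaks, Head over to @myntra https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good",
    "I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!",
    "@DrDrupad @myntra #myPUMAcollection superb :)"
  ),
  score = c(7.08, 6.676, 5.535),
  stringsAsFactors = FALSE
)
c_small <- data.frame(
  CONTENT = c(
    "The best ever for workout over to myntra like if you find it good",
    "i finalised buy a top myntra and found the at in feel like i so in life"
  ),
  stringsAsFactors = FALSE
)

# Lower-case a sentence and return its unique alphanumeric words.
tokens <- function(s) {
  w <- strsplit(tolower(s), "[^a-z0-9]+")[[1]]
  unique(w[nzchar(w)])
}

# Share of the query's words that also occur in the candidate (0 to 1).
overlap <- function(query, candidate) {
  mean(tokens(query) %in% tokens(candidate))
}

# For each query, the index of the candidate with the highest overlap.
# Ties go to the first candidate, because which.max() returns the first maximum.
closest_index <- function(queries, candidates) {
  vapply(queries, function(q) {
    shares <- vapply(candidates, function(s) overlap(q, s), numeric(1),
                     USE.NAMES = FALSE)
    which.max(shares)
  }, integer(1), USE.NAMES = FALSE)
}

idx <- closest_index(c_small$CONTENT, b_small$CONTENT)
result <- data.frame(CONTENT = c_small$CONTENT,
                     score = b_small$score[idx],
                     stringsAsFactors = FALSE)
print(result)
```

On this trimmed data, every word of each sentence in `c_small` appears in its source tweet, so the two sentences get the scores 7.080 and 6.676. Word overlap suits this case because the sentences in c look like the original tweets with some words removed; a plain edit distance would penalise the large difference in length.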