R - Comparing similar but not identical strings


I have a dataset containing sender names and/or sender email addresses.

sender_info = c('Kelvin [mailto:[email protected]]','Kelvin','Sheryl [mailto:[email protected]]','Sheryl <[email protected]>','Oscar')

I want to count the number of unique senders. From sender_info you can see that there are 3 unique senders: Kelvin, Sheryl and Oscar.

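I tried implementing a few approaches, but none of them worked. One used the levenshteinSim() function from the R RecordLinkage library to check how similar each pair of elements is. However, this approach fails when the elements differ too much (e.g. 'Kelvin [mailto:[email protected]]' and 'Kelvin').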
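To illustrate the failure mode described above, here is a minimal sketch that uses base R's adist() instead of RecordLinkage's levenshteinSim(). Normalizing by the length of the longer string is an assumption about how such a similarity score is usually computed, and the address is a hypothetical placeholder, since the real ones are redacted in the sample.

```r
# Hypothetical placeholder address; the real data is redacted.
a <- "Kelvin [mailto:kelvin@example.com]"
b <- "Kelvin"

# Levenshtein (edit) distance from base R's utils::adist()
d <- as.numeric(adist(a, b))

# Assumed normalization: 1 - distance / length of the longer string
sim <- 1 - d / max(nchar(a), nchar(b))
print(sim)
```

The score is low even though both entries refer to the same person, because most of the longer string is the email part rather than the name.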

I would really appreciate it if someone could give me a hint or two on how to solve this. Thanks!

Source

2017-07-04 OinkOink

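I would try to normalize your strings: split them into name and email, then compare. Take a look at the many regex / regular expression questions under the r tag for suggestions on extracting strings that match a pattern; try searching this site for "[r] [regex]". – thelatemail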
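One way to read the suggestion above, as a rough sketch only: pull the name and the email into separate columns and count unique names. The addresses are hypothetical placeholders (the originals are redacted), the email pattern is a deliberately loose assumption rather than a full address validator, and the variable names are made up for illustration.

```r
# Hypothetical placeholder addresses; the real data is redacted.
sender_info <- c("Kelvin [mailto:kelvin@example.com]", "Kelvin",
                 "Sheryl [mailto:sheryl@example.com]",
                 "Sheryl <sheryl@example.com>", "Oscar")

# Name: drop everything from the first "[" or "<" onwards
name <- sub("[[:space:]]*[[<].*$", "", sender_info)

# Email: first thing that looks like local@domain, NA when absent
m <- regexpr("[[:alnum:]._%+-]+@[[:alnum:].-]+", sender_info)
email <- rep(NA_character_, length(sender_info))
email[m > 0] <- regmatches(sender_info, m)

senders <- data.frame(name = name, email = email, stringsAsFactors = FALSE)

# Compare on a normalized (lower-cased) name
n_unique <- length(unique(tolower(senders$name)))
print(senders)
print(n_unique)
```

Keeping the email as its own column leaves room to match entries that carry only an address and no name, which this sample does not contain.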

'gsub("[][<>]|mailto:", "", sender_info)' as a starter to clean out the irrelevant bits. – thelatemail
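A quick sketch of what that gsub() call does, run on hypothetical placeholder addresses (the originals are redacted): it strips the square and angle brackets and the "mailto:" prefix, leaving the name followed by the bare address.

```r
# Hypothetical placeholder addresses; the real data is redacted.
sender_info <- c("Kelvin [mailto:kelvin@example.com]", "Kelvin",
                 "Sheryl [mailto:sheryl@example.com]",
                 "Sheryl <sheryl@example.com>", "Oscar")

# "[][<>]" is a character class matching "]", "[", "<" or ">";
# the alternative "mailto:" removes the URI prefix.
cleaned <- gsub("[][<>]|mailto:", "", sender_info)
print(cleaned)
```

After this step the two Sheryl entries become identical, so the remaining work is separating the name from the address.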

@thelatemail OK, I'll give it a try! Thanks! :) – OinkOink

If your data always has the same structure as the sample shown, this code will help:

sender_info = c('Kelvin [mailto:[email protected]]','Kelvin','Sheryl [mailto:[email protected]]','Sheryl <[email protected]>','Oscar')
# Split each entry on spaces and keep the first token (the name)
new_sender <- sapply(strsplit(sender_info, split = " "), "[[", 1)
unique(new_sender)
#[1] "Kelvin" "Sheryl" "Oscar"

Source

2017-07-04 04:28:53

Thank you very much! :) – OinkOink

An alternative to strsplit is str_split from stringr.

library(stringr) 
unique(str_split(sender_info, pattern = " ", simplify = TRUE)[,1]) 
# [1] "Kelvin" "Sheryl" "Oscar"

Source

2017-07-04 08:16:16 HNSKD
