2017-07-26 139 views
1

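Find all unique strings in R

I am fairly new to R. I have a data frame df that looks like this (a single character variable only... my actual df spans 100K+ rows, but for simplicity let's look at 5 rows):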

V1 
oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects 
angioedema chemically induced, angioedema chemically induced, oximetry 
abo blood group system, imipramine poisoning, adverse effects 
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy 
thrombosis drug therapy 

I would like to output every unique string, so that the result looks like this:

V1 
oximetry 
hydrogen peroxide adverse effects 
epoprostenol adverse effects 
angioedema chemically induced 
abo blood group system 
imipramine poisoning 
adverse effects 
isoenzymes 
myocardial infarction drug therapy 
thrombosis drug therapy 

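Should I be using the tm package? I tried using a dtm (document-term matrix), but my code is inefficient because it converts the dtm to a matrix, which needs a lot of memory for 100K+ rows.

Please advise. Thanks!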

Answers

3

Try this:

library(stringr) 
library(tidyverse) 

df <- data.frame(variable = c(
'oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects', 
'angioedema chemically induced, angioedema chemically induced, oximetry', 
'abo blood group system, imipramine poisoning, adverse effects', 
'isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy', 
'thrombosis drug therapy'), stringsAsFactors=FALSE) 

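# Split each row on ", ", expand the list column to one term per row, 
# then drop duplicate rows 
df %>% 
    mutate(variable = str_split(variable, ', ')) %>% 
    unnest(variable) %>% 
    distinct() 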
+0

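For some reason the distinct() function didn't work, but your code did give me every comma-separated string as output, so I was able to use another snippet to reduce it down to the unique ones: 'unique <- mutate(df, var = str_split(V1, ", ")) %>% unnest() %>% distinct(); unique <- subset(unique, !duplicated(var))' – sweetmusicality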

+1

Your code is very fast, so I am impressed with 'tidyverse' and 'stringr'! – sweetmusicality

+0

Hmm.. I'm not sure why 'distinct()' didn't work... –
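One plausible explanation, offered as a sketch rather than a confirmed diagnosis: the snippet in the comment writes the split terms to a new column (var) while keeping the original V1 column, and distinct() with no arguments compares whole rows. Two rows with the same term but a different V1 then still count as distinct. The two-row df below is made up for illustration and is not the asker's data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical two-row example: the same two terms in a different order
df <- data.frame(V1 = c("oximetry, adverse effects",
                        "adverse effects, oximetry"),
                 stringsAsFactors = FALSE)

# V1 is kept next to var, so every (V1, var) pair is unique
# and distinct() removes nothing
kept <- df %>%
  mutate(var = str_split(V1, ", ")) %>%
  unnest(var) %>%
  distinct()
nrow(kept)

# Restricting distinct() to the split column removes the repeated terms
only_var <- df %>%
  mutate(var = str_split(V1, ", ")) %>%
  unnest(var) %>%
  distinct(var)
nrow(only_var)
```

Overwriting the original column, as the answer does with variable, avoids the problem because no second column is left to make rows differ.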

1

Using only base R, you can use strsplit() to split your big string at every "comma + space" or "\n". Then use unique() to return only the unique strings:

text_vec <- c("oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects 
angioedema chemically induced, angioedema chemically induced, oximetry 
abo blood group system, imipramine poisoning, adverse effects 
isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy 
thrombosis drug therapy") 

# Split on ", " or a newline; [[1]] extracts the character vector from the list 
strsplit(text_vec, ", |\\n")[[1]] 
#  [1] "oximetry"                           "hydrogen peroxide adverse effects" 
#  [3] "epoprostenol adverse effects"       "angioedema chemically induced" 
#  [5] "angioedema chemically induced"      "oximetry" 
#  [7] "abo blood group system"             "imipramine poisoning" 
#  [9] "adverse effects"                    "isoenzymes" 
# [11] "myocardial infarction drug therapy" "thrombosis drug therapy" 
# [13] "thrombosis drug therapy" 

unique(strsplit(text_vec, ", |\\n")[[1]]) 
# [1] "oximetry"                           "hydrogen peroxide adverse effects" 
# [3] "epoprostenol adverse effects"       "angioedema chemically induced" 
# [5] "abo blood group system"             "imipramine poisoning" 
# [7] "adverse effects"                    "isoenzymes" 
# [9] "myocardial infarction drug therapy" "thrombosis drug therapy"
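The example above works on one long string. If the data is already in a data frame column, as in the question, the same base R idea can be applied to the column directly. This is a minimal sketch, assuming the column is named V1 as in the question and that terms are separated by commas; the names terms and result are illustrative.

```r
# The five example rows from the question, in a column named V1
df <- data.frame(V1 = c(
  "oximetry, hydrogen peroxide adverse effects, epoprostenol adverse effects",
  "angioedema chemically induced, angioedema chemically induced, oximetry",
  "abo blood group system, imipramine poisoning, adverse effects",
  "isoenzymes, myocardial infarction drug therapy, thrombosis drug therapy",
  "thrombosis drug therapy"), stringsAsFactors = FALSE)

# strsplit() returns one character vector per row; unlist() flattens them
terms <- unlist(strsplit(df$V1, ",", fixed = TRUE))

# Strip the space left after each comma (and any stray trailing whitespace)
terms <- trimws(terms)

# Keep the first occurrence of each term, in order of appearance
result <- data.frame(V1 = unique(terms), stringsAsFactors = FALSE)
result
```

This only ever holds a character vector of the split terms in memory, so it avoids building a document-term matrix.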