2017-02-21

NLP in R: identify and replace words (synonyms)

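I have a problem with my R code. I have a dataset (questions) with 4 columns and more than 600K observations; one of the columns is named 'V3'. This column contains questions such as 'What is the day today?'. I have a second dataset (voc) with 2 columns, one named "word" and the other named "synonyms". If a word from the "synonyms" column of the second dataset (voc) appears in my first dataset (questions), I want to replace it with the corresponding word from the "word" column.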

questions = cbind(V3=c("What is the day today?","Tom has brown eyes")) 
questions <- data.frame(questions) 

                      V3
1 What is the day today?
2     Tom has brown eyes

voc = cbind(word=c("weather", "a","blue"),synonyms=c("day", "the", "brown")) 
voc <- data.frame(voc) 

     word synonyms
1 weather      day
2       a      the
3    blue    brown

Desired output 

                      V3                       V5
1 What is the day today? What is a weather today?
2     Tom has brown eyes        Tom has blue eyes

I wrote some simple code, but it doesn't work.

for (k in 1:nrow(question)) 
{ 
    for (i in 1:nrow(voc)) 
    { 
     question$V5<- gsub(do.call(rbind,strsplit(question$V3[k]," "))[which (do.call(rbind,strsplit(question$V3[k]," "))== voc[i,2])], voc[i,1], question$V3) 
    } 
} 

Maybe someone could try to help me? :)

I wrote a second version, but it doesn't really work either.

for(i in 1:nrow(questions)) 
{ 
    for(j in 1:nrow(voc)) 
     { 
     if (grepl(voc[j,k],do.call(rbind,strsplit(questions[i,]," "))) == TRUE) 
     { 
      new=matrix(gsub(do.call(rbind,strsplit(questions[i,]," "))[which(do.call(rbind,strsplit(questions[i,]," "))== voc[j,2])], voc[j,1], questions[i,])) 
      questions[i,]=new 
     } 
    } 
    questions = cbind(questions,c(new)) 
} 

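Your question is unlikely to attract answers as it stands. Please provide some sample data (the first few rows of the data frames involved); an example of the desired output would also be good.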


OK! :) Thanks for the suggestion.

Answers


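First, it is very important that you use the stringsAsFactors = FALSE option, either at the program level or during your data import. This is because R defaults to turning strings into factors unless told otherwise. Factors are very useful in modelling, but you want to analyse the text itself, so you should make sure your text isn't coerced to factors.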
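To make the difference concrete, here is a minimal sketch (the variable names `raw`, `as_factor` and `as_text` are just for illustration). Note that the factor default applies to R versions before 4.0.0; from 4.0.0 onwards `stringsAsFactors` defaults to FALSE, so setting it explicitly mainly matters on older installations or for code that has to run on both.

```r
# Illustrative names only; the sentences are the sample data from the question
raw <- c("What is the day today?", "Tom has brown eyes")

# Request each behaviour explicitly so the result does not depend on the R version
as_factor <- data.frame(V3 = raw, stringsAsFactors = TRUE)
as_text   <- data.frame(V3 = raw, stringsAsFactors = FALSE)

class(as_factor$V3)  # "factor"
class(as_text$V3)    # "character"

# strsplit() only accepts character vectors, so the factor column raises an error
factor_fails <- inherits(try(strsplit(as_factor$V3, " "), silent = TRUE), "try-error")

# The character column splits into words as expected
words <- strsplit(as_text$V3, " ")
```

If you are stuck with a factor column, wrapping it in as.character() before splitting should get you the same result.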

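The way I approached this was to write a function that "explodes" each string into a vector of words, then uses matching to replace those words. The vector is then recombined into a single string.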
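As a small worked example of that explode, match, recombine idea on a single sentence (a sketch using the sample vocabulary from the question; `idx` and `result` are names I made up here):

```r
# Sample vocabulary from the question
words_to_repl <- c("day", "the", "brown")
repl_words    <- c("weather", "a", "blue")

# 1. Explode the sentence into a vector of words
orig_words <- strsplit("Tom has brown eyes", split = " ")[[1]]

# 2. Look up each word in the synonym vector; nomatch = 0 gives 0 for words
#    that are not synonyms, and a 0 index selects nothing from repl_words
idx <- match(orig_words, words_to_repl, nomatch = 0)

# 3. Overwrite only the matched positions with their replacements
orig_words[idx > 0] <- repl_words[idx]

# 4. Recombine the words into one string
result <- paste(orig_words, collapse = " ")
```

Because the split is on spaces only, a word with punctuation attached (such as "today?") is compared as-is and will not match a plain "today" in the vocabulary.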

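I don't know how this will perform on your 600K records. You might look at some of the R packages that handle strings, such as stringr or stringi, since they may have functions that do some of this for you. match tends to be fine on speed, but %in% can be a real beast, depending on the length of the strings and other factors.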

# Start with options to make sure strings are represented correctly 
# The rest is your original code (mildly tidied to my own standard) 
options(stringsAsFactors = FALSE) 
questions <- cbind(V3 = c("What is the day today?","Tom has brown eyes")) 
questions <- data.frame(questions) 

voc <- cbind(word = c("weather","a","blue"), 
             synonyms = c("day","the","brown")) 
voc <- data.frame(voc) 

# This function takes: 
# - an input string 
# - a vector of words to replace 
# - a vector of the words to use as replacements 
# It returns a list of the original input and the changed version  
uFunc_FindAndReplace <- function(input_string,words_to_repl,repl_words) { 

    # Start by breaking the input string into a vector 
    # Note that we use [[1]] to get first list element of strsplit output 
    # Obviously this relies on breaking sentences by spacing 
    orig_words <- strsplit(x = input_string,split = " ")[[1]] 

    # If we find at least one of the words to replace in the original words, proceed 
    if(sum(orig_words %in% words_to_repl) > 0) { 

        # The left side selects the elements of orig_words that match words to be replaced 
        # The right side uses match to find the numeric index of those words within the words_to_repl vector 
        # (nomatch = 0 gives index 0 for words that are not synonyms, and a 0 index selects nothing) 
        # This numeric vector is used to select the values from repl_words 
        # These then replace the values in orig_words 
        orig_words[orig_words %in% words_to_repl] <- repl_words[match(x = orig_words,table = words_to_repl,nomatch = 0)] 

        # We rebuild the sentence again, and return a list with original and new version 
        new_sent <- paste(orig_words,collapse = " ") 
        return(list(original = input_string,new = new_sent)) 
    } else { 

        # Otherwise we return the original version since no changes are needed 
        return(list(original = input_string,new = input_string)) 
    } 
} 

# Using do.call and rbind.data.frame, we can collapse the output of a lapply() 

do.call(what = rbind.data.frame, 
        args = lapply(X = questions$V3, 
                      FUN = uFunc_FindAndReplace, 
                      words_to_repl = voc$synonyms, 
                      repl_words = voc$word)) 

> 
                original                      new
1 What is the day today? What is a weather today?
2     Tom has brown eyes        Tom has blue eyes
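If the per-row lapply() turns out to be slow on 600K rows, one possible alternative is to split all sentences at once and call match() a single time over every word. This is only a sketch in base R, not something I have timed on data of that size; `replace_synonyms` and its argument names are hypothetical, and it assumes words are separated by single spaces and that there are no NA values in the input.

```r
# Hypothetical helper: replace synonyms across a whole character vector at once
replace_synonyms <- function(sentences, synonyms, words) {
    # Split every sentence on spaces (a list with one word vector per sentence)
    tokens <- strsplit(sentences, " ", fixed = TRUE)

    # Remember which sentence each word came from, then flatten to one vector
    sentence_id <- rep(seq_along(tokens), lengths(tokens))
    flat <- unlist(tokens)

    # A single match() call over all words; NA means "not a synonym"
    idx <- match(flat, synonyms)
    hit <- !is.na(idx)
    flat[hit] <- words[idx[hit]]

    # Regroup the words by sentence and paste each group back together
    groups <- split(flat, factor(sentence_id, levels = seq_along(tokens)))
    vapply(groups, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

new_sentences <- replace_synonyms(
    sentences = c("What is the day today?", "Tom has brown eyes"),
    synonyms  = c("day", "the", "brown"),
    words     = c("weather", "a", "blue")
)
```

Since the function returns a plain character vector in the original row order, the V5 column from the question could be filled with something like questions$V5 <- replace_synonyms(questions$V3, voc$synonyms, voc$word).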

Great work! Thank you very much :) It works fine on my large dataset.