2012-08-13 90 views
5

我有一個充滿不適當間隔句子的數據集。我試圖想出一種方法來刪除一些空間。在R中進行迭代拼寫檢查單詞向量

我開始與我轉換成詞的數據幀中的一句話:

> word5 <- "hotter the doghou se would be bec ause the co lor was diffe rent" 
> abc1 <- data.frame(filler1 = 1,words1=factor(unlist(strsplit(word5, split=" ")))) 
> abc1 
    filler1 words1 
1  1 hotter 
2  1 the 
3  1 doghou 
4  1  se 
5  1 would 
6  1  be 
7  1 bec 
8  1 ause 
9  1 the 
10  1  co 
11  1 lor 
12  1 was 
13  1 diffe 
14  1 rent 

接下來,我用下面的代碼來嘗試和拼寫檢查,並結合是之前或之後的詞的組合詞其中:

abc2 <- abc1 
i <- 1 
while(i < nrow(abc1)){ 
    print(abc2) 
    if(nrow(aspell(abc1$words1[i])) == 0){ 
    print(paste(i,"Words OK",sep=" | "));flush.console() 
    i <- i + 1 
    } 
else{ 
    if(nrow(aspell(abc1$words1[i])) > 0 & i != 1){ 
    preWord1 <- abc1$words1[i-1] 
    postWord1 <- abc1$words1[i+1] 
    badWord1 <- abc1$words1[i] 
    newWord1 <- factor(paste(preWord1,badWord1,sep="")) 
    newWord2 <- factor(paste(badWord1,postWord1,sep="")) 

    if(nrow(aspell(newWord1)) == 0 & nrow(aspell(newWord2)) != 0){ 
     abc2[i,"words1"] <-as.character(newWord1) 
     abc2 <- abc2[-c(i+1),] 
     print(paste(i,"word1",sep=" | "));flush.console() 
     i <- i + 1 
    } 

    if(nrow(aspell(newWord1)) != 0 & nrow(aspell(newWord2)) == 0){ 
     abc2[i ,"words1"] <-as.character(newWord2) 
     abc2 <- abc2[-c(i-1),] 
     print(paste(i,"word2",sep=" | "));flush.console() 
     i <- i + 1 
    } 

    } 
} 
} 

玩這個了一段時間我來,我需要某種類型的迭代器,但我不確定如何實現它在R.任何建議結束後?

+0

你能告訴我們如何不起作用?我認爲你可能正在尋找安樂死或樂隊的功能。如果你定義了你自己的函數,然後執行'lapply(abc1 $ words1,yourFunctionNameHere)',它會遍歷'adc1 $ words1'中的每個元素,並用作爲參數傳遞的那個元素調用你的函數。如果還有其他參數傳遞給該函數,則可以將這些參數傳遞給函數名稱 – 2012-08-13 09:16:21

回答

10

注意:我想出了一個完全不同的,更好的解決方案,因爲它避免了以前的解決方案的所有缺點。但我仍然想保留舊的解決方案。因此,我將其添加爲新答案,如果我錯了,請糾正我的錯誤。

在這種方法中,我重新格式化數據集了一下。基地就是我所說的一個字對對象。例如:

> word5 
[1] "hotter the doghou se would be bec ause the col or was diffe rent" 

會是什麼樣子:

> abc1_pairs 
    word1 word2 
1 hotter the 
2  the doghou 
3 doghou  se 
4  se would 
5 would  be 
6  be bec 
7  bec ause 
8 ause the 
9  the col 
10 col  or 
11  or was 
12 was diffe 
13 diffe rent 

接下來我們遍歷wordpairs,看看它們是有效的文字本身,遞歸地做下去,直到沒有找到有效的新詞(注意,

# Recursively delete wordpairs which lead to a correct word 
merge_wordpairs = function(wordpairs) { 
    require(plyr) 
    merged_pairs = as.character(mlply(wordpairs, merge_word)) 
    correct_words_idxs = which(sapply(merged_pairs, word_correct)) 
    if(length(correct_words_idxs) == 0) { 
    return(wordpairs) 
    } else { 
    message(sprintf("Number of words about to be merged in this pass: %s", length(correct_words_idxs))) 
    for(idx in correct_words_idxs) { 
     wordpairs = merge_specific_pair(wordpairs, idx, delete_pair = FALSE) 
    } 
    return(merge_wordpairs(wordpairs[-correct_words_idxs,])) # recursive call 
    } 
} 

施加到示例數據集,這將導致:一些額外的功能在這篇文章的底部)中列出

> word5 <- "hotter the doghou se would be bec ause the col or was diffe rent" 
> abc1 = strsplit(word5, split = " ")[[1]] 
> abc1_pairs = wordlist2wordpairs(abc1) 
> abc1_pairs 
    word1 word2 
1 hotter the 
2  the doghou 
3 doghou  se 
4  se would 
5 would  be 
6  be bec 
7  bec ause 
8 ause the 
9  the col 
10 col  or 
11  or was 
12 was diffe 
13 diffe rent 
> abc1_merged_pairs = merge_wordpairs(abc1_pairs) 
Number of words about to be merged in this pass: 4 
> merged_sentence = paste(wordpairs2wordlist(abc1_merged_pairs), collapse = " ") 
> c(word5, merged_sentence) 
[1] "hotter the doghou se would be bec ause the col or was diffe rent" 
[2] "hotter the doghouse would be because the color was different"  

附加功能需要:

# A bunch of functions 
# Data transformation 
wordlist2wordpairs = function(word_list) { 
    require(plyr) 
    wordpairs = ldply(seq_len(length(word_list) - 1), 
        function(idx) 
         return(c(word_list[idx], 
           word_list[idx+1]))) 
    names(wordpairs) = c("word1", "word2") 
    return(wordpairs) 
} 
wordpairs2wordlist = function(wordpairs) { 
    return(c(wordpairs[[1]], wordpairs[[2]][nrow(wordpairs)])) 
} 

# Some checking functions 
# Is the word correct? 
word_correct = function(word) return(nrow(aspell(factor(word))) == 0) 
# Merge two words 
merge_word = function(word1, word2) return(paste(word1, word2, sep = "")) 

# Merge a specific pair, option to postpone deletion of pair 
merge_specific_pair = function(wordpairs, idx, delete_pair = TRUE) { 
    # merge pair into word 
    merged_word = do.call("merge_word", wordpairs[idx,]) 
    # assign the pair to the idx above 
    if(!(idx == 1)) wordpairs[idx - 1, "word2"] = merged_word 
    if(!(idx == nrow(wordpairs))) wordpairs[idx + 1, "word1"] = merged_word 
    # assign the pair to the index below (if not last one) 
    if(delete_pair) wordpairs = wordpairs[-idx,] 
    return(wordpairs) 
} 
+0

@screechOwl,這個解決方案是如何在你的數據集上執行的?速度是否可以接受?任何其他錯誤? – 2012-08-14 16:35:13

3

你可以做的是使用遞歸。下面的代碼需要你的示例稍微修改一下。它檢查所有單詞是否正確,如果是,則返回單詞列表。如果不是,它會嘗試將該單詞與前面的單詞以及後面的單詞相結合。如果上述單詞的合併是正確的,則導致合併,看起來像paste(word_before, word, word_after)。在嘗試合併之後,在新單詞列表上調用合併單詞的功能。這種遞歸繼續下去,直到沒有錯誤的單詞。

# Wrap the spell checking in a function, makes your code much more readable 
word_correct = function(word) return(nrow(aspell(factor(word))) == 0) 
# Merge two words 
merge_word = function(word1, word2) return(paste(word1, word2, sep = "")) 
# Merge two words and replace in list 
merge_words_in_list = function(word_list, idx1, idx2) { 
    word_list[idx1] = merge_word(word_list[idx1], word_list[idx2]) 
    return(word_list[-idx2]) 
} 
# Function that recursively combines words 
combine_words = function(word_list) { 
    message("Current sentence: ", paste(word_list, collapse = " ")) 
    words_ok = sapply(word_list, word_correct) 
    if(all(words_ok)) { 
    return(word_list) 
    } else { 
    first_wrong_word = which(!words_ok)[1] 
    combination_before = merge_word(word_list[first_wrong_word], 
            word_list[first_wrong_word-1]) 
    if(word_correct(combination_before)) { 
     word_list = merge_words_in_list(word_list, first_wrong_word-1, 
             first_wrong_word) 
    } 
    combination_after = merge_word(word_list[first_wrong_word], 
            word_list[first_wrong_word+1]) 
    if(word_correct(combination_after)) { 
     word_list = merge_words_in_list(word_list, first_wrong_word, 
             first_wrong_word+1) 
    } 
    return(combine_words(word_list)) # Recursive call 
    } 
} 

運用這套功能(略有修改)版本的一句話:

word5 <- "hotter the doghou se would be bec ause the col or was diffe rent" 
abc1 = strsplit(word5, split = " ")[[1]] 
combine_words(abc1) 
Current sentence: hotter the doghou se would be bec ause the col or was diffe rent 
Current sentence: hotter the doghouse would be bec ause the col or was diffe rent 
Current sentence: hotter the doghouse would be because the col or was diffe rent 
Current sentence: hotter the doghouse would be because the col or was different 

的一些問題:

  • 還有問題,如果同時combination_beforecombination_after是無效的,程序就會陷入無限遞歸。只有當所有單詞都有效時,程序纔會停止。
  • 如果兩者合併在一起,而下一個單詞有效,該怎麼辦?
  • 該代碼僅合併錯誤的詞,例如'col'和'or'被aspell判斷爲好詞,你可能想合併。這導致了一個新的挑戰:在這種情況下,合併是顯而易見的,但在大型數據集中,如何將一組本身,正確的單詞組合起來可能並不明顯。

但是,我認爲這個例子很好地說明了遞歸方法。

+0

這太棒了!非常感謝你。這些問題都是可以生存的。數據是非常糟糕的,所以即使跳過兩個組合詞語都行不通的情景,也是朝正確方向邁出的一大步。 – screechOwl 2012-08-13 12:56:39

+0

我添加了一個新的答案,沒有這個解決方案的缺點,我認爲它應該更快。 – 2012-08-13 19:20:32