2017-05-08 60 views

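Edit address strings by adding missing words in R

I have a list of addresses that are not fully formatted. Most of them share the same basic structure, but roughly one in five was not entered correctly.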

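df1 contains 24 addresses, each stored as a single string. My goal is to find the addresses that appear to be missing words or numbers and to add those to each string at the position where they most likely belong.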

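My approach is to count how many times each unique word or number occurs in the data frame. Words that occur in 80% or more of the rows are flagged as words that need to be added to every address. Any missing word then has to be inserted at the "correct" position, based on the format of the addresses that contain all address elements.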

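I can identify the words that need adding, but I have not found a way to add a word to each string only when it is absent, nor a way to make sure it ends up in the right position within the string. This is further complicated because in my real dataset the address format is not constant across areas: in this example the building number and road name are the third and fourth address elements, but sometimes they will be the first, second, third, and so on. So the solution I have been trying to develop also needs to be dynamic.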

Here is my sample dataset:

df1 <- data.frame(V1=c("apt 23 5 roadname cityville b11abc", "apt 47 5 roadname cityville b11abc", "apt 24 roadname cityville b11abc", "apt 3 roadname cityville b11abc", "apt 44 5 roadname cityville b11abc", "apt 88 5 roadname cityville b11abc", "apt 7 5 roadname cityville b11abc", "apt 41 5 roadname cityville b11abc", "apt 55 5 roadname cityville b11abc", "apt 19 5 roadname cityville b11abc", "85 5 roadname cityville b11abc", "apt 12 roadname cityville b11abc", "apt 452 5 roadname cityville b11abc", "apt 1 5 roadname cityville b11abc", "99 5 roadname cityville b11abc", "apt 73 5 roadname cityville b11abc", "74 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt 63 5 roadname cityville b11abc", "apt 48 5 roadname cityville b11abc", "apt 123 5 roadname cityville b11abc", "apt 56 5 roadname cityville b11abc", "6 5 roadname cityville b11abc", "apt 2 6 roadname cityville b11abc"), stringsAsFactors = F) 

This is the approach I use to identify the words that need to be added:

## Count how often each word occurs across all addresses
df1_words <- as.data.frame(table(unlist(strsplit(df1$V1, " "))))
## Keep the words whose count reaches 80% of the number of rows
df1_words_80 <- subset(df1_words, Freq >= round(nrow(df1)/100*80))
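The snippet above counts total occurrences, so a word that appears twice in one address is counted twice. If the intent is "present in 80% or more of the rows", a row-based count may be closer to it. The following is only a minimal sketch under that assumption; `addresses`, `words_per_row`, `row_counts`, `row_share` and `frequent_words` are illustrative names, and the five addresses are a made-up subset in the style of df1.

```r
# Hypothetical, shortened sample in the style of df1
addresses <- c("apt 23 5 roadname cityville b11abc",
               "apt 24 roadname cityville b11abc",
               "85 5 roadname cityville b11abc",
               "apt 7 5 roadname cityville b11abc",
               "apt 5 5 roadname cityville b11abc")

# Split each address into words and drop duplicates within a row,
# so that each word is counted at most once per address
words_per_row <- lapply(strsplit(addresses, " ", fixed = TRUE), unique)

# Number of rows in which each word occurs
row_counts <- table(unlist(words_per_row))

# Share of rows containing each word, and the words at or above 80%
row_share <- row_counts / length(addresses)
frequent_words <- names(row_share)[row_share >= 0.8]
```

In this sample the last address contains "5" twice, yet it contributes only one row to the count for "5".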

This is the output I am after:

df2 <- data.frame(V1=c("apt 23 5 roadname cityville b11abc", "apt 47 5 roadname cityville b11abc", "apt 24 5 roadname cityville b11abc", "apt 3 5 roadname cityville b11abc", "apt 44 5 roadname cityville b11abc", "apt 88 5 roadname cityville b11abc", "apt 7 5 roadname cityville b11abc", "apt 41 5 roadname cityville b11abc", "apt 55 5 roadname cityville b11abc", "apt 19 5 roadname cityville b11abc", "apt 85 5 roadname cityville b11abc", "apt 12 5 roadname cityville b11abc", "apt 452 5 roadname cityville b11abc", "apt 1 5 roadname cityville b11abc", "apt 99 5 roadname cityville b11abc", "apt 73 5 roadname cityville b11abc", "apt 74 5 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt 63 5 roadname cityville b11abc", "apt 48 5 roadname cityville b11abc", "apt 123 5 roadname cityville b11abc", "apt 56 5 roadname cityville b11abc", "apt 6 5 roadname cityville b11abc", "apt 2 6 roadname cityville b11abc"), stringsAsFactors = F) 

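Edit: After applying ikop's solution to a real dataset I ran into a problem when the list contains addresses of different lengths. I think the problem is that some short addresses (for example, ones containing 5 words) have frequent words inserted into them that are normally found at positions 6, 7, 8, 9 and so on, which is not possible and therefore produces an error. I can think of two solutions: counting backward instead of forward, or, perhaps the simpler option (and the one I think best suits my particular needs), simply ignoring rows that contain very short strings.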

The problem I ran into can be reproduced with df3 and ikop's solution:

df3 <- data.frame(V1=c("apt really long name 23 5 roadname cityville b11abc", "apt really long name 47 5 roadname cityville b11abc", "apt really long name 24 roadname cityville b11abc", "apt 3 roadname cityville b11abc", "apt really long name 44 5 roadname cityville b11abc", "apt really long name 88 5 roadname cityville b11abc", "apt really long name 7 5 roadname cityville b11abc", "apt really long name 41 5 roadname cityville b11abc", "apt really long name 55 5 roadname cityville b11abc", "apt really long name 19 5 roadname cityville b11abc", "85 5 roadname cityville b11abc", "apt really long name 12 roadname cityville b11abc", "apt really long name 452 5 roadname cityville b11abc", "apt really long name 1 5 roadname cityville b11abc", "99 5 roadname cityville b11abc", "apt really long name 73 5 roadname cityville b11abc", "74 roadname cityville b11abc", "apt 75 5 roadname cityville b11abc", "apt really long name 63 5 roadname cityville b11abc", "apt really long name 48 5 roadname cityville b11abc", "apt really long name 123 5 roadname cityville b11abc", "apt really long name 56 5 roadname cityville b11abc", "6 5 roadname cityville b11abc", "apt really long name 2 6 roadname cityville b11abc"), stringsAsFactors = F) 
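The second option described above, leaving rows alone when they are too short, could be sketched roughly as follows. This is an illustration rather than a tested fix for the real dataset: `insert_if_missing` is a hypothetical helper, and it assumes that a row with fewer words than the target position should simply be returned unchanged. Checking the length first avoids indexing past the end of the word vector, which yields `NA` and makes an `if` comparison fail.

```r
# Hypothetical helper: insert `word` at position `pos` of the word vector `x`
# unless it is already there. Rows with fewer than `pos` words are returned
# unchanged, which avoids comparing against a missing (NA) element.
insert_if_missing <- function(x, word, pos) {
  if (length(x) < pos) {
    return(x)  # too short: leave the address as it is
  }
  if (x[pos] == word) {
    return(x)  # word is already in the expected position
  }
  append(x, word, after = pos - 1)  # base R: insert before element `pos`
}

short_row <- c("74", "roadname", "cityville", "b11abc")
long_row  <- c("apt", "really", "long", "name", "24",
               "roadname", "cityville", "b11abc")

short_result <- insert_if_missing(short_row, "5", 6)  # returned unchanged
long_result  <- insert_if_missing(long_row, "5", 6)   # "5" goes before "roadname"
```

A helper like this could stand in for the body of the function passed to `lapply` in the answer's insertion loop, called with the current word and its most frequent position.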

Answer


Here is a hacky solution that will get you most of the way there.

## For each word that appears in at least 80% of the rows compute
## the most frequent position it appears in:
library(dplyr)
splitList <- strsplit(df1$V1, " ")
wordVec <- unique(unlist(splitList))
wordFrequencyDf <- lapply(wordVec, function(theWord){
    freqWord <- sum(unlist(splitList) == theWord)
    posVec <- unlist(lapply(splitList, function(x) which(x == theWord)))
    mostFreqPos <- sort(table(posVec), decreasing = TRUE)[1] %>% names %>% as.numeric
    data.frame(theWord, freqWord, mostFreqPos)
  }) %>%
  do.call('rbind', .) %>%
  dplyr::mutate(theWord = as.character(theWord)) %>%
  dplyr::filter(freqWord >= round(nrow(df1) * 0.8)) %>%
  dplyr::arrange(mostFreqPos)

## Now loop over those words and insert the word in the relevant 
## position if necessary: 
for (ii in seq_along(wordFrequencyDf$theWord)){
    splitList <- lapply(splitList, function(x){
        relPos <- wordFrequencyDf$mostFreqPos[ii]
        if (x[relPos] != wordFrequencyDf$theWord[ii]){
            if (relPos == 1){
                strBefore <- NULL
            } else {
                strBefore <- x[1:(relPos-1)]
            }
            if (relPos > length(x)){
                strAfter <- NULL
            } else {
                strAfter <- x[relPos:length(x)]
            }
            x <- c(strBefore, wordFrequencyDf$theWord[ii], strAfter)
        }
        x
    })
}

## Paste list together into a single string again: 
df2 <- data.frame(V1 = sapply(splitList, function(x) paste(x, collapse = " "))) 

Result:

df2 
#                V1 
# 1        apt 23 5 roadname cityville b11abc 
# 2        apt 47 5 roadname cityville b11abc 
# 3        apt 24 5 roadname cityville b11abc 
# 4        apt 3 5 roadname cityville b11abc 
# 5        apt 44 5 roadname cityville b11abc 
# 6        apt 88 5 roadname cityville b11abc 
# 7        apt 7 5 roadname cityville b11abc 
# 8        apt 41 5 roadname cityville b11abc 
# 9        apt 55 5 roadname cityville b11abc 
# 10       apt 19 5 roadname cityville b11abc 
# 11       apt 85 5 roadname cityville b11abc 
# 12       apt 12 5 roadname cityville b11abc 
# 13       apt 452 5 roadname cityville b11abc 
# 14        apt 1 5 roadname cityville b11abc 
# 15       apt 99 5 roadname cityville b11abc 
# 16       apt 73 5 roadname cityville b11abc 
# 17       apt 74 5 roadname cityville b11abc 
# 18       apt 75 5 roadname cityville b11abc 
# 19       apt 63 5 roadname cityville b11abc 
# 20       apt 48 5 roadname cityville b11abc 
# 21       apt 123 5 roadname cityville b11abc 
# 22       apt 56 5 roadname cityville b11abc 
# 23        apt 6 5 roadname cityville b11abc 
# 24 apt 2 5 roadname cityville b11abc 6 roadname cityville b11abc 

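As you can see, the approach fails on the last row. The original row does not have "5" at position 3 (as the code expects). The problem is that the building number is not actually missing; the string simply contains a different building number. The code, however, interprets this as a missing building number and inserts "5" at position 3. Every later frequent word then no longer sits at its expected position either, so each one is inserted again, which produces the duplicated tail in row 24.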
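One possible way around the last-row failure is to treat a different number in a numeric slot as a variant rather than a gap. The sketch below rests on that assumption, which is not part of the original approach and would misfire on, say, a road name made up of digits. `is_number` and `insert_unless_variant` are hypothetical helpers, and rows shorter than the target position are left unchanged here as a simplification.

```r
# Hypothetical check: TRUE when a token consists of digits only
is_number <- function(token) grepl("^[0-9]+$", token)

# Insert `word` at `pos` unless the slot already holds it, or holds another
# number where a number is expected (treated as a variant, not a gap)
insert_unless_variant <- function(x, word, pos) {
  if (length(x) < pos) return(x)          # too short: leave unchanged
  current <- x[pos]
  if (current == word) return(x)          # already in place
  if (is_number(word) && is_number(current)) {
    return(x)                             # e.g. building number "6" instead of "5"
  }
  append(x, word, after = pos - 1)
}

variant_row <- c("apt", "2", "6", "roadname", "cityville", "b11abc")
missing_row <- c("apt", "24", "roadname", "cityville", "b11abc")

variant_result <- insert_unless_variant(variant_row, "5", 3)  # kept as is
missing_result <- insert_unless_variant(missing_row, "5", 3)  # "5" inserted
```

Because "apt" is processed before "5" (the words are ordered by position), a row such as "85 5 roadname cityville b11abc" would first gain "apt" and then already have "5" at position 3.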


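Many thanks for this. I have applied it to my real dataset and it works almost all of the time. However, a problem arises when the address list contains strings of different lengths. I tried to edit your solution so that it essentially ignores rows containing very short strings, but without much luck. I have added an example to the question that demonstrates the error I am running into. – Chris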