2017-07-24

Split address string into city, state and address in R

I have addresses as strings in the following form:

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
           "1626 Aviation Way, Augusta, GA 30906, USA", 
           "325 Main St, Stratford, CT 06615, USA", 
           "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE) 

I want to split it into 5 columns: street, city, state, ZIP code, and country. How can I do this in R?

Take a look at 'strsplit' or 'regexpr'. – ekstroem

Or, if you are working with a data frame, you can use the 'separate()' function from 'tidyr'. –
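A minimal sketch of the 'separate()' route from the comment above, assuming the tidyr package is installed. The intermediate column name StateZip is made up for illustration, and only two of the sample addresses are used to keep it short.

```r
library(tidyr)

dat <- data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
                                "325 Main St, Stratford, CT 06615, USA"),
                  stringsAsFactors = FALSE)

# First split on a comma plus any following whitespace into four pieces.
# "StateZip" is a temporary, hypothetical column name.
parts <- separate(dat, Addresses,
                  into = c("Street", "City", "StateZip", "Country"),
                  sep = ",\\s*")

# Then split the combined "NM 30906" piece on the space.
result <- separate(parts, StateZip, into = c("State", "Zip"), sep = " ")
print(result)
```

The ZIP codes stay character strings here, so a leading zero such as in "06615" is preserved.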

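I tried this <- strsplit($ Adress, ","). I did not get the right answer. This is the error I get when I try to write the result into a data frame: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 4, 5 – Kaushik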

Answers

1

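This ended up taking quite a few steps. You could do it in fewer, but this is how I did it. I am also assuming your data starts out in a data frame with one address per row.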

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
       "1626 Aviation Way, Augusta, GA 30906, USA", 
       "325 Main St, Stratford, CT 06615, USA", 
       "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE) 

> dat
                                       Addresses
1  1626 Aviation Way, Albuquerque, NM 30906, USA
2      1626 Aviation Way, Augusta, GA 30906, USA
3          325 Main St, Stratford, CT 06615, USA
4 4205 Bessie Coleman Blvd, Tampa, FL 33607, USA

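First we split on the commas; the state and ZIP get separated later. I also strip the extra whitespace that splitting on the commas leaves behind.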

dat2 = sapply(dat$Addresses, strsplit, ",") 
dat2 = lapply(dat2, trimws) 

> dat2 
$`1626 Aviation Way, Albuquerque, NM 30906, USA` 
[1] "1626 Aviation Way" "Albuquerque"  "NM 30906"   "USA"    

$`1626 Aviation Way, Augusta, GA 30906, USA` 
[1] "1626 Aviation Way" "Augusta"   "GA 30906"   "USA"    

$`325 Main St, Stratford, CT 06615, USA` 
[1] "325 Main St" "Stratford" "CT 06615" "USA"   

$`4205 Bessie Coleman Blvd, Tampa, FL 33607, USA` 
[1] "4205 Bessie Coleman Blvd" "Tampa"     "FL 33607"     "USA"  

Now we need to put it back into a data frame.

dat2 = data.frame(matrix(unlist(dat2), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE) 

> dat2
                        X1          X2       X3  X4
1        1626 Aviation Way Albuquerque NM 30906 USA
2        1626 Aviation Way     Augusta GA 30906 USA
3              325 Main St   Stratford CT 06615 USA
4 4205 Bessie Coleman Blvd       Tampa FL 33607 USA

Next, we can split X3 into state and ZIP, and then drop that column.

dat2$State = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][1]) 
dat2$Zip = sapply(dat2$X3, function(x) strsplit(x, " ")[[1]][2]) 

dat2 = dat2[, -3] 

> dat2
                        X1          X2  X4 State   Zip
1        1626 Aviation Way Albuquerque USA    NM 30906
2        1626 Aviation Way     Augusta USA    GA 30906
3              325 Main St   Stratford USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa USA    FL 33607

Finally, we can set the column names and we are done.

colnames(dat2) = c("Street", "City", "Country", "State", "Zip") 
> dat2
                    Street        City Country State   Zip
1        1626 Aviation Way Albuquerque     USA    NM 30906
2        1626 Aviation Way     Augusta     USA    GA 30906
3              325 Main St   Stratford     USA    CT 06615
4 4205 Bessie Coleman Blvd       Tampa     USA    FL 33607
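
As noted at the start of this answer, the same result can be reached in fewer steps. A rough base R sketch, not the code from the answer itself: it splits on a comma plus optional whitespace so no trimming is needed, and it assumes every address has exactly four comma-separated pieces.

```r
dat <- data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
                                "1626 Aviation Way, Augusta, GA 30906, USA",
                                "325 Main St, Stratford, CT 06615, USA",
                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"),
                  stringsAsFactors = FALSE)

# Split on a comma followed by optional whitespace, so no trimws() is needed.
parts <- strsplit(dat$Addresses, ",\\s*")

# Assumption of this sketch: every address yields exactly four pieces.
stopifnot(all(lengths(parts) == 4))

# Stack the pieces into a 4-column character matrix, one row per address.
m <- do.call(rbind, parts)

# Split the "NM 30906" piece into state and ZIP on the space.
state_zip <- do.call(rbind, strsplit(m[, 3], " ", fixed = TRUE))

result <- data.frame(Street = m[, 1], City = m[, 2],
                     State = state_zip[, 1], Zip = state_zip[, 2],
                     Country = m[, 4], stringsAsFactors = FALSE)
print(result)
```

The explicit length check stops early with an error when an address does not fit the four-piece layout, instead of producing a misaligned matrix.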

@kristoferesen, I get the following when I run the data frame command: "Warning message: In matrix(unlist(dat2), ncol = 4, byrow = TRUE) : data length [413] is not a sub-multiple or multiple of the number of rows [0124]" – Kaushik

@Kaushik Does your data look just like mine right before it gets turned back into a data frame? – Kristofersen
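
One way to run that check is to count how many pieces each address splits into, since the matrix() step only lines up when every address yields four. The sketch below uses a made-up second address with an extra comma (hypothetical data, not from the question). Whether this is what actually happened in the asker's data is a guess.

```r
addresses <- c("1626 Aviation Way, Augusta, GA 30906, USA",
               # Hypothetical address with an extra comma in the street part.
               "325 Main St, Suite 4, Stratford, CT 06615, USA")

parts <- lapply(strsplit(addresses, ","), trimws)

# Number of comma-separated pieces per address; matrix(..., ncol = 4) needs 4 each.
n_parts <- lengths(parts)
print(n_parts)

# Positions of the addresses that do not split into exactly four pieces.
bad <- which(n_parts != 4)
print(addresses[bad])
```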

@Kaushik Make sure you include 'stringsAsFactors = FALSE' in the original data frame. Otherwise the addresses will be factors and strsplit will not work. – Kristofersen
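
A short sketch of the factor problem described above. Before R 4.0.0, data.frame() converted strings to factors by default; here that old behaviour is forced with stringsAsFactors = TRUE so the example behaves the same on any version.

```r
# Force the pre-4.0.0 default so the column becomes a factor.
dat <- data.frame(Addresses = c("325 Main St, Stratford, CT 06615, USA"),
                  stringsAsFactors = TRUE)

is_factor <- is.factor(dat$Addresses)

# strsplit() rejects factors; capture the error message instead of stopping.
err <- tryCatch(strsplit(dat$Addresses, ","),
                error = function(e) conditionMessage(e))
print(err)

# Converting with as.character() makes the column usable again.
parts <- trimws(strsplit(as.character(dat$Addresses), ",")[[1]])
print(parts)
```

If the data frame already exists with a factor column, the as.character() conversion is an alternative to rebuilding it.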

1

I solved it with one line of code. It may look a bit naive to regex experts, but it works for the sample data.

library(stringr) 

dat = data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA", 
           "1626 Aviation Way, Augusta, GA 30906, USA", 
           "325 Main St, Stratford, CT 06615, USA", 
           "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"), stringsAsFactors = FALSE) 

str_match(dat$Addresses,"(.+), (.+), (.+) (.+), (.+)")[ ,-1] 
     [,1]                       [,2]          [,3] [,4]    [,5]
[1,] "1626 Aviation Way"        "Albuquerque" "NM" "30906" "USA"
[2,] "1626 Aviation Way"        "Augusta"     "GA" "30906" "USA"
[3,] "325 Main St"              "Stratford"   "CT" "06615" "USA"
[4,] "4205 Bessie Coleman Blvd" "Tampa"       "FL" "33607" "USA"
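
For a data frame with the five named columns, the same pattern can also be applied without stringr. The sketch below swaps in base R's regexec() and regmatches() for str_match(); it is an alternative to the answer's call, not the same one, and it assumes every address matches the pattern.

```r
dat <- data.frame(Addresses = c("1626 Aviation Way, Albuquerque, NM 30906, USA",
                                "1626 Aviation Way, Augusta, GA 30906, USA",
                                "325 Main St, Stratford, CT 06615, USA",
                                "4205 Bessie Coleman Blvd, Tampa, FL 33607, USA"),
                  stringsAsFactors = FALSE)

pattern <- "(.+), (.+), (.+) (.+), (.+)"

# regexec() locates the full match and the capture groups;
# regmatches() extracts them as one character vector per address.
m <- regmatches(dat$Addresses, regexec(pattern, dat$Addresses))

# Each element is c(full match, group 1, ..., group 5); drop the full match.
fields <- do.call(rbind, lapply(m, function(x) x[-1]))

result <- as.data.frame(fields, stringsAsFactors = FALSE)
names(result) <- c("Street", "City", "State", "Zip", "Country")
print(result)
```

An address that does not match the pattern returns an empty vector from regmatches(), so real data would need a check before the rbind step.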