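R: Merge 2 data frames and apply reference data to all rows matching one level

I have two data frames: one ("grny") that is mostly a reference but has some data in the "yield" column that I'm after, and another ("txie") that holds a small amount of data with some values missing. I want to merge them so that all cells are filled in for rows that share a value in "site".
The one with mostly year-by-year data: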
txie <- data.frame(site = c(rep("smithfield", 2), rep("belleville", 3)),
                   yield = c(rnorm(4, mean = 8), NA),
                   year = c(1999:2000, 1992:1994),
                   prim = c(rep("nt", 2), rep(NA, 3)))
The one that is mostly reference, with year-by-year yield data for some sites:
grny <- data.frame(site = c("smithfield", "belleville", rep("nashua", 3)),
                   yield = c(rep(NA, 2), rnorm(3, mean = 9)),
                   year = c(rep(NA, 2), 1990:1992),
                   prim = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
                   lat = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)))
What I want:
site yield year prim lib lat
1 smithfield 7.009178 1999 nt 1109 43.61828
2 smithfield 8.472677 2000 nt 1109 43.61828
3 belleville 8.857462 1992 nt 122 74.08792
4 belleville 7.368488 1993 nt 122 74.08792
5 belleville NA 1994 nt 122 74.08792
6 nashua 7.494519 1990 nt 554 49.10000
8 nashua 8.696066 1991 ct 554 49.10000
9 nashua 8.051670 1992 nt 554 49.10000
What I've tried:
rbind.fill(txie,grny) #this appends rows to the correct columns but leaves NA's everywhere because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x,y) merge(txie,grny, by="site", all.y=TRUE), list(txie,grny)) #this merges by rows but creates new variables from x and y.
merge(x = txie, y = grny, by = "site", all = TRUE) #this does the same as the above (new variables from each x and y ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE)#this does similar to above but merges based on the x df (new variables from each x and y ending in .x or .y)
setkey(setDT(grny),site)[txie]# this gives a similar result to the all.x line
For example, with the outer-join merge I end up with:
site yield.x year.x prim.x yield.y year.y prim.y lat
1 belleville 6.766628 1992 <NA> NA NA nt 34.92136
2 belleville 6.845789 1993 <NA> NA NA nt 34.92136
3 belleville NA 1994 <NA> NA NA nt 34.92136
4 smithfield 8.841339 1999 nt NA NA <NA> 49.81872
5 smithfield 7.313310 2000 nt NA NA <NA> 49.81872
6 nashua NA NA <NA> 9.173229 1990 ct 49.10000
7 nashua NA NA <NA> 9.196018 1991 nt 49.10000
8 nashua NA NA <NA> 7.336645 1992 ct 49.10000
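The suffixed pairs in that output can in principle be collapsed back into single columns by taking the txie value (`.x`) where it is present and falling back to the grny value (`.y`) otherwise. The following is only a sketch under assumptions: `coalesce2` is a hypothetical helper name, `set.seed` and `stringsAsFactors = FALSE` are added purely so the example is reproducible, and the "lib" column from the desired output is left out because neither data frame defines it.

```r
set.seed(42)  # assumption: seed added only for reproducibility
txie <- data.frame(site = c(rep("smithfield", 2), rep("belleville", 3)),
                   yield = c(rnorm(4, mean = 8), NA),
                   year = c(1999:2000, 1992:1994),
                   prim = c(rep("nt", 2), rep(NA, 3)),
                   stringsAsFactors = FALSE)
grny <- data.frame(site = c("smithfield", "belleville", rep("nashua", 3)),
                   yield = c(rep(NA, 2), rnorm(3, mean = 9)),
                   year = c(rep(NA, 2), 1990:1992),
                   prim = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
                   lat = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)),
                   stringsAsFactors = FALSE)

# Outer join on site: shared non-key columns come back as .x (txie) / .y (grny)
m <- merge(txie, grny, by = "site", all = TRUE)

# Hypothetical helper: keep x where present, otherwise fall back to y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# Collapse each .x/.y pair into one column and drop the suffixed originals
for (col in c("yield", "year", "prim")) {
  m[[col]] <- coalesce2(m[[paste0(col, ".x")]], m[[paste0(col, ".y")]])
  m[[paste0(col, ".x")]] <- NULL
  m[[paste0(col, ".y")]] <- NULL
}
merged <- m[, c("site", "yield", "year", "prim", "lat")]
print(merged)
```

Note that this fallback would also fill a missing txie yield if the matching grny row carried one. In the sample data those grny rows have NA in yield, so the existing NA survives.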
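Stipulation: I want to keep the NAs that are already in the "yield" column (e.g. belleville in 1994). Any answers, or can someone point me to an example of this kind of merge (where the data already sits in one or more shared columns that you are not merging on, rather than each df bringing in new columns apart from the "by" variable)?
Thanks!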
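Am I wrong in saying that you should merge not on site alone, but on the combination site x year? –
The example may be confusing, but no, it can be kept simple: site alone is enough, because I won't be adding multiple years for the same site – Anomie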
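Taking the comment above at face value (site alone is the key), one possible way to avoid the `.x`/`.y` columns altogether is to stack the rows that hold observations and then look up the reference values by site with `match()`. This is a minimal sketch, not a definitive answer. It assumes that the first grny row for each site carries that site's reference values (true for the sample data), `obs` and `idx` are illustrative names, and `set.seed` and `stringsAsFactors = FALSE` are added only for reproducibility.

```r
set.seed(42)  # assumption: seed added only for reproducibility
txie <- data.frame(site = c(rep("smithfield", 2), rep("belleville", 3)),
                   yield = c(rnorm(4, mean = 8), NA),
                   year = c(1999:2000, 1992:1994),
                   prim = c(rep("nt", 2), rep(NA, 3)),
                   stringsAsFactors = FALSE)
grny <- data.frame(site = c("smithfield", "belleville", rep("nashua", 3)),
                   yield = c(rep(NA, 2), rnorm(3, mean = 9)),
                   year = c(rep(NA, 2), 1990:1992),
                   prim = c(NA, "nt", sample(c("nt", "ct"), 3, replace = TRUE)),
                   lat = c(rnorm(2, mean = 45, sd = 10), rep(49.1, 3)),
                   stringsAsFactors = FALSE)

# 1. Stack the observation rows: all of txie plus the grny rows that have a year
obs <- rbind(txie, grny[!is.na(grny$year), names(txie)])
rownames(obs) <- NULL

# 2. For every row, find the first grny row with the same site
#    (assumption: that row holds the site-level reference values)
idx <- match(obs$site, grny$site)

# 3. Bring in the reference-only column and fill prim only where it is missing
obs$lat  <- grny$lat[idx]
obs$prim <- ifelse(is.na(obs$prim), grny$prim[idx], obs$prim)

print(obs)
```

Because yield is never touched after the rows are stacked, an NA that was already in "yield" stays NA.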