2016-02-26 104 views

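R - Fill in missing information based on paired data

I am trying to fill in missing cases using information from a related individual (the data are couples).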

My data looks like this:

hserial sex age  children 
1 1001041 Male 30   Yes 
2 1001041 Female 32   Yes 
3 1001061 Male 22   No 
4 1001061 Female 21   No 
5 1001091 Male 38   Yes 
6 1001091 Female 37   Yes 
7 1001151 Male 31   No 
8 1001151 Female 27 Not eligible 
9 1001161 Male 33   Yes 
10 1001161 Female 35   Yes 

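Here hserial is the couple identifier. Row 8 has a missing case, coded Not eligible, but the information is available from the partner (row 7).

I would like to find a neat way to fill in these missing values with the partner's information.

My idea was to do something like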

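library(dplyr) 
library(tidyr)  # spread() comes from tidyr, not dplyr

# one row per couple, with a Male and a Female column holding `children`
childsum = dta %>% group_by(hserial, sex, children) %>% 
summarise(n = n()) %>% spread(sex, children) 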

which would give me

hserial n Male  Female 
1 1001041 1 Yes   Yes 
2 1001061 1 No   No 
3 1001091 1 Yes   Yes 
4 1001151 1 No Not eligible 
5 1001161 1 Yes   Yes 

Then I could do something like

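# as.character() is needed because ifelse() on factors returns integer codes
childsum$Male = ifelse(childsum$Male == 'Not eligible', as.character(childsum$Female), as.character(childsum$Male)) 
childsum$Female = ifelse(childsum$Female == 'Not eligible', childsum$Male, as.character(childsum$Female)) 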

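So every missing Male entry is filled in with the Female information, and vice versa. Then I would merge the result back into the original data to get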

hserial sex age  children 
1 1001041 Male 30   Yes 
2 1001041 Female 32   Yes 
3 1001061 Male 22   No 
4 1001061 Female 21   No 
5 1001091 Male 38   Yes 
6 1001091 Female 37   Yes 
7 1001151 Male 31   No 
8 1001151 Female 27   No 
9 1001161 Male 33   Yes 
10 1001161 Female 35   Yes 

I am not sure how to do that last step. Is there a neat way to do this?

dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061, 
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female" 
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27, 
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L, 
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer", 
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial", 
"sex", "age", "children"), row.names = c(NA, -10L)) 

Answer


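Here is an approach that assumes any couple (the two rows sharing an hserial) should always have the same Yes/No children record, unless both individuals have a Not eligible entry. For each couple it therefore computes the setdiff of the available children information and Not eligible. If all (both) entries are Not eligible, it returns NA, because I think that is a better way to handle missing values: many dedicated functions work with NA, and you cannot use them the same way on Not eligible entries.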
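To make the per-couple step easier to follow, here is a minimal sketch of what happens inside a single group. The vectors `couple_a` and `couple_b` are made-up examples, not rows taken from `dta`, and the explicit `as.character()` is an added precaution so the result does not depend on how `setdiff` coerces factors.

```r
# Two hypothetical couples, stored as factors like `children` in the question
couple_a <- factor(c("No", "Not eligible"))            # one partner has an answer
couple_b <- factor(c("Not eligible", "Not eligible"))  # neither partner has one

# Drop the missing code; what is left is the answer to share within the couple
usable <- setdiff(as.character(couple_a), "Not eligible")

# When every entry is the missing code there is nothing to borrow
nothing_to_borrow <- all(couple_b == "Not eligible")

print(usable)
print(nothing_to_borrow)
```

Inside `mutate()`, the single value left after `setdiff` is recycled across both rows of the couple, which is what fills the gap. The grouped pipeline applies this same logic to each hserial.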

dta %>% 
    group_by(hserial) %>% 
    mutate(children = if(all(children == "Not eligible")) NA_character_ else 
         setdiff(children, "Not eligible")) 
#Source: local data frame [10 x 4] 
#Groups: hserial [5] 
# 
# hserial sex age children 
#  (dbl) (fctr) (dbl) (chr) 
#1 1001041 Male 30  Yes 
#2 1001041 Female 32  Yes 
#3 1001061 Male 22  No 
#4 1001061 Female 21  No 
#5 1001091 Male 38  Yes 
#6 1001091 Female 37  Yes 
#7 1001151 Male 31  No 
#8 1001151 Female 27  No 
#9 1001161 Male 33  Yes 
#10 1001161 Female 35  Yes
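
For comparison, the same idea can be sketched in base R with `ave()`, without dplyr. This is only an illustration under stated assumptions: the helper name `fill_from_partner` is made up, the toy data stores `children` as character rather than factor, and the couple with hserial 1001999 is invented to show the all-missing case. Unlike the pipeline above, this variant leaves a couple untouched when the partners give two different answers.

```r
# Toy data in the shape of the question; the 1001999 couple is invented
toy <- data.frame(
  hserial  = c(1001041, 1001041, 1001151, 1001151, 1001999, 1001999),
  sex      = c("Male", "Female", "Male", "Female", "Male", "Female"),
  children = c("Yes", "Yes", "No", "Not eligible",
               "Not eligible", "Not eligible"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill one couple's answers from whichever partner has one
fill_from_partner <- function(x, missing_code = "Not eligible") {
  known <- setdiff(x, missing_code)   # distinct answers other than the missing code
  if (length(known) == 0) {
    return(rep(NA_character_, length(x)))   # nobody answered: mark as NA
  }
  if (length(known) > 1) {
    return(x)                               # partners disagree: leave as is
  }
  rep(known, length(x))                     # one known answer: share it
}

# ave() applies the helper within each hserial and keeps the row order
toy$children_filled <- ave(toy$children, toy$hserial, FUN = fill_from_partner)
print(toy)
```

With the result stored as NA, the usual tools such as `is.na()` can be used to count or filter the couples for whom no information was available.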