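R - Fill in missing information based on paired data
I am trying to fill in missing cases using information from a related individual (the other partner in a couple).
My data looks like this: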
   hserial    sex age     children
1  1001041   Male  30          Yes
2  1001041 Female  32          Yes
3  1001061   Male  22           No
4  1001061 Female  21           No
5  1001091   Male  38          Yes
6  1001091 Female  37          Yes
7  1001151   Male  31           No
8  1001151 Female  27 Not eligible
9  1001161   Male  33          Yes
10 1001161 Female  35          Yes
So hserial is the couple identifier. Row 8 has a missing case ('Not eligible'), but the information is available from the partner (row 7).
I would like to find a neat way to fill in these missing values with the partner's information.
My idea was to do something like
library(dplyr)
library(tidyr)  # spread() comes from tidyr, not dplyr
childsum = dta %>% group_by(hserial, sex, children) %>%
  summarise(n = n()) %>% spread(sex, children)
which would give me
  hserial n Male       Female
1 1001041 1  Yes          Yes
2 1001061 1   No           No
3 1001091 1  Yes          Yes
4 1001151 1   No Not eligible
5 1001161 1  Yes          Yes
Then I could do something like
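# as.character() guards against ifelse() returning integer codes if these columns are factors
childsum$Male = ifelse(childsum$Male == 'Not eligible', as.character(childsum$Female), as.character(childsum$Male))
childsum$Female = ifelse(childsum$Female == 'Not eligible', as.character(childsum$Male), as.character(childsum$Female))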
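so that every missing Male value is filled in with the Female information, and vice versa. I would then merge the result back into the original data to get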
   hserial    sex age children
1  1001041   Male  30      Yes
2  1001041 Female  32      Yes
3  1001061   Male  22       No
4  1001061 Female  21       No
5  1001091   Male  38      Yes
6  1001091 Female  37      Yes
7  1001151   Male  31       No
8  1001151 Female  27       No
9  1001161   Male  33      Yes
10 1001161 Female  35      Yes
I am not sure how to do this. Is there a neat way?
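One direction that might avoid the reshape and merge altogether is to work within each couple directly. The following is only a sketch under assumptions: fill_from_partner and the small demo data frame are names made up for illustration, and it assumes that a complete couple has exactly two rows, so that reversing the pair yields the partner's answer. Groups with any other number of rows are left unchanged.

```r
library(dplyr)

# Hypothetical helper: within each couple (hserial), replace a
# "Not eligible" answer with the partner's answer.
# Assumes exactly two rows per complete couple; other groups are left as-is.
fill_from_partner <- function(df) {
  df %>%
    mutate(children = as.character(children)) %>%  # compare strings, not factor codes
    group_by(hserial) %>%
    mutate(children = if (n() == 2) {
      # rev() swaps the two rows, so position i holds the partner's value
      ifelse(children == "Not eligible", rev(children), children)
    } else {
      children
    }) %>%
    ungroup()
}

# Small demo using two of the couples from the question
demo <- data.frame(
  hserial  = c(1001061, 1001061, 1001151, 1001151),
  sex      = c("Male", "Female", "Male", "Female"),
  age      = c(22, 21, 31, 27),
  children = c("No", "No", "No", "Not eligible"),
  stringsAsFactors = FALSE
)
filled <- fill_from_partner(demo)
```

On the full data this would presumably be called as fill_from_partner(dta). Note that children comes back as a character column rather than a factor, and that if both partners are 'Not eligible' the value stays as it is.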
dta = structure(list(hserial = c(1001041, 1001041, 1001061, 1001061,
1001091, 1001091, 1001151, 1001151, 1001161, 1001161), sex = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("Male", "Female"
), class = "factor"), age = c(30, 32, 22, 21, 38, 37, 31, 27,
33, 35), children = structure(c(5L, 5L, 6L, 6L, 5L, 5L, 6L, 4L,
5L, 5L), .Label = c("DNA Does not apply", "NA No answer", "NA No answer",
"Not eligible", "Yes", "No"), class = "factor")), class = "data.frame", .Names = c("hserial",
"sex", "age", "children"), row.names = c(NA, -10L))