How to replace NA with the mean of a subset in R (imputation with plyr?)

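I have a data frame containing the lengths and widths of various arthropods from the guts of salamanders. Because some guts held thousands of prey items of certain types, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey type. I want to keep the data frame and just add imputed columns (length2, width2). The main reason is that each row also has columns with data on the date and location where the salamander was collected. I could fill in the NA values with a random selection of the measured individuals, but for the sake of argument let's assume I just want to replace each NA with the mean.

For example, imagine I have a data frame that looks like: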

id taxa  length width 
101 collembola 2.1  0.9 
102 mite  0.9  0.7 
103 mite  1.1  0.8 
104 collembola NA  NA 
105 collembola 1.5  0.5 
106 mite  NA  NA

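In reality I have many more columns, 25 different taxa, and about 30,000 prey items in total. It seems like the plyr package might be ideal for this, but I can't figure out how to do it. I'm not very R or programming savvy, but I'm trying to learn.

Not that I know what I'm doing, but I'll try to create a small dataset to play with, in case it helps.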

# Use NA (not the string "NA") so that length and width stay numeric
exampleDF <- data.frame(
    id = 1:100,
    taxa = c(rep("collembola", 50), rep("mite", 25), rep("ant", 25)),
    length = c(rnorm(40, 1, 0.5), rep(NA, 10), rnorm(20, 0.8, 0.1), rep(NA, 5),
               rnorm(20, 2.5, 0.5), rep(NA, 5)),
    width = c(rnorm(40, 0.5, 0.25), rep(NA, 10), rnorm(20, 0.3, 0.01), rep(NA, 5),
              rnorm(20, 1, 0.1), rep(NA, 5)))

Here are some things I've tried (that didn't work):

# mean imputation to recode NA in length and width with means
# (could do random imputation but unnecessary here)
mean.imp <- function(x) {
    missing <- is.na(x)
    n.missing <- sum(missing)
    x.obs <- x[!missing]   # observed (non-missing) values
    imputed <- x
    imputed[missing] <- mean(x.obs)
    return(imputed)
}

mean.imp(exampleDF[exampleDF$taxa == "collembola", "length"]) 

n.taxa <- length(unique(exampleDF$taxa)) 
for(i in 1:n.taxa) { 
    mean.imp(exampleDF[exampleDF$taxa == unique(exampleDF$taxa[i]), "length"]) 
} # no way to get back into dataframe in proper places, try plyr?

Another attempt:

imp.mean <- function(x) { 
    a <- mean(x, na.rm = TRUE) 
    return (ifelse (is.na(x) == TRUE , a, x)) 
} # tried but not sure how to use this in ddply 

Diet2 <- ddply(exampleDF, .(taxa), transform, length2 = function(x) { 
    a <- mean(exampleDF$length, na.rm = TRUE) 
    return (ifelse (is.na(exampleDF$length) == TRUE , a, exampleDF$length)) 
    })

Any suggestions, with plyr or not?

Source

2012-02-17 djhocking

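You should consider the package *mice* for imputing values. – 2012-02-17 04:51:24

The 'mi' package is also pretty good. 'Amelia' is much faster than either 'mice' or 'mi', but it does rely on your variables being multivariate normal – richiemorrisroe 2012-02-17 09:19:40

Not my own technique; I saw it on the boards a while back: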

dat <- read.table(text = "id taxa  length width 
101 collembola 2.1  0.9 
102 mite  0.9  0.7 
103 mite  1.1  0.8 
104 collembola NA  NA 
105 collembola 1.5  0.5 
106 mite  NA  NA", header=TRUE) 


library(plyr) 
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)) 
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length), 
    width = impute.mean(width)) 

dat2[order(dat2$id), ] #plyr orders by group so we have to reorder
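
The question asked to keep the original columns and add imputed ones (length2, width2). A minimal base-R sketch of that, assuming the same six-row dat and the impute.mean helper from above; ave() is a base R function swapped in here for illustration and is not part of the plyr approach. It returns values in the original row order, so no reordering is needed:

```r
# Same toy data as above
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header = TRUE)

# Replace NA in a vector with the mean of its non-missing values
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies FUN within each taxa group and keeps the row order;
# the original length and width columns are left untouched
dat$length2 <- ave(dat$length, dat$taxa, FUN = impute.mean)
dat$width2  <- ave(dat$width,  dat$taxa, FUN = impute.mean)

dat
```

The NA rows get the per-taxon mean in length2 and width2, while length and width still show which individuals were actually measured.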

EDIT: a non-plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) { 
    for (j in which(is.na(dat[, i]))) { 
     dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i], na.rm = TRUE) 
    } 
}

EDIT: many moons later, here is a data.table & dplyr approach:

data.table

library(data.table) 
setDT(dat) 

dat[, length := impute.mean(length), by = taxa][, 
    width := impute.mean(width), by = taxa]

dplyr

library(dplyr) 

dat %>% 
    group_by(taxa) %>% 
    mutate(
     length = impute.mean(length), 
     width = impute.mean(width) 
    )

Source

2012-02-17 04:38:15

@djhocking Thanks Hadley, I found where I stole it from: [(LINK)](http://www.mail-archive.com/[email protected]/msg58289.html) – 2012-02-17 05:35:16

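Before answering this, I want to say that I am a beginner in R, so please let me know if you feel my answer is wrong.

Code:

DF[is.na(DF$length), "length"] <- mean(na.omit(DF$length))

And apply the same for width.

DF stands for the name of the data.frame.

Thanks, Parthi

Source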

2015-09-02 14:10:14 parthiban

Expanding on @Tyler Rinker's solution, suppose features holds the names of the columns to impute. In this case features <- c('length', 'width'). Then the data.table solution becomes:

library(data.table) 
setDT(dat) 

dat[, (features) := lapply(.SD, impute.mean), by = taxa, .SDcols = features]
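
For comparison, the same "vector of column names" idea can be sketched without data.table. This is only an illustration under assumptions: it rebuilds the toy dat and impute.mean from the earlier answer, and the "2" suffix for the new columns follows the length2/width2 naming from the question rather than anything in this answer:

```r
# Toy data and helper, as in the earlier answer
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header = TRUE)

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Columns to impute
features <- c("length", "width")

# Impute each feature within taxa, keeping the original row order
imputed <- lapply(dat[features], function(col) {
    ave(col, dat$taxa, FUN = impute.mean)
})

# Store the results as new columns (length2, width2)
dat[paste0(features, "2")] <- imputed

dat
```

Adding another measurement column then only requires extending features.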

Source

2017-01-07 03:59:27
