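Missing value error when using bagImpute preprocessing in caret::train
I want to use caret::train to train a random forest model with a repeatedcv procedure. My data has some missing values, so I want to use the preProcess = "bagImpute" option in the train function. I don't want to call the preProcess function outside of train, because I want my data to be imputed within each iteration of the repeatedcv procedure. However, when I try to do this, an error is raised: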
Error in { : task 1 failed - "'n' must be a positive integer >= 'x'"
In addition: There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In eval(expr, envir, enclos) :
model fit failed for Fold01.Rep01: mtry=2 Error in na.fail.default(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, :
missing values in object
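As far as I can tell, the "missing values in object" text in that warning is the message that base R's na.fail() produces whenever it is handed an object containing NA, which suggests the un-imputed data reached a model-fitting call in each fold. A minimal sketch of that behavior, using a made-up two-column data frame (df and msg are illustrative names, not part of my real code):

```r
# Base R only: na.fail() refuses any object that contains NA.
df <- data.frame(Sepal.Length = c(5.1, NA, 4.7),
                 Sepal.Width  = c(3.5, 3.0, 3.2))

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    na.fail(df)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

This only shows where the message text comes from; it does not explain why the imputation step was skipped inside train.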
Below is a minimal reproducible example using the iris data. I borrowed the initial data-preparation code from Minkoo's website: http://mkseo.pe.kr/stats/?p=719. Many thanks, Minkoo!
library(caret)
library(rpart)  # provides rpart.control(), used in the train() call below

data(iris)
inTrain <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[inTrain, ]

# Set one randomly chosen predictor to NA in about 10% of the rows
fillInNa <- function(d) {
  naCount <- NROW(d) * 0.1
  for (i in sample(NROW(d), naCount)) {
    d[i, sample(4, 1)] <- NA
  }
  return(d)
}
training <- fillInNa(training)

tc <- trainControl("repeatedcv", number = 10, repeats = 30,
                   selectionFunction = "oneSE", returnData = TRUE,
                   classProbs = TRUE, preProcOptions = "bagImpute",
                   summaryFunction = multiClassSummary, savePredictions = TRUE)

training.x <- training[, 1:4]
training.y <- training[, 5]

# This is the call that fails with the error shown above
rfTri_Bag <- train(training.x, training.y,
                   method = "rf",
                   trControl = tc,
                   preProcess = c("bagImpute"),
                   tuneLength = 10,
                   control = rpart.control(usesurrogate = 0),
                   ntree = 250,
                   proximity = TRUE)
Edit: Here is my session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_UnitedStates.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] ipred_0.9-5 e1071_1.6-7 latticeExtra_0.6-28 RColorBrewer_1.1-2 randomForest_4.6-12 caret_6.0-71
[7] rpart_4.1-10 party_1.0-25 strucchange_1.5-1 sandwich_2.3-4 zoo_1.7-13 modeltools_0.2-21
[13] mvtnorm_1.0-5 gdata_2.17.0 DMwR_0.4.1 pROC_1.8 Metrics_0.1.1 raster_2.5-8
[19] sp_1.2-3 gridExtra_2.2.1 readr_1.0.0 tidyr_0.6.0 tibble_1.2 tidyverse_1.0.0
[25] MuMIn_1.15.6 merTools_0.2.2 devtools_1.12.0 plyr_1.8.4 arm_1.9-1 lattice_0.20-33
[31] MASS_7.3-45 xtable_1.8-2 lmerTest_2.0-32 lme4_1.1-12 Matrix_1.2-6 xlsx_0.5.7
[37] xlsxjars_0.6.1 rJava_0.9-8 AICcmodavg_2.0-4 pander_0.6.0 ggplot2_2.1.0 purrr_0.2.2
[43] dplyr_0.5.0 broom_0.4.1
loaded via a namespace (and not attached):
[1] TH.data_1.0-7 VGAM_1.0-2 minqa_1.2.4 colorspace_1.2-6 class_7.3-14 MatrixModels_0.4-1
[7] DT_0.2 prodlim_1.5.7 coin_1.1-2 codetools_0.2-14 splines_3.3.1 mnormt_1.5-4
[13] knitr_1.14 Formula_1.2-1 nloptr_1.0.4 pbkrtest_0.4-6 cluster_2.0.4 shiny_0.14
[19] compiler_3.3.1 httr_1.2.1 assertthat_0.1 lazyeval_0.2.0 acepack_1.3-3.3 htmltools_0.3.5
[25] quantreg_5.29 tools_3.3.1 coda_0.18-1 gtable_0.2.0 reshape2_1.4.1 Rcpp_0.12.7
[31] nlme_3.1-128 iterators_1.0.8 psych_1.6.6 stringr_1.1.0 mime_0.5 gtools_3.5.0
[37] scales_0.4.0 parallel_3.3.1 SparseM_1.7 yaml_2.1.13 quantmod_0.4-6 curl_1.2
[43] memoise_1.0.0 reshape_0.8.5 stringi_1.1.1 foreach_1.4.3 blme_1.0-4 TTR_0.23-1
[49] caTools_1.17.1 boot_1.3-18 lava_1.4.4 chron_2.3-47 bitops_1.0-6 evaluate_0.9
[55] ROCR_1.0-7 htmlwidgets_0.7 labeling_0.3 magrittr_1.5 R6_2.1.3 gplots_3.0.1
[61] Hmisc_3.17-4 multcomp_1.4-6 DBI_0.5 foreign_0.8-66 withr_1.0.2 mgcv_1.8-12
[67] xts_0.9-7 survival_2.39-4 abind_1.4-5 nnet_7.3-12 car_2.1-3 KernSmooth_2.23-15
[73] rmarkdown_1.0 data.table_1.9.6 git2r_0.15.0 digest_0.6.10 httpuv_1.3.3 munsell_0.4.3
[79] unmarked_0.11-0
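Edit 2: An almost identical question has already been asked here: https://stackoverflow.com/a/20081954/5617640. However, the answer given there only demonstrates how to predict from a preProcess() object outside of the train() function. As @Misconstruction pointed out in the comments, with that approach the imputation is "not included in the CV loop". My thoughts exactly.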
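To make concrete what I mean by keeping the imputation inside the CV loop, here is a rough sketch of doing it by hand with createFolds(), preProcess() and randomForest(). It is a manual workaround under my own assumptions (a single round of 5-fold CV, a fixed seed, plain accuracy as the metric, and names such as fold_acc and imp that I made up), not a description of what train() does internally, and it needs the caret, ipred and randomForest packages installed:

```r
library(caret)         # createFolds(), preProcess(); bagImpute relies on ipred
library(randomForest)

set.seed(1)            # arbitrary seed, only so the sketch is repeatable
data(iris)
x <- iris[, 1:4]
y <- iris$Species

# Knock out one predictor value in 15 rows (about 10% of the data).
na_rows <- sample(nrow(x), 15)
for (i in na_rows) x[i, sample(4, 1)] <- NA

# Each list element holds the TRAINING row indices for one fold.
folds <- createFolds(y, k = 5, returnTrain = TRUE)

fold_acc <- sapply(folds, function(train_idx) {
  x_train <- x[train_idx, ]
  x_test  <- x[-train_idx, ]

  # Fit the imputation model on this fold's training rows only,
  # then apply it to both the training and the held-out rows.
  imp <- preProcess(x_train, method = "bagImpute")
  x_train_imp <- predict(imp, x_train)
  x_test_imp  <- predict(imp, x_test)

  fit <- randomForest(x_train_imp, y[train_idx], ntree = 250)
  mean(predict(fit, x_test_imp) == y[-train_idx])  # held-out accuracy
})

print(fold_acc)
print(mean(fold_acc))
```

The point of the sketch is only that the held-out rows never influence the imputation model. What I am after is getting train() to do the equivalent for every repeatedcv resample and every mtry value.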
Where does 'multiClassSummary' come from? –
@ Hack-R https://github.com/topepo/caret/issues/107 – jlab