2016-09-21 123 views
1

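Missing values error when using bagImpute preprocessing within the caret::train function

I would like to use caret::train to train a random forest model with a repeatedcv procedure. My data has some missing values, so I would like to use the preProcess="bagImpute" option within the train function. I don't want to call the preProcess function outside of train, because I want my data to be imputed on each iteration of the repeatedcv procedure. However, when I attempt this, it throws an error: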

Error in { : task 1 failed - "'n' must be a positive integer >= 'x'" 
In addition: There were 50 or more warnings (use warnings() to see the first 50) 
> warnings() 
Warning messages: 
1: In eval(expr, envir, enclos) : 
    model fit failed for Fold01.Rep01: mtry=2 Error in na.fail.default(structure(list(Sepal.Length = c(5.1, 4.9, 4.7, : 
    missing values in object 

Below is a minimal reproducible example using the iris data. I borrowed the initial dataset-preparation code from Minkoo's website: http://mkseo.pe.kr/stats/?p=719. Many thanks, Minkoo!

library(caret) 
library(rpart)  # provides rpart.control(), used in the train() call below

data(iris) 
inTrain <- createDataPartition(iris$Species, p=0.8, list=FALSE) 
training <- iris[inTrain, ] 


# Set one randomly chosen predictor to NA in 10% of the rows
fillInNa <- function(d) { 
  naCount <- NROW(d) * 0.1 
  for (i in sample(NROW(d), naCount)) { 
    d[i, sample(4, 1)] <- NA 
  } 
  return(d) 
} 

training <- fillInNa(training) 

tc <- trainControl("repeatedcv", repeats = 30, selectionFunction = "oneSE", returnData = TRUE, 
                   classProbs = TRUE, number = 10, preProcOptions = "bagImpute", 
                   summaryFunction = multiClassSummary, savePredictions = TRUE) 

training.x<-training[,1:4] 
training.y<-training[,5] 

rfTri_Bag <- train(training.x, training.y, 
                   method = "rf", 
                   trControl = tc, 
                   preProcess = c("bagImpute"), 
                   tuneLength = 10, 
                   control = rpart.control(usesurrogate = 0), 
                   ntree = 250, 
                   proximity = TRUE) 

Edit: here is my session info:

R version 3.3.1 (2016-06-21) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 
Running under: Windows 7 x64 (build 7601) Service Pack 1 

locale: 
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 
[4] LC_NUMERIC=C LC_TIME=English_United States.1252  

attached base packages: 
[1] stats4 grid  stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] ipred_0.9-5   e1071_1.6-7   latticeExtra_0.6-28 RColorBrewer_1.1-2 randomForest_4.6-12 caret_6.0-71  
[7] rpart_4.1-10  party_1.0-25  strucchange_1.5-1 sandwich_2.3-4  zoo_1.7-13   modeltools_0.2-21 
[13] mvtnorm_1.0-5  gdata_2.17.0  DMwR_0.4.1   pROC_1.8   Metrics_0.1.1  raster_2.5-8  
[19] sp_1.2-3   gridExtra_2.2.1  readr_1.0.0   tidyr_0.6.0   tibble_1.2   tidyverse_1.0.0  
[25] MuMIn_1.15.6  merTools_0.2.2  devtools_1.12.0  plyr_1.8.4   arm_1.9-1   lattice_0.20-33  
[31] MASS_7.3-45   xtable_1.8-2  lmerTest_2.0-32  lme4_1.1-12   Matrix_1.2-6  xlsx_0.5.7   
[37] xlsxjars_0.6.1  rJava_0.9-8   AICcmodavg_2.0-4 pander_0.6.0  ggplot2_2.1.0  purrr_0.2.2   
[43] dplyr_0.5.0   broom_0.4.1   

loaded via a namespace (and not attached): 
[1] TH.data_1.0-7  VGAM_1.0-2   minqa_1.2.4  colorspace_1.2-6 class_7.3-14  MatrixModels_0.4-1 
[7] DT_0.2    prodlim_1.5.7  coin_1.1-2   codetools_0.2-14 splines_3.3.1  mnormt_1.5-4  
[13] knitr_1.14   Formula_1.2-1  nloptr_1.0.4  pbkrtest_0.4-6  cluster_2.0.4  shiny_0.14   
[19] compiler_3.3.1  httr_1.2.1   assertthat_0.1  lazyeval_0.2.0  acepack_1.3-3.3 htmltools_0.3.5 
[25] quantreg_5.29  tools_3.3.1  coda_0.18-1  gtable_0.2.0  reshape2_1.4.1  Rcpp_0.12.7  
[31] nlme_3.1-128  iterators_1.0.8 psych_1.6.6  stringr_1.1.0  mime_0.5   gtools_3.5.0  
[37] scales_0.4.0  parallel_3.3.1  SparseM_1.7  yaml_2.1.13  quantmod_0.4-6  curl_1.2   
[43] memoise_1.0.0  reshape_0.8.5  stringi_1.1.1  foreach_1.4.3  blme_1.0-4   TTR_0.23-1   
[49] caTools_1.17.1  boot_1.3-18  lava_1.4.4   chron_2.3-47  bitops_1.0-6  evaluate_0.9  
[55] ROCR_1.0-7   htmlwidgets_0.7 labeling_0.3  magrittr_1.5  R6_2.1.3   gplots_3.0.1  
[61] Hmisc_3.17-4  multcomp_1.4-6  DBI_0.5   foreign_0.8-66  withr_1.0.2  mgcv_1.8-12  
[67] xts_0.9-7   survival_2.39-4 abind_1.4-5  nnet_7.3-12  car_2.1-3   KernSmooth_2.23-15 
[73] rmarkdown_1.0  data.table_1.9.6 git2r_0.15.0  digest_0.6.10  httpuv_1.3.3  munsell_0.4.3  
[79] unmarked_0.11-0 

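Edit 2: A nearly identical question has already been asked here: https://stackoverflow.com/a/20081954/5617640. However, the only answer given demonstrates how to predict from a preProcess() object outside of the train() function. As @Misconstruction pointed out in the comments, with that approach the imputation is "not included in the CV loop." - My thoughts exactly.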
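
To make concrete what imputing "inside the CV loop" would mean, here is a rough sketch of a manual version. It is not how caret::train does it internally, and it is not a fix for the error above. The names `x`, `y`, `folds` and `acc`, the 5-fold split, and the injected NAs are all assumptions made for illustration; bagImpute also needs the ipred package to be installed.

```r
library(caret)
library(randomForest)

set.seed(1)
data(iris)

# Hypothetical data: iris predictors with NAs injected into one column
x <- iris[, 1:4]
x[sample(nrow(x), 15), 1] <- NA
y <- iris$Species

# Each list element holds the row indices of one training fold
folds <- createFolds(y, k = 5, returnTrain = TRUE)

acc <- sapply(folds, function(idx) {
  # Fit the imputation model on the training fold only
  pp <- preProcess(x[idx, ], method = "bagImpute")
  x_tr <- predict(pp, x[idx, ])
  # Apply the same fitted imputation to the held-out rows
  x_te <- predict(pp, x[-idx, ])
  fit <- randomForest(x_tr, y[idx], ntree = 250)
  # Accuracy on the held-out fold
  mean(predict(fit, x_te) == y[-idx])
})

print(acc)
```

The point of the sketch is only that the imputation model never sees the held-out rows, which is what an imputation fitted once on the full data, before resampling, cannot guarantee.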

+0

Where does 'multiClassSummary' come from? –

+0

@Hack-R https://github.com/topepo/caret/issues/107 – jlab

Answer

0

This is not a solution to the error message, but it will hopefully address your problem.

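If you are running a random forest model, it already performs a kind of "cross-validation" of its own, in the sense of the out-of-bag (OOB) error estimate. According to this Berkeley article, no additional cross-validation is needed when using random forests: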

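"In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run..." (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)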
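
As a rough illustration of the OOB estimate mentioned above, the sketch below fits a forest directly with the randomForest package and reads off the OOB error after the last tree. It assumes complete data (the unmodified iris set), so the missing values from the question would still have to be handled separately; the names `rf` and `oob_error` are just illustrative.

```r
library(randomForest)

set.seed(42)
data(iris)

# Fit a forest on complete data; no resampling wrapper is involved
rf <- randomForest(Species ~ ., data = iris, ntree = 250)

# err.rate has one row per tree; the "OOB" column is the cumulative
# out-of-bag misclassification rate, so the last row covers all trees
oob_error <- rf$err.rate[rf$ntree, "OOB"]
print(oob_error)
```

Printing `rf` itself also reports the OOB estimate of the error rate along with a confusion matrix.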