2015-06-21 63 views

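Accessing fold indices in caret

I want to define a custom metric function for caret, but inside this function I want to use additional information that was not used for training. I therefore need the indices (row numbers) of the data used for validation in the current fold.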

Here is a silly example:

Generate the data:

library(caret) 
set.seed(1234) 

x <- matrix(rnorm(10),nrow=5,ncol=2) 
y <- factor(c("y","n","y","y","n")) 

priors <- c(1,3,2,7,9) 

This is my example metric function; it should use the information from the priors vector:

my.metric <- function (data, 
        lev = NULL, 
        model = NULL) { 
      out <- priors[-->INDICES.OF.DATA<--] + data$pred/data$obs 
      names(out) <- "MYMEASURE" 
      out 
} 

myControl <- trainControl(summaryFunction = my.metric, method="repeatedcv", number=10, repeats=2) 

fit <- train(y=y,x=x, metric = "MYMEASURE",method="gbm", trControl = myControl) 

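Some further information that may make this clearer: I could use this setup in a survival setting, where priors are days, and use them in a Surv object to measure survival AUC inside the metric function.

How can I do this in caret?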


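Khl4v's answer is perfect. **spore234:** if you are interested in using survival models in the package, please get in touch with me (Max Kuhn; I maintain 'caret'). We are working out how this would work and I would like some input. – topepo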

Answers


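You can access the row numbers with data$rowIndex. Note that the summary function should return a single number as its metric (e.g. ROC, Accuracy, RMSE, ...). The function above appears to return a vector whose length equals the number of observations in the held-out CV data.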
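To make the two points above concrete without running a full train() call, here is a minimal sketch in base R. The fold data frame is a hand-built, hypothetical stand-in for what caret passes to the summary function for one held-out fold (the columns obs, pred, one probability column per class, and rowIndex are assumed from the description in this answer), and the prior-weighted accuracy is only an illustrative choice of metric, not the one used in the full example.

```r
# One extra value per row of the (hypothetical) training set
priors <- c(0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.6)

# Hypothetical held-out fold: rows 2, 5, 7 and 8 of the training data
fold <- data.frame(
  obs      = factor(c("y", "n", "y", "n"), levels = c("n", "y")),
  pred     = factor(c("y", "n", "n", "n"), levels = c("n", "y")),
  n        = c(0.3, 0.8, 0.6, 0.9),
  y        = c(0.7, 0.2, 0.4, 0.1),
  rowIndex = c(2L, 5L, 7L, 8L)
)

my.metric <- function(data, lev = NULL, model = NULL) {
  # Look up the extra information for exactly the held-out rows
  w <- priors[data$rowIndex]
  # Collapse everything to ONE number (here: a prior-weighted accuracy)
  out <- sum(w * (data$pred == data$obs)) / sum(w)
  names(out) <- "MYMEASURE"
  out
}

res <- my.metric(fold)
res
```

Calling the function directly on a small data frame like this is also a cheap way to check that it returns a single named value before handing it to trainControl.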

If you are interested in seeing the resamples along with their predictions, you can add print(data) to the my.metric function.

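Here is an example using your data (expanded a bit) that uses Metrics::auc as the performance measure, after multiplying the predicted class probabilities by the priors: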

library(caret) 
library(Metrics) 

set.seed(1234) 
x <- matrix(rnorm(100), nrow=100, ncol=2) 
set.seed(1234) 
y <- factor(sample(x = c("y", "n"), size = 100, replace = TRUE)) 

priors <- runif(n = length(y), min = 0.1, max = 0.9) 

my.metric <- function(data, lev = NULL, model = NULL) 
{ 
    # The performance metric should be a single number 
    # data$y are the predicted probabilities of 
    # the observations in the fold belonging to class "y" 
    out <- Metrics::auc(actual = as.numeric(data$obs == "y"), 
         predicted = priors[data$rowIndex] * data$y) 
    names(out) <- "MYMEASURE" 
    out 
} 

fitControl <- trainControl(method = "repeatedcv", 
          number = 10, 
          classProbs = TRUE, 
          repeats = 2, 
          summaryFunction = my.metric) 

set.seed(1234) 
fit <- train(y = y, 
      x = x, 
      metric = "MYMEASURE", 
      method="gbm", 
      verbose = FALSE, 
      trControl = fitControl) 
fit 

# Stochastic Gradient Boosting 
# 
# 100 samples 
# 2 predictor 
# 2 classes: 'n', 'y' 
# 
# No pre-processing 
# Resampling: Cross-Validated (10 fold, repeated 2 times) 
# 
# Summary of sample sizes: 90, 90, 90, 90, 90, 89, ... 
# 
# Resampling results across tuning parameters: 
#  
#   interaction.depth  n.trees  MYMEASURE  MYMEASURE SD 
#   1                   50      0.5551667  0.2348496 
#   1                  100      0.5682500  0.2297383 
#   1                  150      0.5797500  0.2274042 
#   2                   50      0.5789167  0.2246845 
#   2                  100      0.5941667  0.2053826 
#   2                  150      0.5900833  0.2186712 
#   3                   50      0.5750833  0.2291999 
#   3                  100      0.5488333  0.2312470 
#   3                  150      0.5577500  0.2202638 
# 
# Tuning parameter 'shrinkage' was held constant at a value of 0.1 
# Tuning parameter 'n.minobsinnode' was held constant at a value of 10 
# MYMEASURE was used to select the optimal model using the largest value. 

I don't know much about survival analysis, but I hope this helps.