2017-05-27 151 views

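r - How does the calibration() function calculate the observed event percentage?

I am working through an example from the book Applied Predictive Modeling (Max Kuhn). It is an example of creating a calibration curve.
I roughly understand the point of the curve, which is to check whether the observed proportion of events matches the predicted probability. But I am struggling to understand how the Percent column of the output is calculated.
Here is the code: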

library(AppliedPredictiveModeling) 
set.seed(975) 
simulatedTrain <- quadBoundaryFunc(500) 
simulatedTest <- quadBoundaryFunc(1000) 


# Random forest 

library(randomForest) 
rfModel <- randomForest(class ~ X1 + X2, 
         data = simulatedTrain, 
         ntree = 2000) 


rfTestPred <- predict(rfModel, simulatedTest, type = "prob") 

simulatedTest$RFprob <- rfTestPred[,"Class1"] 
simulatedTest$RFclass <- predict(rfModel, simulatedTest) 

library(caret) 

# Calibrating probabilities 
calCurve <- calibration(x = class ~ RFprob, data = simulatedTest) 
calCurve$data 




   calibModelVar            bin  Percent     Lower     Upper Count  midpoint
1         RFprob     [0,0.0909]  4.00000  2.203804  6.620306    14  4.545455
2         RFprob (0.0909,0.182] 20.00000 11.648215 30.832609    15 13.636364
3         RFprob  (0.182,0.273] 33.33333 20.395974 48.410832    16 22.727273
4         RFprob  (0.273,0.364] 37.20930 22.975170 53.274905    16 31.818182
5         RFprob  (0.364,0.455] 35.71429 18.640666 55.934969    10 40.909091
6         RFprob  (0.455,0.545] 53.19149 38.077789 67.888473    25 50.000000
7         RFprob  (0.545,0.636] 65.71429 47.789002 80.867590    23 59.090909
8         RFprob  (0.636,0.727] 72.50000 56.111709 85.399101    29 68.181818
9         RFprob  (0.727,0.818] 83.33333 67.188407 93.627987    30 77.272727
10        RFprob  (0.818,0.909] 95.83333 85.745903 99.491353    46 86.363636
11        RFprob      (0.909,1] 94.00000 90.296922 96.603304   235 95.454545

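So, taking the first row as an example, what does Count = 14 mean? As far as I can see, there are 14 rows where the probability computed by the random forest is between 0 and 10% (rounded) and the actual class is Class1: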

nrow(simulatedTest[simulatedTest$RFprob >=0 & simulatedTest$RFprob <=0.0909 & simulatedTest$class == "Class1",]) 

When I plot the chart:

xyplot(calCurve, auto.key = list(columns =2)) 

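I understand that the x-axis is the midpoint column, that is, the midpoint of each bin, and that the y-axis is the Percent column. But how is the Percent column calculated?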


Answer


The Percent column returned by calibration is computed as follows. First, the predicted probabilities are split into 11 equal-width intervals.

simulatedTest$bin <- cut(simulatedTest$RFprob,
                         breaks = seq(0, 1, length.out = 12),
                         include.lowest = TRUE)
table(simulatedTest$bin) 

    [0,0.0909] (0.0909,0.182]  (0.182,0.273]  (0.273,0.364]  (0.364,0.455]
           350             75             48             43             28
 (0.455,0.545]  (0.545,0.636]  (0.636,0.727]  (0.727,0.818]  (0.818,0.909]
            47             35             40             36             48
     (0.909,1]
           250
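
As a minimal base-R sketch of the binning step (the probability vector p below is made up for illustration and is not the random forest output), the same cut() call assigns each probability to one of 11 equal-width bins. The bin midpoints, expressed in percent, agree with the midpoint column of calCurve$data shown in the question:

```r
# Hypothetical predicted probabilities, for illustration only
p <- c(0, 0.05, 0.0909, 0.10, 0.50, 0.95, 1)

# 12 equally spaced break points define 11 bins of width 1/11
breaks <- seq(0, 1, length.out = 12)

# include.lowest = TRUE closes the first bin on the left, so p = 0 is kept
bin <- cut(p, breaks = breaks, include.lowest = TRUE)

# Midpoint of each bin on the percent scale
midpoint <- 100 * (head(breaks, -1) + tail(breaks, -1)) / 2

table(bin)
round(midpoint, 6)
```

Without include.lowest = TRUE, a predicted probability of exactly 0 would fall outside the first interval and become NA.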

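Count can be computed with a simple table: it is the number of Class1 observations in each bin, i.e. the Class1 column of the following table.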

(tbl <- table(simulatedTest$bin,simulatedTest$class)) 

                 Class1 Class2
  [0,0.0909]         14    336
  (0.0909,0.182]     15     60
  (0.182,0.273]      16     32
  (0.273,0.364]      16     27
  (0.364,0.455]      10     18
  (0.455,0.545]      25     22
  (0.545,0.636]      23     12
  (0.636,0.727]      29     11
  (0.727,0.818]      30      6
  (0.818,0.909]      46      2
  (0.909,1]         235     15

The Percent column contains the row proportions of tbl for Class1, expressed in percent:

round(prop.table(tbl, 1) * 100, 6)

                    Class1    Class2
  [0,0.0909]      4.000000 96.000000
  (0.0909,0.182] 20.000000 80.000000
  (0.182,0.273]  33.333333 66.666667
  (0.273,0.364]  37.209302 62.790698
  (0.364,0.455]  35.714286 64.285714
  (0.455,0.545]  53.191489 46.808511
  (0.545,0.636]  65.714286 34.285714
  (0.636,0.727]  72.500000 27.500000
  (0.727,0.818]  83.333333 16.666667
  (0.818,0.909]  95.833333  4.166667
  (0.909,1]      94.000000  6.000000
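
To make the arithmetic concrete, here is a small sketch that rebuilds the first three rows of tbl by hand (the counts are copied from the table printed earlier; the name tbl_small is only for illustration) and shows that Percent is the Class1 count divided by the row total:

```r
# Counts copied from the first three rows of tbl
tbl_small <- matrix(c(14, 336,
                      15,  60,
                      16,  32),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(c("[0,0.0909]", "(0.0909,0.182]", "(0.182,0.273]"),
                                    c("Class1", "Class2")))

# Percent = Class1 events in the bin / all observations in the bin
percent <- 100 * tbl_small[, "Class1"] / rowSums(tbl_small)
percent
```

For the first bin this is 100 * 14 / (14 + 336) = 4, which is the Percent value in the first row of calCurve$data.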

calibration uses binom.test to compute confidence intervals for these proportions:

t(apply(tbl, 1, function(x) { 
    bintst <- binom.test(x=x[1], n=sum(x)) 
    round(100*c(bintst$estimate,bintst$conf.int),6) 
    })) 

                 probability of success
  [0,0.0909]                    4.00000  2.203804  6.620306
  (0.0909,0.182]               20.00000 11.648215 30.832609
  (0.182,0.273]                33.33333 20.395974 48.410832
  (0.273,0.364]                37.20930 22.975170 53.274905
  (0.364,0.455]                35.71429 18.640666 55.934969
  (0.455,0.545]                53.19149 38.077789 67.888473
  (0.545,0.636]                65.71429 47.789002 80.867590
  (0.636,0.727]                72.50000 56.111709 85.399101
  (0.727,0.818]                83.33333 67.188407 93.627987
  (0.818,0.909]                95.83333 85.745903 99.491353
  (0.909,1]                    94.00000 90.296922 96.603304
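
As a single-row check, the sketch below runs binom.test on the first bin only, using the counts from tbl (14 Class1 events out of 14 + 336 = 350 observations). By default binom.test returns an exact (Clopper-Pearson) interval at the 95% level:

```r
# First bin: 14 Class1 events out of 350 observations (counts from tbl)
bt <- binom.test(x = 14, n = 350)

# Point estimate and exact 95% interval, converted to percent
est <- 100 * unname(bt$estimate)
ci  <- 100 * bt$conf.int

c(Percent = est, Lower = ci[1], Upper = ci[2])
```

The three numbers correspond to the Percent, Lower and Upper values reported for the first bin (4.00000, 2.203804, 6.620306).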

Inside calibration, all of these calculations are performed by the caret:::calibCalc function.
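
For reference, the steps can be combined into one base-R function. This is only a sketch under stated assumptions, not the source of caret:::calibCalc: the name calib_table and its arguments are hypothetical, the first factor level is treated as the event, and empty bins are left as NA (how caret handles empty bins is not covered here).

```r
# Hypothetical helper (not part of caret): bins the probabilities, counts
# events, and adds percentages with exact binomial confidence intervals.
calib_table <- function(prob, obs, cuts = 11) {
  breaks <- seq(0, 1, length.out = cuts + 1)
  bin <- cut(prob, breaks = breaks, include.lowest = TRUE)
  tbl <- table(bin, obs)             # rows: bins, columns: class levels
  events <- as.vector(tbl[, 1])      # first factor level is the event
  total <- as.vector(rowSums(tbl))

  out <- data.frame(
    bin = levels(bin),
    Percent = NA_real_, Lower = NA_real_, Upper = NA_real_,
    Count = events,
    midpoint = 100 * (head(breaks, -1) + tail(breaks, -1)) / 2
  )
  # Assumption of this sketch: empty bins keep NA, since binom.test() needs n >= 1
  for (i in which(total > 0)) {
    bt <- binom.test(x = events[i], n = total[i])
    out$Percent[i] <- 100 * unname(bt$estimate)
    out$Lower[i] <- 100 * bt$conf.int[1]
    out$Upper[i] <- 100 * bt$conf.int[2]
  }
  out
}

# Tiny made-up example: four low-probability cases, two high-probability cases
prob <- c(0.05, 0.05, 0.05, 0.05, 0.95, 0.95)
obs <- factor(c("Class1", "Class2", "Class2", "Class2", "Class1", "Class1"),
              levels = c("Class1", "Class2"))
res <- calib_table(prob, obs)
res
```

With the question's data, calib_table(simulatedTest$RFprob, simulatedTest$class) should give the same Percent, Lower, Upper, Count and midpoint values as the tables above, provided Class1 is the first level of class.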
I hope this helps.