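I am working through the examples in the book Applied Predictive Modeling (Max Kuhn). This one creates a calibration curve.
I more or less understand the point of the curve, which is to see whether the proportion of actual events in each bin matches the predicted probability. But I am struggling to understand how the Percent column of the output is calculated.
Here is the code: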
library(AppliedPredictiveModeling)
set.seed(975)
simulatedTrain <- quadBoundaryFunc(500)
simulatedTest <- quadBoundaryFunc(1000)
# Random forest
library(randomForest)
rfModel <- randomForest(class ~ X1 + X2,
                        data = simulatedTrain,
                        ntree = 2000)
rfTestPred <- predict(rfModel, simulatedTest, type = "prob")
simulatedTest$RFprob <- rfTestPred[,"Class1"]
simulatedTest$RFclass <- predict(rfModel, simulatedTest)
library(caret)
# Calibrating probabilities
calCurve <- calibration(x = class ~ RFprob, data = simulatedTest)
calCurve$data
   calibModelVar             bin  Percent     Lower     Upper Count  midpoint
 1        RFprob      [0,0.0909]  4.00000  2.203804  6.620306    14  4.545455
 2        RFprob  (0.0909,0.182] 20.00000 11.648215 30.832609    15 13.636364
 3        RFprob   (0.182,0.273] 33.33333 20.395974 48.410832    16 22.727273
 4        RFprob   (0.273,0.364] 37.20930 22.975170 53.274905    16 31.818182
 5        RFprob   (0.364,0.455] 35.71429 18.640666 55.934969    10 40.909091
 6        RFprob   (0.455,0.545] 53.19149 38.077789 67.888473    25 50.000000
 7        RFprob   (0.545,0.636] 65.71429 47.789002 80.867590    23 59.090909
 8        RFprob   (0.636,0.727] 72.50000 56.111709 85.399101    29 68.181818
 9        RFprob   (0.727,0.818] 83.33333 67.188407 93.627987    30 77.272727
10        RFprob   (0.818,0.909] 95.83333 85.745903 99.491353    46 86.363636
11        RFprob       (0.909,1] 94.00000 90.296922 96.603304   235 95.454545
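So, taking the first row as an example, what does Count = 14 represent? From what I can see, there are 14 rows where the probability calculated by the random forest is between 0 and 10% (rounded) and the actual class is Class1.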
nrow(simulatedTest[simulatedTest$RFprob >=0 & simulatedTest$RFprob <=0.0909 & simulatedTest$class == "Class1",])
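When I plot the curve with
xyplot(calCurve, auto.key = list(columns = 2))
I understand that the x-axis is the midpoint column, i.e. the midpoint of each bin, and the y-axis is the Percent column. But how is the Percent column calculated?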
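One reading that seems consistent with the output above is that Percent is the share of Class1 rows among all rows whose predicted probability falls in the bin, with Count being the numerator. Below is a minimal base-R sketch for testing that reading. It is not caret's code: bin_percent() is a hypothetical helper, the 11 equal-width bins built with cut() are an assumption inferred from the bin labels, and the toy vectors are made up purely for illustration.

```r
# Hypothetical helper (not part of caret): share of `event` rows per probability bin.
bin_percent <- function(prob, obs, event = levels(obs)[1], cuts = 11) {
  # Assumption: equal-width bins on [0, 1], lowest bin closed on the left
  bin <- cut(prob, breaks = (0:cuts) / cuts, include.lowest = TRUE)
  is_event <- obs == event
  total <- as.vector(tapply(is_event, bin, length))  # all rows in the bin
  count <- as.vector(tapply(is_event, bin, sum))     # rows in the bin whose class is `event`
  data.frame(bin = levels(bin),
             Count = count,
             Total = total,
             Percent = 100 * count / total)          # NA for empty bins
}

# Made-up toy data: four low and four high predicted probabilities
toy_prob <- c(0.02, 0.05, 0.08, 0.03, 0.95, 0.97, 0.99, 0.93)
toy_obs  <- factor(c("Class2", "Class2", "Class2", "Class1",
                     "Class1", "Class1", "Class1", "Class2"),
                   levels = c("Class1", "Class2"))

res <- bin_percent(toy_prob, toy_obs)
print(res[!is.na(res$Total), ])  # only the non-empty bins

# With the objects from the question, the equivalent call would be:
# bin_percent(simulatedTest$RFprob, simulatedTest$class)
```

If this reading is right, the first row of the real output (Count = 14, Percent = 4) would imply 14 / 0.04 = 350 test rows in the [0,0.0909] bin, which can be checked against the Total column returned by the call in the last comment.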