2012-01-03 39 views
5

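Using minsplit and unequal weights in rpart

Is there a way to incorporate weights into the minsplit criterion in rpart when the weights are uneven? I could not find a way to make the minsplit threshold take weights into account, and with uneven weights this becomes a problem, as the example below shows. My current workaround is to expand the data so that each row is a single observation, but that seems wasteful in both time and memory (I doubt I could even hold the real dataset I need to work on in memory in expanded form), hence this request for help.

Thanks in advance for your help,
-Saar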
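For readers unfamiliar with the workaround described above, here is a minimal sketch of expanding a weighted data frame into one row per observation. The toy data and the column names `x`, `y` and `w` are illustrative assumptions, not part of the original question, and the approach only applies to integer weights.

```r
# Hypothetical toy data: `w` holds integer case weights (names are illustrative).
toy <- data.frame(x = c(10, 20, 30), y = c("a", "b", "a"), w = c(1, 3, 2))

# Repeat each row index `w` times, then keep only the non-weight columns.
expanded <- toy[rep(seq_len(nrow(toy)), times = toy$w), c("x", "y")]
rownames(expanded) <- NULL

# The expanded data has one row per unit of weight.
nrow(expanded) == sum(toy$w)
```

The row count of the expanded data equals the sum of the weights, which is why this approach becomes expensive in time and memory for large or heavily weighted datasets.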

The following code shows the problem: the first three trees are identical, but the next two (uneven weights) give different results:

## playing with rpart weights 
require(rpart) 
dev.new() 
par(mfrow=c(2,3), xpd=NA) 
data(kyphosis) 

fitOriginal <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, control=rpart.control(minsplit=15)) 
plot(fitOriginal) 
text(fitOriginal, use.n=TRUE) 

# this dataset is the original data repeated 3 times 
kyphosisRepeated <- rbind(kyphosis, kyphosis, kyphosis) 
fitRepeated <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisRepeated, control=rpart.control(minsplit=45)) 
plot(fitRepeated) 
text(fitRepeated, use.n=TRUE) 

# instead of repeating, use weights 
kyphosisWeighted <- kyphosis 
kyphosisWeighted$myWeights <- 3 
fitWeighted <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisWeighted, weights=myWeights, 
    control=rpart.control(minsplit=15))  ## minsplit has to be adjusted for weights... 
plot(fitWeighted) 
text(fitWeighted, use.n=TRUE) 

# uneven weights don't work the same way
kyphosisUnevenWeights <- rbind(kyphosis, kyphosis) 
kyphosisUnevenWeights$myWeights <- c(rep(1,length.out=nrow(kyphosis)), rep(2,length.out=nrow(kyphosis))) 

fitUneven15 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
    control=rpart.control(minsplit=15)) 
plot(fitUneven15) 
text(fitUneven15, use.n=TRUE) 

fitUneven45 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
    control=rpart.control(minsplit=45)) 
plot(fitUneven45) 
text(fitUneven45, use.n=TRUE) 

## 30 works, but seems like a special case 
fitUneven30 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
    control=rpart.control(minsplit=30)) 
plot(fitUneven30) 
text(fitUneven30, use.n=TRUE) 
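Comparing six plots by eye is error-prone, so a programmatic check may help. The sketch below uses a hypothetical helper, `same_tree()`, which is not part of rpart: it treats two fits as the same tree when their node numbers and split labels match, and it deliberately ignores node counts and class probabilities. That definition of "same" is an assumption made for illustration.

```r
library(rpart)
data(kyphosis)

# Hypothetical helper (not an rpart function): two fits count as the same tree
# when they have the same node numbers and the same split labels.
same_tree <- function(a, b) {
  identical(row.names(a$frame), row.names(b$frame)) &&
    identical(labels(a), labels(b))
}

form <- Kyphosis ~ Age + Number + Start

fitOriginal <- rpart(form, data = kyphosis,
                     control = rpart.control(minsplit = 15))

# Each row appears twice: once with weight 1 and once with weight 2.
kyphosisUnevenWeights <- rbind(kyphosis, kyphosis)
kyphosisUnevenWeights$myWeights <- rep(c(1, 2), each = nrow(kyphosis))

# Fit the uneven-weight data with each minsplit value used above and compare
# the result with the original tree.
for (ms in c(15, 30, 45)) {
  fit <- rpart(form, data = kyphosisUnevenWeights, weights = myWeights,
               control = rpart.control(minsplit = ms))
  cat("minsplit =", ms, "same as original:", same_tree(fitOriginal, fit), "\n")
}
```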

Answers

0

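There is no problem here. If you use a dataset that is twice the size of the original and then ask for a minsplit that is 3 times the original minsplit, then of course you will grow a shorter tree (assuming the correlations between the weights stay the same). If you keep the weights the same and also keep the ratio minsplit/n the same, you will grow the same tree, as these revised examples show.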

## playing with rpart weights 
require(rpart) 
dev.new() 
par(mfrow=c(2,2), xpd=NA) 
data(kyphosis) 

# this dataset is the original data repeated 2 times
# without weights
kyphosisRepeated <- rbind(kyphosis, kyphosis) 
fitRepeated <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisRepeated, control=rpart.control(minsplit=30)) 
plot(fitRepeated) 
text(fitRepeated, use.n=TRUE) 

# with weights 
kyphosisUnevenWeights <- rbind(kyphosis, kyphosis) 
kyphosisUnevenWeights$myWeights <- c(rep(1,length.out=nrow(kyphosis)), rep(2,length.out=nrow(kyphosis))) 

fitUneven30 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
        control=rpart.control(minsplit=30)) 
plot(fitUneven30) 
text(fitUneven30, use.n=TRUE) 
################################################################################################################ 

# this dataset is the original data repeated 3 times 
# without weights 
kyphosisRepeated <- rbind(kyphosis, kyphosis, kyphosis) 
fitRepeated <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisRepeated, control=rpart.control(minsplit=45)) 
plot(fitRepeated) 
text(fitRepeated, use.n=TRUE) 

# with weights 
kyphosisUnevenWeights <- rbind(kyphosis, kyphosis, kyphosis) 
kyphosisUnevenWeights$myWeights <- c(rep(1,length.out=nrow(kyphosis)), rep(2,length.out=nrow(kyphosis)), rep(3,length.out=nrow(kyphosis))) 

fitUneven45 <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosisUnevenWeights, weights=myWeights, 
        control=rpart.control(minsplit=45)) 
plot(fitUneven45) 
text(fitUneven45, use.n=TRUE) 
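One way to see why the ratio minsplit/n matters is to look at the fitted object's `frame`, which records both the number of rows (`n`) and the sum of case weights (`wt`) reaching each node. The rpart documentation describes minsplit as a minimum number of observations in a node, so the sketch below assumes it is compared against `n` rather than `wt`; that reading is an assumption worth checking against your installed rpart version.

```r
library(rpart)
data(kyphosis)

# Same data as the original, with every row given weight 3.
kyphosisWeighted <- kyphosis
kyphosisWeighted$myWeights <- 3

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosisWeighted,
             weights = myWeights, control = rpart.control(minsplit = 15))

# `n` is the number of rows reaching each node; `wt` is the sum of their weights.
fit$frame[, c("var", "n", "wt")]
```

If `n` and `wt` differ by a node-dependent factor, as they can with uneven weights, a single minsplit value cannot stand in for a weight-based threshold.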

For more details on rpart, see this blog post.

+0

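I am trying to use uneven weights and minsplit on a general dataset, and the example shows that it does not work. Evening out the weights is not a general solution, because it may make the dataset too large. – Saar 2014-09-26 03:18:24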

+0

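@Saar, I apologize if I am missing something obvious. You say the examples show that "it does not work". In what way does it not work? When I test the examples, each one grows a tree without any errors. Is there a tree that grows in a way you did not expect? – Ben 2014-09-26 05:22:44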

+0

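In all 6 examples the data is the same data, represented in different ways (except for the first example): each observation is either repeated three times, or appears once with a weight of 3, or appears twice with weights that sum to 3. I expect the tree built from it to be the same tree (the same data, the same algorithm and the same conditions should lead to the same output). Specifically, the fifth example should give me the same tree as the second and third examples. It does not. This is not about runtime errors, but about getting the wrong answer... – Saar 2014-09-26 11:18:23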
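The arithmetic behind the two positions in this thread can be laid out in a few lines. This sketch only tabulates row counts, total weights and the minsplit values used in the examples; the data frame and its column names are illustrative, and it does not fit any trees.

```r
n <- 81  # number of rows in the kyphosis data

reps <- data.frame(
  representation = c("repeated 3 times", "once, weight 3", "twice, weights 1 and 2"),
  rows           = c(3 * n, n, 2 * n),
  total_weight   = c(3 * n * 1, n * 3, n * 1 + n * 2),
  minsplit       = c(45, 15, 30)
)

# minsplit expressed per row and per unit of total weight.
reps$minsplit_per_row    <- reps$minsplit / reps$rows
reps$minsplit_per_weight <- reps$minsplit / reps$total_weight
reps
```

All three representations carry the same total weight, but they have different row counts, so a fixed minsplit of 45 corresponds to a different fraction of the rows in each one. The minsplit values 45, 15 and 30 are the ones that keep minsplit per row constant.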