Poisson regression in xgboost fails for low frequencies

I am trying to implement a boosted Poisson regression model in xgboost, but I am finding the results to be biased at low frequencies. To illustrate, here is some minimal Python code that I believe replicates the issue:

import numpy as np 
import pandas as pd 
import xgboost as xgb 

def get_preds(mult): 
    # generate toy dataset for illustration 
    # 4 observations with linearly increasing frequencies 
    # the frequencies are scaled by `mult` 
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])

    # train a poisson booster on the toy data 
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)

    # return fitted frequencies after reversing scaling 
    return bst.predict(dmat)/mult 

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies 
mults = [10**i for i in range(-8, 1)] 
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 0)) 
df.index = mults 
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"] 
df 

# --- result ---
#                (0, 0)   (0, 1)   (1, 0)   (1, 1)
# 1.000000e-08  11598.0  11598.0  11598.0  11598.0
# 1.000000e-07   1161.0   1161.0   1161.0   1161.0
# 1.000000e-06    118.0    118.0    118.0    118.0
# 1.000000e-05     12.0     12.0     12.0     12.0
# 1.000000e-04      2.0      2.0      3.0      3.0
# 1.000000e-03      1.0      2.0      3.0      4.0
# 1.000000e-02      1.0      2.0      3.0      4.0
# 1.000000e-01      1.0      2.0      3.0      4.0
# 1.000000e+00      1.0      2.0      3.0      4.0

Notice that the predictions seem to blow up at low frequencies. This may have something to do with the Poisson lambda * weight dropping below 1 (and indeed, increasing the weight above 1000 does shift the "blow up" to even lower frequencies), but I would still expect the predictions to approach the mean training frequency (2.5). Also (not shown in the example above), reducing eta appears to increase the amount of bias in the predictions.
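
To make the lambda * weight heuristic concrete (my own spot check, not part of the original run):

# with weight = 1000, label * weight first drops below 1 between
# mult = 1e-3 (values 1 to 4) and mult = 1e-4 (values 0.1 to 0.4),
# which is exactly where the fitted values in the table above degrade
mult, weight = 1e-4, 1000
print([i * mult * weight for i in [1, 2, 3, 4]])  # approx. [0.1, 0.2, 0.3, 0.4]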

What could cause this to happen? Is there a parameter that would mitigate the effect?

Answer

After some digging, I found a solution; documenting it here in case anyone else runs into the same problem. It turns out that I needed to add an offset term equal to the (natural) log of the mean frequency. In case it isn't obvious why: the initial predictions start at a frequency of 0.5, and many boosting iterations are needed just to rescale the predictions to the mean frequency.
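
To spell out the arithmetic (a minimal sketch, assuming count:poisson uses a log link, i.e. predictions are the exponential of the base margin plus the boosted score):

import numpy as np

labels = [1, 2, 3, 4]
offset = np.log(np.mean(labels))   # log of the mean frequency, ~0.92
# with this offset as the base margin, the model already predicts the
# mean frequency before any trees are added:
print(np.exp(offset))              # 2.5
# without the offset, the margin starts at log(0.5) ~ -0.69; at
# mult = 1e-8 it would have to travel all the way to log(2.5e-8) ~ -17.5,
# and training stalls long before it gets there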

See the code below for the updated toy example. As I expected in the original question, the predictions now approach the mean frequency (2.5) at the lower scales.

import numpy as np 
import pandas as pd 
import xgboost as xgb 

def get_preds(mult): 
    # generate toy dataset for illustration 
    # 4 observations with linearly increasing frequencies 
    # the frequencies are scaled by `mult` 
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])

    ## adding an offset term equal to the log of the mean frequency 
    offset = np.log(np.mean([i*mult for i in [1, 2, 3, 4]])) 
    dmat.set_base_margin(np.repeat(offset, 4)) 

    # train a poisson booster on the toy data 
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)

    # return fitted frequencies after reversing scaling 
    return bst.predict(dmat)/mult 

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies 
mults = [10**i for i in range(-8, 1)] 
## round to 1 decimal place to show the result approaches 2.5
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 1)) 
df.index = mults 
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"] 
df 

# --- result ---
#                (0, 0)  (0, 1)  (1, 0)  (1, 1)
# 1.000000e-08     2.5     2.5     2.5     2.5
# 1.000000e-07     2.5     2.5     2.5     2.5
# 1.000000e-06     2.5     2.5     2.5     2.5
# 1.000000e-05     2.5     2.5     2.5     2.5
# 1.000000e-04     2.4     2.5     2.5     2.6
# 1.000000e-03     1.0     2.0     3.0     4.0
# 1.000000e-02     1.0     2.0     3.0     4.0
# 1.000000e-01     1.0     2.0     3.0     4.0
# 1.000000e+00     1.0     2.0     3.0     4.0
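
As a sanity check (my own addition, assuming predict(output_margin=True) returns the raw log-scale score including the base margin, as in current xgboost), exponentiating the margin should reproduce the frequency predictions:

import numpy as np
import xgboost as xgb

# rebuild one case from the example above and confirm that the offset
# really enters the model
mult = 1e-6
labels = [i*mult for i in [1, 2, 3, 4]]
dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                   label=labels, weight=[1000]*4)
dmat.set_base_margin(np.repeat(np.log(np.mean(labels)), 4))
bst = xgb.train(params={"objective": "count:poisson"}, dtrain=dmat,
                num_boost_round=1000, early_stopping_rounds=5,
                evals=[(dmat, "train")], verbose_eval=False)
raw = bst.predict(dmat, output_margin=True)   # margin includes the offset
assert np.allclose(np.exp(raw), bst.predict(dmat))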