
Bad results using a precomputed chi2 kernel with libsvm (MATLAB)

I am experimenting with libsvm and followed the example to train an SVM on the heart_scale data that ships with the software. I want to use a chi2 kernel that I precompute myself. The classification rate on the training data drops to 24%. I am sure I compute the kernel correctly, but I guess I must be doing something wrong. The code is below. Can you see any mistakes? Help would be greatly appreciated.

%read in the data:
[heart_scale_label, heart_scale_inst] = libsvmread('heart_scale');
train_data = heart_scale_inst(1:150,:);
train_label = heart_scale_label(1:150,:);
test_data = heart_scale_inst(151:270,:);   %hold out the remaining samples for testing

%read somewhere that the kernel should not be sparse
ttrain = full(train_data)';
ttest = full(test_data)';

precKernel = chi2_custom(ttrain', ttrain');
model_precomputed = svmtrain2(train_label, [(1:150)', precKernel], '-t 4');
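
For reference, libsvm's precomputed-kernel mode (selected with '-t 4') expects the kernel matrix to be passed with the sample serial numbers 1..n prepended as the first column, which is what the [(1:150)', precKernel] argument above provides. A minimal sketch of that convention (variable names here are illustrative, not from the original post):

%minimal sketch of libsvm's precomputed-kernel input format
n = size(precKernel,1);                 %number of training samples
trainK = [(1:n)', precKernel];          %column 1 must hold the serial numbers 1..n
model = svmtrain2(train_label, trainK, '-t 4');   %'-t 4' selects the precomputed kernel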

This is how the kernel is precomputed:

function res = chi2_custom(x,y)
a = size(x);
b = size(y);
res = zeros(a(1,1), b(1,1));
%compute one kernel entry for every pair of rows of x and y
for i=1:a(1,1)
    for j=1:b(1,1)
        resHelper = chi2_ireneHelper(x(i,:), y(j,:));
        res(i,j) = resHelper;
    end
end

function resHelper = chi2_ireneHelper(x,y)
a = (x-y).^2;
b = (x+y);
resHelper = sum(a./(b + eps));

With a different SVM implementation (vlfeat) I get a classification rate on the training data (yes, I tested on the training data, just to see what is happening) of about 90%. So I am pretty sure the libsvm result is wrong.

Answers

0

The problem is in the following line:

resHelper = sum(a./(b + eps)); 

It should be (turning the chi-squared distance into a similarity, since an SVM kernel must grow as samples become more alike):

resHelper = 1-sum(2*a./(b + eps)); 
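
Folding that fix back into the asker's helper (a sketch of the corrected function, based on the line above):

function resHelper = chi2_ireneHelper(x,y)
%chi-squared similarity between two row vectors:
%  k(x,y) = 1 - sum( 2*(x-y).^2 ./ (x+y) )
a = (x-y).^2;
b = (x+y);
resHelper = 1 - sum(2*a./(b + eps));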
Thank you for answering my question, I only just saw your response. – Sallos

@Sallos: although your formula is slightly off, the real problem is the data normalization. See my answer. – Amro

15

When working with support vector machines, it is very important to normalize the dataset as a preprocessing step. Normalization puts the attributes on the same scale and prevents attributes with large values from biasing the result. It also improves numerical stability (minimizing the likelihood of overflow and underflow due to the floating-point representation).
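
Concretely, min-max scaling maps every attribute j into the [0,1] range:

x(i,j) <- ( x(i,j) - min_j ) / ( max_j - min_j )

which is exactly what the bsxfun lines in the example further below compute.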

Also, to be precise, your computation of the chi-squared kernel is slightly off. Instead, take the definition below and use this faster implementation of it:

k(x,y) = 1 - sum_i [ 2*(x_i - y_i)^2 / (x_i + y_i) ]

function D = chi2Kernel(X,Y)
    %# chi-squared kernel between all rows of X and all rows of Y
    D = zeros(size(X,1),size(Y,1));
    for i=1:size(Y,1)
        d = bsxfun(@minus, X, Y(i,:));        %# x - y against the i-th sample
        s = bsxfun(@plus, X, Y(i,:));         %# x + y against the i-th sample
        D(:,i) = sum(d.^2 ./ (s/2+eps), 2);   %# chi-squared distances
    end
    D = 1 - D;                                %# turn distances into similarities
end
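
As a quick sanity check of this implementation (an illustrative snippet, not part of the original answer): for nonnegative features, the kernel of a sample set against itself should be symmetric with ones on the diagonal, since the chi-squared distance of a sample to itself is zero.

%# sanity check on toy data (illustrative; assumes nonnegative features)
X = rand(5,3);                           %# 5 samples, 3 features in [0,1]
K = chi2Kernel(X,X);
assert(all(abs(diag(K) - 1) < 1e-12));   %# self-similarity is 1
assert(norm(K - K','fro') < 1e-12);      %# kernel matrix is symmetric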

Now consider the following complete example for you, using the same dataset (code adapted from my previous answer):

%# read dataset 
[label,data] = libsvmread('./heart_scale'); 
data = full(data);  %# sparse to full 

%# normalize data to [0,1] range 
mn = min(data,[],1); mx = max(data,[],1); 
data = bsxfun(@rdivide, bsxfun(@minus, data, mn), mx-mn); 

%# split into train/test datasets 
trainData = data(1:150,:); testData = data(151:270,:); 
trainLabel = label(1:150,:); testLabel = label(151:270,:); 
numTrain = size(trainData,1); numTest = size(testData,1); 

%# compute kernel matrices between every pairs of (train,train) and 
%# (test,train) instances and include sample serial number as first column 
K = [ (1:numTrain)' , chi2Kernel(trainData,trainData) ]; 
KK = [ (1:numTest)' , chi2Kernel(testData,trainData) ]; 

%# view 'train vs. train' kernel matrix 
figure, imagesc(K(:,2:end)) 
colormap(pink), colorbar 

%# train model 
model = svmtrain(trainLabel, K, '-t 4'); 

%# test on testing data 
[predTestLabel, acc, decVals] = svmpredict(testLabel, KK, model); 
cmTest = confusionmat(testLabel,predTestLabel) 

%# test on training data 
[predTrainLabel, acc, decVals] = svmpredict(trainLabel, K, model); 
cmTrain = confusionmat(trainLabel,predTrainLabel) 

The results on the test data:

Accuracy = 84.1667% (101/120) (classification) 
cmTest = 
    62  8 
    11 39 

and on the training data we get around 90% accuracy, as you expected:

Accuracy = 92.6667% (139/150) (classification) 
cmTrain = 
    77  3 
    8 62 

(figure: 'train vs. train' kernel matrix, as visualized by the imagesc call above)

Oh cool - this is a detailed answer. Thanks for taking the time to think about my problem. It definitely helps. – Sallos

@Sallos: glad I could help. Please consider [accepting](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) an answer if it solved the problem. – Amro