2014-11-03

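Need a better understanding of Python scikit-learn fit/predict loop versus all-at-once results

Here's my Python block (2.7; I learned Python 3, so I use the `print_function` future import to get the print formatting I'm used to). It uses learning code from a scikit-learn release a few versions back, which I'm locked into because of corporate IT policy. It uses the SVC engine. What I don't understand is why the +/- 1 results from the first approach (using simple_clf) and the second approach differ. Structurally I think they're identical: the first processes the whole array of data at once, and the second just uses the data one piece at a time. Yet the results don't agree. The values generated for the average (mean) score should be a decimal fraction (0.0 to 1.0). In some cases the difference is small, but in others it's large enough for me to ask my question.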

from __future__ import print_function 
import os 
import numpy as np 
from numpy import array, loadtxt 
from sklearn import cross_validation, datasets, svm, preprocessing, grid_search 
from sklearn.cross_validation import train_test_split 
from sklearn.metrics import precision_score 

GRADES = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'M'] 

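# Initial processing
# FEATUREVECFILE and SCORESFILE are path constants defined elsewhere (not shown)
featurevecs = loadtxt(FEATUREVECFILE)
with open(SCORESFILE) as f:
    scorelines = f.readlines()[1:]  # Skip header line
# Map each letter grade (second tab-separated column) to its index in GRADES
scorenums = array([GRADES.index(l.split('\t')[1].strip()) for l in scorelines])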

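# Need this step to normalize the feature vectors
# (the scaler has to be fitted before it can transform; Scaler is the old
# name of what later scikit-learn releases call StandardScaler)
scaler = preprocessing.Scaler()
featurevecs = scaler.fit_transform(featurevecs)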

# Break up the vector into a training and testing vector 
# Need to keep the training set somewhat large to get enough of the 
# scarce results in the training set or the learning fails 
X_train, X_test, y_train, y_test = train_test_split(
    featurevecs, scorenums, test_size = 0.333, random_state = 0) 

# Define a range of parameters we can use to do a grid search 
# for the 'best' ones. 
CLFPARAMS = {'gamma':[.0025, .005, 0.09, .01, 0.011, .02, .04], 
      'C':[200, 300, 400, 500, 600]} 

# do a simple cross validation 
simple_clf = svm.SVC() 
simple_clf = grid_search.GridSearchCV(simple_clf, CLFPARAMS, cv = 3) 
simple_clf.fit(X_train, y_train) 
y_true, y_pred = y_test, simple_clf.predict(X_test) 
match = 0
close = 0
count = 0
deviation = []
for i in range(len(y_true)):
    count += 1
    delta = np.abs(y_true[i] - y_pred[i])
    if delta == 0:
        match += 1
    elif delta == 1:
        close += 1
    # 1.0 if the prediction is within one grade of the truth, else 0.0
    deviation.append(float(delta <= 1))
deviation = np.array(deviation)
avg = float(match) / float(count)
close_avg = float(close) / float(count)
# deviation.mean() == avg + close_avg
# test_type is the label of the current feature set, defined elsewhere (not shown)
print('{0} Accuracy (+/- 0) {1:0.4f} Accuracy (+/- 1) {2:0.4f} (+/- {3:0.4f}) '.format(
    test_type, avg, deviation.mean(), deviation.std() / 2.0), end="")

# "Original" code 
# do LeaveOneOut item by item 
clf = svm.SVC()
clf = grid_search.GridSearchCV(clf, CLFPARAMS, cv = 3)
toleratePara = 1
thecurrentScoreGraded = []
loo = cross_validation.LeaveOneOut(n = len(featurevecs))
for train, test in loo:
    try:
        clf.fit(featurevecs[train], scorenums[train])
        rawPredictionResult = clf.predict(featurevecs[test])

        errorVec = scorenums[test] - rawPredictionResult
        print(len(errorVec), errorVec)
        # Fraction of this fold's predictions within toleratePara grades
        thecurrentScoreGraded = np.append(
            thecurrentScoreGraded,
            float(np.sum(np.abs(errorVec) <= toleratePara)) / len(errorVec))
    except ValueError:
        # The handler body is not shown in the post; skipping the fold is an assumption
        continue
print('{0} Accuracy (+/- {1:d}) {2:0.4f} (+/- {3:0.4f})'.format(
    test_type, toleratePara, thecurrentScoreGraded.mean(), thecurrentScoreGraded.std() / 2))


original Accuracy (+/- 0) 0.2771 Accuracy (+/- 1) 0.6024 (+/- 0.2447) 
         original Accuracy (+/- 1) 0.6185 (+/- 0.2429) 
upostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
         upostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398) 
npostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
         npostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398) 
tancurv Accuracy (+/- 0) 0.2330 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
         tancurv Accuracy (+/- 1) 0.5831 (+/- 0.2465) 
npostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
         npostan Accuracy (+/- 1) 0.7003 (+/- 0.2291) 
nposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
         nposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453) 
upostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
         upostan Accuracy (+/- 1) 0.7003 (+/- 0.2291) 
uposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
         uposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453) 
upos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
         upos Accuracy (+/- 1) 0.6450 (+/- 0.2393) 
npos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
         npos Accuracy (+/- 1) 0.6450 (+/- 0.2393) 
curv Accuracy (+/- 0) 0.1553 Accuracy (+/- 1) 0.4854 (+/- 0.2499) 
         curv Accuracy (+/- 1) 0.5570 (+/- 0.2484) 
tan Accuracy (+/- 0) 0.3107 Accuracy (+/- 1) 0.7184 (+/- 0.2249) 
         tan Accuracy (+/- 1) 0.7231 (+/- 0.2237) 



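What do you mean by "structurally they're identical"? You use different subsets for training and testing, and they have different sizes. If you don't train on exactly the same data, I don't see why you would expect the results to be the same.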
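To make the difference concrete, here is a minimal sketch in plain Python (no scikit-learn) of which samples each scheme trains and tests on. The helper names `holdout_indices` and `leave_one_out_indices` and the sample count of 12 are made up for illustration; they only mimic what `train_test_split` and `LeaveOneOut` do with indices.

```python
import random

def holdout_indices(n, n_test, seed=0):
    """One shuffled split: returns (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]

def leave_one_out_indices(n):
    """Yield n (train, test) pairs; each test set holds a single sample."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

n = 12  # hypothetical sample count, not the size of the real data set

# Hold-out: one model, fitted on 2/3 of the data, scored on the other 1/3
train, test = holdout_indices(n, n_test=4)
print("hold-out: 1 fit, train size", len(train), "test size", len(test))

# LOO: n models, each fitted on n - 1 samples and scored on the one left out
loo_folds = list(leave_one_out_indices(n))
print("LOO:", len(loo_folds), "fits, train size", len(loo_folds[0][0]),
      "test size", len(loo_folds[0][1]))
```

The hold-out model is fitted once on about two thirds of the data and scored only on the remaining third. Each LOO model sees all but one sample, and the pooled score covers every sample. The two numbers therefore come from different models and different test points.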

By the way, also have a look at the note on LOO in the documentation. LOO can have high variance.
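One thing the posted numbers already show: every per-sample score in both loops is either 0 or 1, so the printed "(+/- x)" value is fully determined by the mean and says nothing extra about how reliable the estimate is. Below is a small sketch of that arithmetic, using the two means from the "original" row of the output. The function names are mine, not from scikit-learn, and I am assuming the default population standard deviation that `std()` computes on a NumPy array.

```python
import math

def population_std(values):
    """Population standard deviation (ddof=0), the NumPy std() default."""
    mean = sum(values) / float(len(values))
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def half_std_from_mean(p):
    """For scores that are all 0 or 1 with mean p, std is sqrt(p * (1 - p))."""
    return math.sqrt(p * (1.0 - p)) / 2.0

# Sanity check on a tiny 0/1 score list with mean 0.6
scores = [1.0, 1.0, 1.0, 0.0, 0.0]
print(population_std(scores) / 2.0, half_std_from_mean(0.6))

# Means taken from the "original" row of the posted output
for label, mean in [("hold-out", 0.6024), ("LOO", 0.6185)]:
    print(label, round(half_std_from_mean(mean), 4))
```

The last two lines come out at 0.2447 and 0.2429, the same values printed after the "+/-" in the question. To see how much the estimate itself varies, you would have to repeat the evaluation over different splits instead.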


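The high-variance bit in the documentation makes me very leery of using these results. I'm completely new to this and was told the two approaches were "equivalent" and that I should get the same results, which is why I asked the question. But even if that were true, the fact that LOO can vary so much means they could still differ. I now see what you mean about the different subsets, so I don't know why I was told that. I'll have to go back to that person and get additional clarification. – bsoplinger 2014-11-04 16:16:27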