2014-11-03 48 views

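Need to better understand Python scikit-learn fit/predict loop vs. linear results

Here is my chunk of Python (2.7; I learned on Python 3, so I import print_function from __future__ to get the print formatting I'm used to). It uses learning code from scikit-learn, in a version I'm locked into by corporate IT policy, with the SVC engine. What I don't understand is why the results for the +/- 1 case differ between the first approach (using simple_clf) and the second. Structurally I believe they are identical: the first processes the entire data vector at once, and the second uses the data one piece at a time. Yet the results don't agree. The values generated for the average (mean) score should be decimal fractions (0.0 to 1.0). In some cases the difference is small, but in others it is large enough for me to ask my question.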

from __future__ import print_function 
import os 
import numpy as np 
from numpy import array, loadtxt 
from sklearn import cross_validation, datasets, svm, preprocessing, grid_search 
from sklearn.cross_validation import train_test_split 
from sklearn.metrics import precision_score 

GRADES = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'M'] 

# Initial processing
# (FEATUREVECFILE, SCORESFILE and test_type are set elsewhere in the script)
featurevecs = loadtxt(FEATUREVECFILE)
with open(SCORESFILE) as f:
    scorelines = f.readlines()[1:]  # Skip header line
scorenums = [ GRADES.index(l.split('\t')[ 1 ]) for l in scorelines ] 
scorenums = array(scorenums) 

# Need this step to normalize the feature vectors 
scaler = preprocessing.Scaler() 
scaler.fit(featurevecs) 
featurevecs = scaler.transform(featurevecs) 

# Break up the vector into a training and testing vector 
# Need to keep the training set somewhat large to get enough of the 
# scarce results in the training set or the learning fails 
X_train, X_test, y_train, y_test = train_test_split(
    featurevecs, scorenums, test_size = 0.333, random_state = 0) 

# Define a range of parameters we can use to do a grid search 
# for the 'best' ones. 
CLFPARAMS = {'gamma':[.0025, .005, 0.09, .01, 0.011, .02, .04], 
      'C':[200, 300, 400, 500, 600]} 

# do a simple cross validation 
simple_clf = svm.SVC() 
simple_clf = grid_search.GridSearchCV(simple_clf, CLFPARAMS, cv = 3) 
simple_clf.fit(X_train, y_train) 
y_true, y_pred = y_test, simple_clf.predict(X_test) 
match = 0 
close = 0 
count = 0 
deviation = [] 
for i in range(len(y_true)):
    count += 1
    delta = np.abs(y_true[i] - y_pred[i])
    if delta == 0:
        match += 1
    elif delta == 1:
        close += 1
    # 1.0 when the prediction is within one grade of the truth, else 0.0
    deviation = np.append(deviation, float(delta <= 1))
avg = float(match)/float(count) 
close_avg = float(close)/float(count) 
#deviation.mean() = avg + close_avg 
print('{0} Accuracy (+/- 0) {1:0.4f} Accuracy (+/- 1) {2:0.4f} (+/- {3:0.4f}) '.format(test_type, avg, deviation.mean(), deviation.std()/2.0,), end = "") 

# "Original" code 
# do LeaveOneOut item by item 
clf = svm.SVC() 
clf = grid_search.GridSearchCV(clf, CLFPARAMS, cv = 3) 
toleratePara = 1
thecurrentScoreGraded = [] 
loo = cross_validation.LeaveOneOut(n = len(featurevecs)) 
for train, test in loo:
    try:
        clf.fit(featurevecs[train], scorenums[train])
        rawPredictionResult = clf.predict(featurevecs[test])

        errorVec = scorenums[test] - rawPredictionResult
        print(len(errorVec), errorVec)
        # Fraction of this fold's predictions within toleratePara grades
        thecurrentScoreGraded = np.append(
            thecurrentScoreGraded,
            float(np.sum(np.abs(errorVec) <= toleratePara)) / len(errorVec))
    except ValueError:
        pass
print('{0} Accuracy (+/- {1:d}) {2:0.4f} (+/- {3:0.4f})'.format(test_type, toleratePara, thecurrentScoreGraded.mean(), thecurrentScoreGraded.std()/2)) 

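Here are my results, and you can see that they don't match. My actual work task is to see whether changing exactly which kinds of data are collected to feed the learning engine helps its accuracy, or whether combining the data into a larger teaching vector helps, so you can see I am working through a lot of combinations. Each pair of lines is for one type of learning data. The first line of each pair is my result, and the second line is the result from the "original" code.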

original Accuracy (+/- 0) 0.2771 Accuracy (+/- 1) 0.6024 (+/- 0.2447) 
         original Accuracy (+/- 1) 0.6185 (+/- 0.2429) 
upostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
         upostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398) 
npostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384) 
         npostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398) 
tancurv Accuracy (+/- 0) 0.2330 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
         tancurv Accuracy (+/- 1) 0.5831 (+/- 0.2465) 
npostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
         npostan Accuracy (+/- 1) 0.7003 (+/- 0.2291) 
nposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
         nposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453) 
upostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199) 
         upostan Accuracy (+/- 1) 0.7003 (+/- 0.2291) 
uposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466) 
         uposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453) 
upos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
         upos Accuracy (+/- 1) 0.6450 (+/- 0.2393) 
npos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293) 
         npos Accuracy (+/- 1) 0.6450 (+/- 0.2393) 
curv Accuracy (+/- 0) 0.1553 Accuracy (+/- 1) 0.4854 (+/- 0.2499) 
         curv Accuracy (+/- 1) 0.5570 (+/- 0.2484) 
tan Accuracy (+/- 0) 0.3107 Accuracy (+/- 1) 0.7184 (+/- 0.2249) 
         tan Accuracy (+/- 1) 0.7231 (+/- 0.2237) 
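
To make it easier to see which feature sets disagree most, the +/- 1 figures can be lined up side by side. The following is only a small sketch: the numbers are copied from the output above, while the `results` dictionary and the `gaps` list are names made up for this illustration and are not part of the original script.

```python
# +/- 1 accuracy per feature set, copied from the output above:
# (hold-out split score, leave-one-out score)
results = {
    "original":    (0.6024, 0.6185),
    "upostancurv": (0.6505, 0.6417),
    "npostancurv": (0.6505, 0.6417),
    "tancurv":     (0.5825, 0.5831),
    "npostan":     (0.7379, 0.7003),
    "nposcurv":    (0.5825, 0.5961),
    "upostan":     (0.7379, 0.7003),
    "uposcurv":    (0.5825, 0.5961),
    "upos":        (0.6990, 0.6450),
    "npos":        (0.6990, 0.6450),
    "curv":        (0.4854, 0.5570),
    "tan":         (0.7184, 0.7231),
}

# Signed gap (leave-one-out minus hold-out), largest magnitude first.
gaps = sorted(
    ((name, round(loo - holdout, 4)) for name, (holdout, loo) in results.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, gap in gaps:
    print("{0:12s} {1:+.4f}".format(name, gap))
```

The largest gap is for curv and the smallest for tancurv; the sign is not consistent across feature sets.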

Answers


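What do you mean by "structurally they are identical"? You are using different subsets for training and testing, and they have different sizes. If you don't use exactly the same training data, I don't see why you would expect the results to be the same.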
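
To illustrate the point about subsets, here is a minimal pure-Python sketch of which indices each scheme trains and tests on. It assumes a toy data set of 12 samples and a one-third hold-out, and it uses `random.Random(0)` as a stand-in for the shuffling; it does not reproduce scikit-learn's `train_test_split` or `LeaveOneOut`, and all variable names are hypothetical.

```python
import random

n_samples = 12                      # toy size, not the real data set
indices = list(range(n_samples))

# Hold-out split: shuffle once, keep one third for testing.
rng = random.Random(0)
shuffled = indices[:]
rng.shuffle(shuffled)
n_test = n_samples // 3
holdout_test = sorted(shuffled[:n_test])
holdout_train = sorted(shuffled[n_test:])

# Leave-one-out: every sample is the test set exactly once,
# and the model is refit on all the other samples each time.
loo_folds = [([j for j in indices if j != i], [i]) for i in indices]

print("hold-out: 1 fit, train on", len(holdout_train), "test on", len(holdout_test))
print("LOO:", len(loo_folds), "fits, each training on", len(loo_folds[0][0]))
```

In the hold-out case a single model trained on two thirds of the data is scored on the remaining third; in the leave-one-out case a separate model is fit for every sample, each trained on all but one sample, and every sample ends up being scored.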

By the way, also have a look at the note on LOO in the documentation. LOO can have high variance.
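
A related observation about the "+/-" numbers in the question, offered as a reading of the posted output and not as something the documentation note says: in both loops every per-sample score is either 0 or 1, so the printed `std()/2` is fixed by the mean alone. For 0/1 values with mean p, the population standard deviation (NumPy's default for `std`) is sqrt(p(1 - p)). The sketch below checks this against a few rows of the output; `bernoulli_half_std` is a hypothetical helper name.

```python
import math

def bernoulli_half_std(p):
    """Half the population standard deviation of 0/1 scores with mean p."""
    return math.sqrt(p * (1.0 - p)) / 2.0

# (mean, reported "+/-" value) pairs copied from the question's output.
reported = [(0.6024, 0.2447), (0.6185, 0.2429), (0.7379, 0.2199), (0.5570, 0.2484)]

for mean, spread in reported:
    print("{0:.4f} -> {1:.4f} (reported {2:.4f})".format(
        mean, bernoulli_half_std(mean), spread))
```

So the reported spread carries no information beyond the accuracy itself; it says nothing about how much the accuracy estimate would change with a different split.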


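The high-variance bit in the documentation makes me very uncomfortable about using those results. I'm totally new to this and was told that the two approaches are "equivalent" and that I should get the same results, which is why I asked the question. But even if that were true, the fact that LOO can be so variable means they could differ anyway. I see now what you mean about the different subsets, so I don't know why I was told that. I'll have to go back to that person and get additional clarification. – bsoplinger 2014-11-04 16:16:27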