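Need better understanding of Python scikit-learn fit/predict loop vs. linear results

Here is my chunk of Python (2.7; I learned Python 3, so I use the `print_function` import from `__future__` to get the print formatting I am used to). It uses learning code from scikit-learn, in a version I am locked into by corporate IT policy, and it uses the SVC engine. What I don't understand is why the results for the +/- 1 case differ between the first evaluation (the one that uses `simple_clf`) and the second. Structurally I believe they are identical, with the first processing the entire array of data at once and the second simply using the data one piece at a time. Yet the results do not agree. The values generated for the average (mean) score should be a decimal fraction (0.0 to 1.0). In some cases the difference is small, but in others it is large enough for me to ask this question.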
from __future__ import print_function
import os
import numpy as np
from numpy import array, loadtxt
from sklearn import cross_validation, datasets, svm, preprocessing, grid_search
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score
GRADES = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'M']
# Initial processing
featurevecs = loadtxt(FEATUREVECFILE)
f = open(SCORESFILE)
scorelines = f.readlines()[ 1: ] # Skip header line
f.close()
scorenums = [ GRADES.index(l.split('\t')[ 1 ]) for l in scorelines ]
scorenums = array(scorenums)
# Need this step to normalize the feature vectors
scaler = preprocessing.Scaler()
scaler.fit(featurevecs)
featurevecs = scaler.transform(featurevecs)
# Break up the vector into a training and testing vector
# Need to keep the training set somewhat large to get enough of the
# scarce results in the training set or the learning fails
X_train, X_test, y_train, y_test = train_test_split(
    featurevecs, scorenums, test_size=0.333, random_state=0)
# Define a range of parameters we can use to do a grid search
# for the 'best' ones.
CLFPARAMS = {'gamma': [.0025, .005, 0.09, .01, 0.011, .02, .04],
             'C': [200, 300, 400, 500, 600]}
# do a simple cross validation
simple_clf = svm.SVC()
simple_clf = grid_search.GridSearchCV(simple_clf, CLFPARAMS, cv = 3)
simple_clf.fit(X_train, y_train)
y_true, y_pred = y_test, simple_clf.predict(X_test)
match = 0
close = 0
count = 0
deviation = []
for i in range(len(y_true)):
    count += 1
    delta = np.abs(y_true[i] - y_pred[i])
    if delta == 0:
        match += 1
    elif delta == 1:
        close += 1
    # Append 1.0 if this prediction is within one grade of the truth, else 0.0
    deviation = np.append(deviation, float(np.sum(np.abs(delta) <= 1)))
avg = float(match)/float(count)
close_avg = float(close)/float(count)
#deviation.mean() = avg + close_avg
print('{0} Accuracy (+/- 0) {1:0.4f} Accuracy (+/- 1) {2:0.4f} (+/- {3:0.4f}) '.format(test_type, avg, deviation.mean(), deviation.std()/2.0,), end = "")
# "Original" code
# do LeaveOneOut item by item
clf = svm.SVC()
clf = grid_search.GridSearchCV(clf, CLFPARAMS, cv = 3)
toleratePara = 1
thecurrentScoreGraded = []
loo = cross_validation.LeaveOneOut(n=len(featurevecs))
for train, test in loo:
    try:
        clf.fit(featurevecs[train], scorenums[train])
        rawPredictionResult = clf.predict(featurevecs[test])
        errorVec = scorenums[test] - rawPredictionResult
        print(len(errorVec), errorVec)
        thecurrentScoreGraded = np.append(
            thecurrentScoreGraded,
            float(np.sum(np.abs(errorVec) <= toleratePara)) / len(errorVec))
    except ValueError:
        pass
print('{0} Accuracy (+/- {1:d}) {2:0.4f} (+/- {3:0.4f})'.format(test_type, toleratePara, thecurrentScoreGraded.mean(), thecurrentScoreGraded.std()/2))
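To make the structural difference between the two evaluations concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: `holdout_split`, `leave_one_out_splits` and `within_tolerance_accuracy` are hypothetical helper names, and the rounding of the test-set size is a simplification. It only illustrates that the first block scores one fitted model on a single held-out third of the data, while the second block refits on all samples but one and scores that one sample, repeated for every sample.

```python
import random


def holdout_split(n_samples, test_fraction, seed=0):
    """Shuffle the indices once and cut off a single test set (illustrative only)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def leave_one_out_splits(n_samples):
    """Yield (train, test) index lists; every sample is the test set exactly once."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


def within_tolerance_accuracy(y_true, y_pred, tolerance):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return float(hits) / len(y_true)


n = 9  # tiny hypothetical sample count, just to show the shapes
train, test = holdout_split(n, 0.333)
loo = list(leave_one_out_splits(n))

print(len(train), len(test))                     # one split: 6 train, 3 test
print(len(loo), len(loo[0][0]), len(loo[0][1]))  # 9 splits: 8 train, 1 test each
```

Both blocks in the post compute the same metric (the role played by `within_tolerance_accuracy` here), but they apply it to predictions from differently trained models on different sets of scored samples, so the two numbers are separate estimates rather than two computations of the same quantity.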
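Here are my results; you can see that they don't match. My actual work task is to see whether changing exactly what sort of data is gathered to feed the learning engine helps accuracy, or whether combining data into a bigger teaching vector helps, so I am working through a bunch of combinations. Each pair of lines is for one type of learning data. The first line is my result, the second is the result from the "original" code.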
original Accuracy (+/- 0) 0.2771 Accuracy (+/- 1) 0.6024 (+/- 0.2447)
original Accuracy (+/- 1) 0.6185 (+/- 0.2429)
upostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384)
upostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
npostancurv Accuracy (+/- 0) 0.2718 Accuracy (+/- 1) 0.6505 (+/- 0.2384)
npostancurv Accuracy (+/- 1) 0.6417 (+/- 0.2398)
tancurv Accuracy (+/- 0) 0.2330 Accuracy (+/- 1) 0.5825 (+/- 0.2466)
tancurv Accuracy (+/- 1) 0.5831 (+/- 0.2465)
npostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199)
npostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
nposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466)
nposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upostan Accuracy (+/- 0) 0.3398 Accuracy (+/- 1) 0.7379 (+/- 0.2199)
upostan Accuracy (+/- 1) 0.7003 (+/- 0.2291)
uposcurv Accuracy (+/- 0) 0.2621 Accuracy (+/- 1) 0.5825 (+/- 0.2466)
uposcurv Accuracy (+/- 1) 0.5961 (+/- 0.2453)
upos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293)
upos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
npos Accuracy (+/- 0) 0.3689 Accuracy (+/- 1) 0.6990 (+/- 0.2293)
npos Accuracy (+/- 1) 0.6450 (+/- 0.2393)
curv Accuracy (+/- 0) 0.1553 Accuracy (+/- 1) 0.4854 (+/- 0.2499)
curv Accuracy (+/- 1) 0.5570 (+/- 0.2484)
tan Accuracy (+/- 0) 0.3107 Accuracy (+/- 1) 0.7184 (+/- 0.2249)
tan Accuracy (+/- 1) 0.7231 (+/- 0.2237)
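One thing worth noting about the "(+/- ...)" figures above: both blocks compute them as half the standard deviation of a vector of 0.0/1.0 hit indicators. For such a vector with mean p, the population standard deviation (NumPy's default for `.std()`) is sqrt(p(1 - p)), so the printed spread follows from the printed accuracy alone and carries no extra information. The sketch below checks this against the first pair of result lines. The second helper, `binomial_standard_error`, is an addition of mine and not something the post computes, and the test-set size of 83 is an assumption inferred from 0.2771 ≈ 23/83 and 0.6024 ≈ 50/83; the post does not state it.

```python
import math


def reported_spread(p):
    """Half the population std of a 0/1 vector whose mean is p."""
    return math.sqrt(p * (1.0 - p)) / 2.0


def binomial_standard_error(p, n):
    """Rough sampling error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


holdout_spread = round(reported_spread(0.6024), 4)  # 0.2447, as printed
loo_spread = round(reported_spread(0.6185), 4)      # 0.2429, as printed
print(holdout_spread, loo_spread)

# ASSUMPTION: 83 test samples, inferred from the printed ratios, not stated in the post.
assumed_n_test = 83
print(binomial_standard_error(0.6024, assumed_n_test))
```

Under that assumption the standard error of the held-out accuracy comes out at roughly 0.05, which is larger than the 0.016 gap between 0.6024 and 0.6185, so a gap of this size can plausibly arise from the choice of test samples alone.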
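The high-variance bit in the documentation makes me very uncomfortable using these results. I'm totally new to this and was told that the two were "equivalent" and that I should get identical results, which is why I asked the question. But even if that were true, the fact that LOO can be so variable means they could differ. I see now what you mean about different subsets, so I'm not sure why I was told that. I'll have to go back to that person and get additional clarification. – bsoplinger 2014-11-04 16:16:27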