2017-07-17

How can I extract the best pipeline from a fitted GridSearchCV so that I can pass it to cross_val_predict?

Passing the fitted GridSearchCV object directly makes cross_val_predict run the entire grid search again. I only want the best pipeline to be evaluated by cross_val_predict.

My self-contained code is below:

from sklearn.datasets import fetch_20newsgroups 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.svm import SVC 
from sklearn.multiclass import OneVsRestClassifier 
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_predict 
from sklearn.model_selection import StratifiedKFold 
from sklearn import metrics 

# fetch data
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), categories=['comp.graphics', 'rec.sport.baseball', 'sci.med']) 
X = newsgroups.data 
y = newsgroups.target 

# setup and run GridSearchCV 
wordvect = TfidfVectorizer(analyzer='word', lowercase=True) 
classifier = OneVsRestClassifier(SVC(kernel='linear', class_weight='balanced')) 
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)]) 
scoring = 'f1_weighted' 
parameters = { 
      'vect__min_df': [1, 2], 
      'vect__max_df': [0.8, 0.9], 
      'classifier__estimator__C': [0.1, 1, 10] 
      } 
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=8, scoring=scoring, verbose=1) 
gs_clf = gs_clf.fit(X, y) 

### outputs: Fitting 3 folds for each of 12 candidates, totalling 36 fits 

# manually extract the best models from the grid search to re-build the pipeline 
best_clf = gs_clf.best_estimator_.named_steps['classifier'] 
best_vectorizer = gs_clf.best_estimator_.named_steps['vect'] 
best_pipeline = Pipeline([('best_vectorizer', best_vectorizer), ('classifier', best_clf)]) 

# passing gs_clf here would run the grid search again inside cross_val_predict
y_predicted = cross_val_predict(best_pipeline, X, y)
print(metrics.classification_report(y, y_predicted, digits=3)) 

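What I currently do is rebuild the pipeline manually from best_estimator_. But my pipelines usually have more steps, such as SVD or PCA, and I sometimes add or remove steps and re-run the grid search to explore the data. The manual rebuild then has to be repeated every time, which is error-prone.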

Is there a way to extract the best pipeline directly from the fitted GridSearchCV so that I can pass it to cross_val_predict?
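
The extra cost described in the question can be illustrated by counting fits. The following is only a sketch under stated assumptions: CountingSVC, the synthetic data, and the two-value grid are hypothetical stand-ins that are not part of the original code, and n_jobs is left at its default so the class-level counter stays in one process.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class CountingSVC(SVC):
    """Hypothetical helper: an SVC that counts how often fit() is called."""
    fit_calls = 0  # shared by all clones because it lives on the class

    def fit(self, X, y, sample_weight=None):
        CountingSVC.fit_calls += 1
        return super().fit(X, y, sample_weight=sample_weight)


# Synthetic stand-in for the newsgroups data (assumption, keeps the sketch offline).
X_cnt, y_cnt = make_classification(n_samples=150, n_features=10, random_state=0)

cnt_pipeline = Pipeline([("scale", StandardScaler()),
                         ("classifier", CountingSVC(kernel="linear"))])
cnt_search = GridSearchCV(cnt_pipeline, {"classifier__C": [0.1, 1]}, cv=3)

# Passing the search object: every outer fold runs a full grid search,
# i.e. (2 candidates x 3 inner folds + 1 refit) x 3 outer folds.
CountingSVC.fit_calls = 0
cross_val_predict(cnt_search, X_cnt, y_cnt, cv=3)
nested_fits = CountingSVC.fit_calls

# Fitting once and passing only the best pipeline: one fit per outer fold.
cnt_search.fit(X_cnt, y_cnt)
CountingSVC.fit_calls = 0
cross_val_predict(cnt_search.best_estimator_, X_cnt, y_cnt, cv=3)
single_fits = CountingSVC.fit_calls

print(nested_fits, single_fits)  # 21 and 3 for this grid and cv=3
```

The two approaches answer different questions: the nested run re-selects hyperparameters inside each outer fold, while the second run only evaluates one fixed configuration.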

Answer

y_predicted = cross_val_predict(gs_clf.best_estimator_, X, y) 

This works and returns:

Fitting 3 folds for each of 12 candidates, totalling 36 fits 
[Parallel(n_jobs=4)]: Done 36 out of 36 | elapsed: 43.6s finished 
             precision    recall  f1-score   support

          0      0.920     0.911     0.916       584
          1      0.894     0.943     0.918       597
          2      0.929     0.887     0.908       594

avg / total      0.914     0.914     0.914      1775
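
Since the question mentions pipelines with extra steps such as PCA, here is a minimal sketch of the same idea on a three-step pipeline. The synthetic data, the step names, and the grid are assumptions chosen for illustration; they do not come from the original post. It relies on the default refit=True, under which best_estimator_ is the whole pipeline with the winning parameters, so no step has to be pulled out by name.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the text data (assumption, keeps the sketch offline).
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=5, random_state=0)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("classifier", SVC(kernel="linear")),
])
demo_grid = {"pca__n_components": [5, 10], "classifier__C": [0.1, 1, 10]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3, scoring="f1_weighted")
demo_search.fit(X_demo, y_demo)

# best_estimator_ is a complete Pipeline, whatever steps it contains.
best = demo_search.best_estimator_
print(type(best).__name__, list(best.named_steps))

# cross_val_predict clones and refits only this one pipeline per fold.
y_demo_pred = cross_val_predict(best, X_demo, y_demo, cv=3)
```

Adding or removing a step in demo_pipeline needs no change to the last two statements, which is the point of avoiding the manual rebuild.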

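[Edit] When I ran the code again and simply passed pipeline (the original pipeline), it returned the same output as passing best_pipeline did. So it is possible that you can just use the Pipeline itself, but I am not 100% sure about that.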
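
One way to check the uncertainty in the edit above is to compare parameters. GridSearchCV fits clones of the estimator it is given, so the original pipeline object keeps whatever settings it was created with. It would therefore match the best pipeline only if those settings happen to give the same predictions. The sketch below uses hypothetical toy data and a grid that deliberately excludes the default C=1.0, and it also shows an unfitted equivalent built with clone and best_params_.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data (assumption); the grid leaves out the default C=1.0 on purpose.
X_toy, y_toy = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, random_state=0)
toy_pipeline = Pipeline([("scale", StandardScaler()),
                         ("classifier", SVC(kernel="linear"))])
toy_search = GridSearchCV(toy_pipeline, {"classifier__C": [0.01, 10]}, cv=3)
toy_search.fit(X_toy, y_toy)

# The object handed to GridSearchCV is untouched by the search.
original_C = toy_pipeline.get_params()["classifier__C"]
best_C = toy_search.best_params_["classifier__C"]
print(original_C, best_C)

# An unfitted pipeline with the winning parameters, built without naming any step.
rebuilt = clone(toy_pipeline).set_params(**toy_search.best_params_)
pred_rebuilt = cross_val_predict(rebuilt, X_toy, y_toy, cv=3)
pred_best = cross_val_predict(toy_search.best_estimator_, X_toy, y_toy, cv=3)
print((pred_rebuilt == pred_best).mean())  # fraction of identical predictions
```

Passing best_estimator_ is the shorter route. The clone-plus-set_params form is an alternative when the search was run with refit=False, in which case best_estimator_ is not available.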