How to load a previously saved model and extend it with new training data using scikit-learn

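I am using scikit-learn and have saved a logistic regression model trained on training set 1, with unigrams as features. Is it possible to load this model and then extend it with new data instances from a second training set (training set 2)? If so, how can this be done? The reason is that I use a different approach for each training set (the first involves feature corruption/regularization, the second involves self-training).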

I have added some simple example code for clarity:

from sklearn.linear_model import LogisticRegression as log 
from sklearn.feature_extraction.text import CountVectorizer as cv 
import pickle 

# The following are assumed to be defined before this point:
# trainText1   Training set 1 text instances
# trainLabel1  Training set 1 labels
# trainText2   Training set 2 text instances
# trainLabel2  Training set 2 labels

clf = log() 
# Count vectorizer used by the logistic regression classifier 
vec = cv() 

# Fit count vectorizer with training text data from training set 1 
vec.fit(trainText1) 

# Transforms text into vectors for training set1 
train1Text1 = vec.transform(trainText1) 

# Fitting training set1 to the linear logistic regression classifier 
clf.fit(train1Text1, trainLabel1)  # fit on the vectorized text, not the raw strings

# Saving logistic regression model from training set 1 
with open('modelFromTrainingSet1', 'wb') as modelFileSave:
    pickle.dump(clf, modelFileSave)

# Loading logistic regression model from training set 1
with open('modelFromTrainingSet1', 'rb') as modelFileLoad:
    clf = pickle.load(modelFileLoad)

# I'm unsure how to continue from here....

Source

2014-11-03 sentimentMining

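LogisticRegression uses the liblinear solver internally, which does not support incremental fitting. Instead, you can use SGDClassifier(loss='log'), which has a partial_fit method that can be used for this. Its other hyperparameters are different, so take care to grid-search their optimal values. See the SGDClassifier documentation for the meaning of those hyperparameters.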
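A minimal sketch of this suggestion, under stated assumptions: texts1, labels1, texts2 and labels2 are made-up toy data standing in for the two training sets, and the model is round-tripped through pickle in memory rather than through a file. Recent scikit-learn releases name the logistic loss 'log_loss' instead of 'log', so the sketch picks the name from the installed version. It also pickles the vectorizer alongside the classifier, since set 2 has to be transformed with the same vocabulary.

```python
import pickle

import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data standing in for training sets 1 and 2.
texts1 = ["good movie", "bad movie", "great plot", "awful plot"]
labels1 = [1, 0, 1, 0]
texts2 = ["good plot", "bad acting", "great movie", "awful movie"]
labels2 = [1, 0, 1, 0]

# The logistic loss is "log_loss" in scikit-learn >= 1.1 and "log" before that.
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
loss = "log_loss" if (major, minor) >= (1, 1) else "log"

# Fit the vectorizer on set 1 only.
vec = CountVectorizer()
X1 = vec.fit_transform(texts1)

clf = SGDClassifier(loss=loss, random_state=0)
# All classes must be declared on the first partial_fit call.
clf.partial_fit(X1, labels1, classes=np.array([0, 1]))

# Save and reload both the vectorizer and the classifier.
blob = pickle.dumps((vec, clf))
vec_loaded, clf_loaded = pickle.loads(blob)

coef_before = clf_loaded.coef_.copy()

# Reuse the vectorizer fitted on set 1: transform set 2, do not refit.
X2 = vec_loaded.transform(texts2)
clf_loaded.partial_fit(X2, labels2)
coef_after = clf_loaded.coef_
```

Each partial_fit call performs a single pass over the data it is given, so in practice several passes (or several mini-batches) are usually needed before the weights settle.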

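CountVectorizer does not support incremental fitting. You will have to reuse the vectorizer fitted on training set #1 to transform set #2. This means that any token in set #2 that did not already appear in set #1 will be completely ignored. That might not be what you expect.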
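The effect can be seen in a small sketch, assuming two made-up lists of texts (texts1 and texts2 are hypothetical). The vocabulary is frozen when the vectorizer is fitted on set 1, so words that appear only in set 2 contribute nothing to the transformed matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts standing in for training sets 1 and 2.
texts1 = ["good movie", "bad movie"]
texts2 = ["good acting", "terrible soundtrack"]

vec = CountVectorizer()
vec.fit(texts1)  # the vocabulary is fixed here: bad, good, movie

# Transform set 2 with the vocabulary learned from set 1.
X2 = vec.transform(texts2)
vocabulary = sorted(vec.vocabulary_)

# Row sums count how many tokens of each document survived the transform.
kept_per_doc = X2.toarray().sum(axis=1).tolist()
# kept_per_doc is [1, 0]: only "good" was known, the second document is empty.
```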

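To mitigate this, you can use a HashingVectorizer, which is stateless, at the cost of no longer being able to tell what the features mean. Read the documentation for more details.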
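A rough sketch of the stateless variant, again with hypothetical toy data. The n_features value is an arbitrary choice for illustration, and the loss name is picked from the installed scikit-learn version because recent releases call it 'log_loss' rather than 'log'. Since HashingVectorizer has no fit step and stores no vocabulary, words first seen in set 2 still map to a feature column.

```python
import numpy as np
import sklearn
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data standing in for training sets 1 and 2.
texts1 = ["good movie", "bad movie", "great plot", "awful plot"]
labels1 = [1, 0, 1, 0]
texts2 = ["good acting", "bad acting", "great soundtrack", "awful soundtrack"]
labels2 = [1, 0, 1, 0]

# Stateless: nothing is learned from the data, so there is nothing to refit.
# n_features is an assumed value; collisions become likelier as it shrinks.
vec = HashingVectorizer(n_features=2**18)

# The logistic loss is "log_loss" in scikit-learn >= 1.1 and "log" before that.
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
loss = "log_loss" if (major, minor) >= (1, 1) else "log"

clf = SGDClassifier(loss=loss, random_state=0)
# First batch: declare all classes up front.
clf.partial_fit(vec.transform(texts1), labels1, classes=np.array([0, 1]))
# Second batch: tokens unseen in set 1 ("acting", "soundtrack") are still hashed.
X2 = vec.transform(texts2)
clf.partial_fit(X2, labels2)

# A word that never occurred in set 1 still produces a non-zero feature.
unseen = vec.transform(["acting"])
```

The trade-off mentioned above shows up here as well: there is no vocabulary_ mapping to inspect, so a column index cannot be traced back to the word that produced it.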

Source

2014-11-03 14:24:41 ogrisel
