You can access your vectoriser's vocabulary_ attribute directly, and you can get at the idf_ vector through _tfidf._idf_diag, so a monkey-patch along these lines would be possible:
import re
import numpy as np
from scipy.sparse import dia_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

def partial_fit(self, X):
    max_idx = max(self.vocabulary_.values())
    for a in X:
        # update vocabulary_
        if self.lowercase:
            a = a.lower()
        tokens = re.findall(self.token_pattern, a)
        for w in tokens:
            if w not in self.vocabulary_:
                max_idx += 1
                self.vocabulary_[w] = max_idx

        # update idf_: recover the document frequencies from the stored
        # idf_, add the new document's counts, then recompute the idf
        df = (self.n_docs + self.smooth_idf) / np.exp(self.idf_ - 1) - self.smooth_idf
        self.n_docs += 1
        df.resize(len(self.vocabulary_))  # new vocabulary entries start at df = 0
        for w in tokens:
            df[self.vocabulary_[w]] += 1
        idf = np.log((self.n_docs + self.smooth_idf) / (df + self.smooth_idf)) + 1
        self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))

TfidfVectorizer.partial_fit = partial_fit
articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']
vec = TfidfVectorizer()
vec.fit(articleList)
vec.n_docs = len(articleList)
vec.partial_fit(['the last text I wanted to add'])
vec.transform(['the last text I wanted to add']).toarray()
# array([[ 0. , 0. , 0. , 0. , 0. ,
# 0. , 0. , 0. , 0. , 0. ,
# 0. , 0. , 0.27448674, 0. , 0.43003652,
# 0.43003652, 0.43003652, 0.43003652, 0.43003652]])
Thanks for taking the time to answer. I'm trying to use this as a search index, generating a list of relevant results with cosine_similarity. It would be great not to have to refit my whole corpus every time I want to add a new document. –
Hey Howard, I worked out how to update 'idf_', check out my edited answer – maxymoo
Awesome! Thanks for the response! –
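For the search-index use case mentioned in the comments, here is a minimal sketch of ranking documents against a query with cosine_similarity. It uses the plain (non-patched) sklearn API; the corpus and query strings are just illustrative assumptions:

```python
# Rank a small corpus by cosine similarity to a query string.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['here is some text blah blah',
          'another text object',
          'more foo for your bar right now']

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(corpus)           # (n_docs, n_terms) sparse matrix

query_vec = vec.transform(['some text object'])  # reuse the fitted vocabulary
scores = cosine_similarity(query_vec, doc_matrix).ravel()

# Document indices sorted from most to least similar to the query
ranking = scores.argsort()[::-1]
print(ranking)
```

With a patched partial_fit in place, you could append new documents to doc_matrix with vec.transform(...) and scipy.sparse.vstack instead of refitting the whole corpus.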