You can access your vectoriser's vocabulary_ attribute directly, and you can get at the idf_ vector through _tfidf._idf_diag, so a monkey-patch along these lines would be possible:
import re
import numpy as np
from scipy.sparse import dia_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

def partial_fit(self, X):
    max_idx = max(self.vocabulary_.values())
    for a in X:
        # update vocabulary_
        if self.lowercase:
            a = a.lower()
        tokens = re.findall(self.token_pattern, a)
        for w in tokens:
            if w not in self.vocabulary_:
                max_idx += 1
                self.vocabulary_[w] = max_idx

        # update idf_: recover the document frequencies from the stored
        # idf_, add the new document's counts, then recompute the idf
        df = (self.n_docs + self.smooth_idf) / np.exp(self.idf_ - 1) - self.smooth_idf
        self.n_docs += 1
        df.resize(len(self.vocabulary_))  # new vocabulary entries start at df = 0
        for w in tokens:
            df[self.vocabulary_[w]] += 1
        idf = np.log((self.n_docs + self.smooth_idf) / (df + self.smooth_idf)) + 1
        self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))

TfidfVectorizer.partial_fit = partial_fit
articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']
vec = TfidfVectorizer()
vec.fit(articleList)
vec.n_docs = len(articleList)
vec.partial_fit(['the last text I wanted to add'])
vec.transform(['the last text I wanted to add']).toarray()
# array([[ 0. , 0. , 0. , 0. , 0. ,
# 0. , 0. , 0. , 0. , 0. ,
# 0. , 0. , 0.27448674, 0. , 0.43003652,
# 0.43003652, 0.43003652, 0.43003652, 0.43003652]])
Thanks for taking the time to answer. I'm trying to use this as a search index, generating a list of relevant results with cosine_similarity. It would be great not to have to refit my whole corpus every time I want to add a new document. –
Hey Howard, I worked out how to update 'idf_', check out my edited answer – maxymoo
Awesome! Thanks for the response! –
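For the search-index use case mentioned in the comments, here is a minimal sketch of ranking documents against a query with cosine_similarity. It uses the plain (non-patched) sklearn API; the corpus and query strings are just illustrative assumptions:

```python
# Rank a small corpus by cosine similarity to a query string.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['here is some text blah blah',
          'another text object',
          'more foo for your bar right now']

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(corpus)           # (n_docs, n_terms) sparse matrix

query_vec = vec.transform(['some text object'])  # reuse the fitted vocabulary
scores = cosine_similarity(query_vec, doc_matrix).ravel()

# Document indices sorted from most to least similar to the query
ranking = scores.argsort()[::-1]
print(ranking)
```

With a patched partial_fit in place, you could append new documents to doc_matrix with vec.transform(...) and scipy.sparse.vstack instead of refitting the whole corpus.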