2017-10-16 138 views

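Python: comparing items within two tfidf matrices of different dimensions

I want to use TfidfVectorizer() on a file containing many lines, where each line is a phrase. I then want to take a test file containing a small subset of phrases, run TfidfVectorizer() on it, and compute the cosine similarity between the original file and the test file, so that for a given phrase in the test file I can retrieve the top N matches from the original file. Here is my attempt: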

corpus = tuple(open("original.txt").read().split('\n')) 
test = tuple(open("test.txt").read().split('\n')) 


from sklearn.feature_extraction.text import TfidfVectorizer 

tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english') 
tfidf_matrix = tf.fit_transform(corpus) 
tfidf_matrix2 = tf.fit_transform(test) 

from sklearn.metrics.pairwise import linear_kernel 


def new_find_similar(tfidf_matrix2, index, tfidf_matrix, top_n = 5): 
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten() 
    related_docs_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index] 
    return [(index, cosine_similarities[index]) for index in related_docs_indices][0:top_n] 


for index, score in find_similar(tfidf_matrix, 1234567): 
     print score, corpus[index] 

But I get:

for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix): 
     print score, test[index] 
Traceback (most recent call last): 

    File "<ipython-input-53-2bf1cd465991>", line 1, in <module> 
    for index, score in new_find_similar(tfidf_matrix2, 1000, tfidf_matrix): 

    File "<ipython-input-51-da874b8d3076>", line 2, in new_find_similar 
    cosine_similarities = linear_kernel(tfidf_matrix2[index:index+1], tfidf_matrix).flatten() 

    File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 734, in linear_kernel 
    X, Y = check_pairwise_arrays(X, Y) 

    File "C:\Users\arron\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\metrics\pairwise.py", line 122, in check_pairwise_arrays 
    X.shape[1], Y.shape[1])) 

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 66662 while Y.shape[1] == 3332088 

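I wouldn't mind combining the two files and then transforming, but I want to be sure that I don't compare any phrase from the test file with other phrases in the test file.

Any pointers?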

Answers


Fit the TfidfVectorizer on the corpus data, then transform the test data with the already fitted vectorizer (i.e., don't call fit_transform twice):

tfidf_matrix = tf.fit_transform(corpus) 
tfidf_matrix2 = tf.transform(test) 
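
To show how this fits into the top-N lookup from the question, here is a minimal sketch. It assumes Python 3 and a recent scikit-learn. The two small in-memory lists are hypothetical stand-ins for original.txt and test.txt, the names corpus_matrix and test_matrix are illustrative, and min_df is left at its default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical in-memory stand-ins for original.txt and test.txt.
corpus = [
    "red apple pie",
    "green apple juice",
    "fresh orange juice",
    "chocolate cake recipe",
]
test = ["apple juice", "cake recipe"]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), stop_words="english")
corpus_matrix = tf.fit_transform(corpus)  # learns the vocabulary and IDF weights
test_matrix = tf.transform(test)          # reuses them, so the column counts match


def find_similar(test_matrix, index, corpus_matrix, top_n=5):
    """Return (corpus_index, score) pairs for the corpus rows closest to one test row."""
    # Rows are L2-normalised by default, so the dot product is the cosine similarity.
    scores = linear_kernel(test_matrix[index:index + 1], corpus_matrix).flatten()
    best = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in best]


for index, score in find_similar(test_matrix, 0, corpus_matrix, top_n=2):
    print(score, corpus[index])
```

Because each test row is compared only against corpus_matrix, test phrases are never compared with each other. The returned indices refer to rows of the corpus, so the matching text is looked up in corpus rather than test. The `i != index` filter from the question is left out here, since it only makes sense when a document is compared against its own matrix.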

Excellent, many thanks. – brucezepplin