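Cosine similarity with TF-IDF

There are several questions on SO and around the web describing how to take the cosine similarity between two strings, including between two strings weighted with TF-IDF. But the output of a function like scikit's linear_kernel confuses me a little.

Consider the following code: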

import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 

a = ['hello world', 'my name is', 'what is your name?'] 
b = ['my name is', 'hello world', 'my name is what?'] 

df = pd.DataFrame(data={'a':a, 'b':b}) 
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1) 
print(df.head()) 

                    a                 b                                   ab
0         hello world        my name is               hello world my name is
1          my name is       hello world               my name is hello world
2  what is your name?  my name is what?  what is your name? my name is what?

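Question: I would like a column that holds the cosine similarity between the string in a and the string in b.

What I tried:

I fitted a TF-IDF vectorizer on ab, so that it includes all the words: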

clf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english') 
clf.fit(df['ab'])

Then I got the sparse TF-IDF matrices for both the a and b columns:

tfidf_a = clf.transform(df['a']) 
tfidf_b = clf.transform(df['b'])

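Now, if I use scikit's linear_kernel, which is what others recommended, I get back a Gram matrix of shape (n_samples_a, n_samples_b), one entry for every pair of rows, as described in their documentation.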

from sklearn.metrics.pairwise import linear_kernel 
linear_kernel(tfidf_a,tfidf_b) 

array([[ 0., 1., 0.], 
     [ 0., 0., 0.], 
     [ 0., 0., 0.]])

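But what I need is a simple vector, where the first element is the cos_sim between the first row of a and the first row of b, the second element is cos_sim(a[1], b[1]), and so on.

Using python3, scikit-learn 0.17.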
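One possible way to get that vector, sketched here under the assumption that a and b have the same number of rows: the Gram matrix holds every pairwise dot product, so the wanted values sit on its diagonal. TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the dot product of two rows is already their cosine similarity. The names gram and row_sims are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

# Same setup as in the question: fit on the row-wise concatenation of a and b.
clf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
clf.fit([x + ' ' + y for x, y in zip(a, b)])
tfidf_a = clf.transform(a)
tfidf_b = clf.transform(b)

# gram[i, j] is the dot product of row i of a with row j of b.
gram = linear_kernel(tfidf_a, tfidf_b)

# The diagonal keeps only the matching pairs (a[i], b[i]).
row_sims = np.diag(gram)
print(gram.shape, row_sims)
```

With the question's stop_words='english' setting the diagonal comes out as all zeros, which matches the matrix printed above. Computing the full Gram matrix only to keep its diagonal is wasteful for large frames, so this is mainly useful for seeing how the two outputs relate.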

2016-04-21 David

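I think your example is falling down a little because your TfidfVectorizer filters out most of your words: you set stop_words='english', and nearly every word in your example is a stop word. I've removed that parameter and made your matrices dense so we can see what's happening. What if you did something like this?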

import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 
from scipy import spatial 

a = ['hello world', 'my name is', 'what is your name?'] 
b = ['my name is', 'hello world', 'my name is what?'] 

df = pd.DataFrame(data={'a':a, 'b':b}) 
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1) 

clf = TfidfVectorizer(ngram_range=(1, 1)) 
clf.fit(df['ab']) 

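# Use plain ndarrays: scipy's cosine() expects 1-D vectors, and rows of the
# np.matrix returned by .todense() stay 2-D.
tfidf_a = clf.transform(df['a']).toarray()
tfidf_b = clf.transform(df['b']).toarray()

# 1 - cosine distance = cosine similarity, computed row by row.
row_similarities = [1 - spatial.distance.cosine(tfidf_a[x], tfidf_b[x]) for x in range(len(tfidf_a))]
row_similarities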

[0.0, 0.0, 0.72252389079716417]

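This gives you the cosine similarity (1 minus the cosine distance) between each row of a and the matching row of b. I'm not fully on board with how you're building the full corpus, but the example isn't optimized at all, so I'll leave that for now. Hope this helps.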
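For larger frames it may be preferable to skip the dense conversion. A minimal sketch, assuming the default norm='l2' of TfidfVectorizer (so each row is unit length and a row-wise dot product equals the cosine similarity): multiply the two sparse matrices element-wise and sum each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

clf = TfidfVectorizer(ngram_range=(1, 1))
clf.fit([x + ' ' + y for x, y in zip(a, b)])

# Both results stay sparse; nothing is densified.
tfidf_a = clf.transform(a)
tfidf_b = clf.transform(b)

# Element-wise product followed by a row sum is the row-wise dot product.
row_similarities = np.asarray(tfidf_a.multiply(tfidf_b).sum(axis=1)).ravel()
print(row_similarities)
```

If the vectorizer were created with norm=None, the rows would no longer be unit length and each dot product would have to be divided by the two row norms.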

2016-04-23 16:14:29 flyingmeatball

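Thanks, this works. Why are you not on board with how I build the full corpus? – David

Because there are generally better ways than .apply for this type of task. There are 6 strings here, 3 rows in two columns: are there two separate documents (a and b), or are there 3 documents (one per row)? This matters for computing the frequencies in TF-IDF, and I'm not sure the way you've built ab reflects your intention. – flyingmeatball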
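To make the point about the corpus concrete, here is a small sketch comparing two ways of fitting the vectorizer: one document per row (a and b joined, as in the question) versus every string as its own document. The helper idf_of and the names per_row and per_string are made up for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

# Option 1: 3 documents, each the concatenation of a[i] and b[i].
per_row = TfidfVectorizer().fit([x + ' ' + y for x, y in zip(a, b)])

# Option 2: 6 documents, every string counted separately.
per_string = TfidfVectorizer().fit(a + b)

def idf_of(vectorizer, word):
    # Look up the learned inverse document frequency of a single word.
    return vectorizer.idf_[vectorizer.vocabulary_[word]]

for word in ['hello', 'name', 'what']:
    print(word, round(idf_of(per_row, word), 3), round(idf_of(per_string, word), 3))
```

The vocabulary is the same in both cases, but the document frequencies, and therefore the idf weights, are not, so the resulting similarities can differ depending on which corpus definition is intended.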

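import operator
from math import log10, sqrt

# NOTE: mytokenizer, stemmer, sortedstopwords, getidf, getqvec and gettfidfvec
# are used below but are not defined in the posted snippet.

dfs = {}                # token -> number of documents containing it
idfs = {}
speeches = {}
speechvecs = {}         # filename -> normalized TF-IDF vector
total_word_counts = {}  # token -> total occurrences across all documents

def tokenize(doc):
    # Tokenize, lowercase, drop stop words, then stem what remains.
    tokens = mytokenizer.tokenize(doc)
    lowertokens = [token.lower() for token in tokens]
    filteredtokens = [stemmer.stem(token) for token in lowertokens if token not in sortedstopwords]
    return filteredtokens

def incdfs(tfvec):
    # Update document frequencies and total counts from one document's term counts.
    for token in set(tfvec):
        if token not in dfs:
            dfs[token] = 1
            total_word_counts[token] = tfvec[token]
        else:
            dfs[token] += 1
            total_word_counts[token] += tfvec[token]

def calctfidfvec(tfvec, withidf):
    # Log-scaled term frequency, optionally multiplied by idf.
    tfidfvec = {}
    veclen = 0.0

    for token in tfvec:
        if withidf:
            tfidf = (1 + log10(tfvec[token])) * getidf(token)
        else:
            tfidf = 1 + log10(tfvec[token])
        tfidfvec[token] = tfidf
        veclen += pow(tfidf, 2)

    # Normalize to unit length so that a dot product equals the cosine.
    if veclen > 0:
        for token in tfvec:
            tfidfvec[token] /= sqrt(veclen)

    return tfidfvec

def cosinesim(vec1, vec2):
    # Dot product over the terms shared by both (already normalized) vectors.
    commonterms = set(vec1).intersection(vec2)
    sim = 0.0
    for token in commonterms:
        sim += vec1[token] * vec2[token]

    return sim

def query(qstring):
    # Return the filename whose vector is most similar to the query.
    qvec = getqvec(qstring.lower())
    scores = {filename: cosinesim(qvec, tfidfvec) for filename, tfidfvec in speechvecs.items()}
    return max(scores.items(), key=operator.itemgetter(1))[0]

def docdocsim(filename1, filename2):
    return cosinesim(gettfidfvec(filename1), gettfidfvec(filename2))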
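Since the snippet above relies on helpers that are not shown, here is a minimal self-contained sketch of the same idea applied to the question's data: log-scaled term frequency times idf, normalized to unit length, with cosine similarity as a dot product over shared terms. The regex tokenizer, the idf formula log10(N / df), and the names tfidf_vectors, vecs and row_sims are assumptions made for illustration; they are not the answerer's missing code.

```python
import re
from collections import Counter
from math import log10, sqrt

def tokenize(doc):
    # Assumption: lowercase word tokens, no stemming or stop-word removal.
    return re.findall(r"\w+", doc.lower())

def tfidf_vectors(docs):
    # Term counts per document and document frequency per term.
    tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter(tok for tf in tfs for tok in tf)
    n = len(docs)
    vecs = []
    for tf in tfs:
        # (1 + log10(tf)) * idf, with idf assumed to be log10(N / df).
        vec = {t: (1 + log10(c)) * log10(n / df[t]) for t, c in tf.items()}
        length = sqrt(sum(w * w for w in vec.values()))
        if length > 0:
            vec = {t: w / length for t, w in vec.items()}
        vecs.append(vec)
    return vecs

def cosinesim(vec1, vec2):
    # Vectors are unit length, so the dot product over shared terms is the cosine.
    return sum(vec1[t] * vec2[t] for t in set(vec1) & set(vec2))

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

# Assumption: every string is its own document (6 documents in total).
vecs = tfidf_vectors(a + b)
row_sims = [cosinesim(vecs[i], vecs[i + len(a)]) for i in range(len(a))]
print(row_sims)
```

Only the third pair shares any words, so it is the only non-zero entry. The exact value differs from the scikit-learn result because the weighting formulas are not the same.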

2016-10-20 02:39:00

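While this code may solve the problem, it doesn't explain why or how it answers the question. Please [include an explanation for your code](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers), as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. –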
