Python: How to calculate the cosine similarity of two word lists?

I want to calculate the cosine similarity of two lists like the following:

A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory'] 

B = [u'home (private)', u'school', u'bank', u'shopping mall']

I know the cosine similarity of A and B should be

3/(sqrt(7)*sqrt(4)).

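I tried to transform the lists into a form like 'home bank bank building factory', which looks like a sentence. However, some elements (e.g. home (private)) contain whitespace themselves, and some contain parentheses, so I find it difficult to count word occurrences.

Do you know how to count word occurrences in such a complicated list, so that for list B the occurrences can be represented as

{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?

Or do you know how to calculate the cosine similarity of these two lists?

Thank you very much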

2015-03-02 gladys0313

How do you define 'cosine similarity'? Where does 3/(sqrt(7)*sqrt(4)) come from? – ZdaR 2015-03-02 21:10:26

I only know one way to define cosine similarity: for example, with A = [2,1,1,1,0,0] and B = [1,1,0,0,1,1], the cosine similarity is 3/(sqrt(7)*sqrt(4)) – gladys0313 2015-03-03 06:04:24
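For readers wondering where that number comes from, here is the arithmetic written out. It assumes the standard definition of cosine similarity (dot product divided by the product of the vector lengths) and the count vectors A = [2,1,1,1,0,0] and B = [1,1,0,0,1,1]:

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{2\cdot 1 + 1\cdot 1 + 1\cdot 0 + 1\cdot 0 + 0\cdot 1 + 0\cdot 1}
         {\sqrt{2^2 + 1^2 + 1^2 + 1^2}\;\sqrt{1^2 + 1^2 + 1^2 + 1^2}}
  = \frac{3}{\sqrt{7}\,\sqrt{4}}
  \approx 0.567
```

Only the words that appear in both lists ('home (private)' and 'bank') contribute to the numerator.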

from collections import Counter 

# word-lists to compare 
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory'] 
b = [u'home (private)', u'school', u'bank', u'shopping mall'] 

# count word occurrences 
a_vals = Counter(a) 
b_vals = Counter(b) 

# convert to word-vectors 
words = list(a_vals.keys() | b_vals.keys())  # set union, so the word order is arbitrary
a_vect = [a_vals.get(word, 0) for word in words]  # e.g. [0, 0, 1, 1, 2, 1]
b_vect = [b_vals.get(word, 0) for word in words]  # e.g. [1, 1, 1, 0, 1, 0]

# find cosine 
len_a = sum(av*av for av in a_vect) ** 0.5    # sqrt(7) 
len_b = sum(bv*bv for bv in b_vect) ** 0.5    # sqrt(4) 
dot = sum(av*bv for av,bv in zip(a_vect, b_vect)) # 3 
cosine = dot/(len_a * len_b)       # 0.5669467
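
If you need this more than once, the same steps could be wrapped in a function. This is only a sketch: the name cosine_similarity and the choice to return 0.0 for an empty list are my own assumptions, not part of the answer above. It works on the Counter objects directly, so no explicit vectors are built, and it uses set(...) & set(...), which should behave the same on Python 2 and 3.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(list_a, list_b):
    """Cosine similarity of two word lists, treating each list as a bag of words."""
    a_vals = Counter(list_a)
    b_vals = Counter(list_b)

    # Only words present in both lists contribute to the dot product.
    shared = set(a_vals) & set(b_vals)
    dot = sum(a_vals[word] * b_vals[word] for word in shared)

    # Vector lengths come straight from the counts.
    len_a = sqrt(sum(count * count for count in a_vals.values()))
    len_b = sqrt(sum(count * count for count in b_vals.values()))

    # Assumption: an empty list has no direction, so return 0.0
    # instead of dividing by zero.
    if len_a == 0 or len_b == 0:
        return 0.0
    return dot / (len_a * len_b)


a = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
b = ['home (private)', 'school', 'bank', 'shopping mall']
similarity = cosine_similarity(a, b)
print(similarity)  # 3 / (sqrt(7) * sqrt(4)), roughly 0.5669
```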

2015-03-02 21:22:12

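Thank you very much for your answer. It looks cool, but at words = list(a_vals.keys() | b_vals.keys()) the interpreter says 'TypeError: unsupported operand type(s) for |: 'list' and 'list''. Any idea? – gladys0313 2015-03-03 06:21:01

Sorry, I tested it in Python 3.4. For 2.x, you would do 'words = list(set(a_vals) | set(b_vals))'. – 2015-03-03 12:31:07

Ah, thank you very much – gladys0313 2015-03-04 15:57:56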

First, build a dictionary (this is the technical term for the set of all distinct words in a collection or corpus).

vocab = {} 
i = 0 

# loop through each list, find distinct words and map them to a 
# unique number starting at zero 

for word in A:
    if word not in vocab:
        vocab[word] = i
        i += 1

for word in B:
    if word not in vocab:
        vocab[word] = i
        i += 1

The vocab dictionary now maps each word to a unique number starting at zero. We will use these numbers as indices into arrays (or vectors).
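
To make the mapping concrete, here is a small self-contained sketch that builds vocab for the two lists from the question and prints it. It uses len(vocab) as the next free index instead of a separate counter, which is just a shorthand I am assuming is acceptable here; the resulting numbers are the same.

```python
A = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
B = ['home (private)', 'school', 'bank', 'shopping mall']

vocab = {}
for word in A + B:
    if word not in vocab:
        vocab[word] = len(vocab)  # next unused index, starting at zero

# Print the words in index order.
for word, index in sorted(vocab.items(), key=lambda item: item[1]):
    print(index, word)

# 0 home (private)
# 1 bank
# 2 building(condo/apartment)
# 3 factory
# 4 school
# 5 shopping mall
```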

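In the next step, we create what is called a term-frequency vector for each input list. We will use a library called numpy here, which is a very popular way to do this kind of scientific computation. If you are interested in cosine similarity (or other machine learning techniques), it is worth your time.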

import numpy as np 

# create a numpy array (vector) for each input, filled with zeros 
a = np.zeros(len(vocab)) 
b = np.zeros(len(vocab)) 

# loop through each input and create a corresponding vector for it 
# this vector counts occurrences of each word in the dictionary 

for word in A: 
    index = vocab[word] # get index from dictionary 
    a[index] += 1 # increment count for that index 

for word in B: 
    index = vocab[word] 
    b[index] += 1

The final step is to actually compute the cosine similarity.

# use numpy's dot product to calculate the cosine similarity 
sim = np.dot(a, b)/np.sqrt(np.dot(a, a) * np.dot(b, b))

The variable sim now contains your answer. You can pull out each of these sub-expressions and verify that they match your original formula.
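
As a quick check of that claim, the sketch below repeats the steps in one self-contained script and names each sub-expression. The variable names dot_ab, norm_a_sq and norm_b_sq are mine, chosen only for illustration.

```python
import numpy as np

A = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
B = ['home (private)', 'school', 'bank', 'shopping mall']

# Map each distinct word to an index, in order of first appearance.
vocab = {}
for word in A + B:
    if word not in vocab:
        vocab[word] = len(vocab)

# Term-frequency vectors for both lists.
a = np.zeros(len(vocab))
b = np.zeros(len(vocab))
for word in A:
    a[vocab[word]] += 1
for word in B:
    b[vocab[word]] += 1

dot_ab = np.dot(a, b)      # numerator: 3.0
norm_a_sq = np.dot(a, a)   # squared length of a: 7.0
norm_b_sq = np.dot(b, b)   # squared length of b: 4.0
sim = dot_ab / np.sqrt(norm_a_sq * norm_b_sq)

print(dot_ab, norm_a_sq, norm_b_sq, sim)
```

So sim equals 3/(sqrt(7)*sqrt(4)), roughly 0.5669, which matches the value given in the question.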

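With a little refactoring, this technique scales quite well (to a relatively large number of input lists with a relatively large number of distinct words). For really large corpora (such as Wikipedia), you should look at natural language processing libraries made for this kind of thing. Here are some good ones.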
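
One possible shape for that refactoring, as a rough sketch only: put all the term-frequency vectors into one matrix and compute every pairwise similarity at once. The function name pairwise_cosine, the third example list, and the handling of empty lists are assumptions of mine. It uses a dense matrix, so it is meant for moderate sizes, not Wikipedia-scale data.

```python
import numpy as np


def pairwise_cosine(word_lists):
    """Return an (n, n) array of cosine similarities between n word lists."""
    # Shared vocabulary across all lists.
    vocab = {}
    for words in word_lists:
        for word in words:
            if word not in vocab:
                vocab[word] = len(vocab)

    # One row per list, one column per distinct word.
    counts = np.zeros((len(word_lists), len(vocab)))
    for row, words in enumerate(word_lists):
        for word in words:
            counts[row, vocab[word]] += 1

    # Scale each row to unit length; then dot products are cosines.
    norms = np.sqrt((counts * counts).sum(axis=1))
    norms[norms == 0] = 1.0  # assumption: empty lists get similarity 0, not NaN
    unit = counts / norms[:, np.newaxis]
    return np.dot(unit, unit.T)


A = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
B = ['home (private)', 'school', 'bank', 'shopping mall']
C = ['school', 'school']  # extra list, made up for this example

sims = pairwise_cosine([A, B, C])
print(sims[0, 1])  # similarity of A and B
print(sims[1, 2])  # similarity of B and C
```

Entry sims[i, j] is the similarity between list i and list j, so the matrix is symmetric, with ones on the diagonal for non-empty lists.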

2015-11-03 16:29:51
