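Finding the similarity between two sentences using word2vec with Python
I want to use word2vec to compute the similarity between two sentences. I am trying to get the vectors of the words in each sentence so that I can average them into a sentence vector and then compute the cosine similarity. I tried the code below, but it does not work: the sentence vectors it outputs contain only 1s. I want the actual sentence vectors in sentence_1_avg_vector and sentence_2_avg_vector.
Code: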
import gensim
import numpy as np
from scipy import spatial

# Dataset
sent1 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market', 'india'], ['What', 'story', 'Kohinoor', 'KohiNoor', 'Diamond']]
sent2 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market'], ['What', 'would', 'happen', 'Indian', 'government', 'stole', 'Kohinoor', 'KohiNoor', 'diamond', 'back']]
sentences = sent1 + sent2

# Applying Word2vec
word2vec_model = gensim.models.Word2Vec(sentences, size=100, min_count=5)
bin_file = "vecmodel.csv"
word2vec_model.wv.save_word2vec_format(bin_file, binary=False)

# Making sentence vectors
def avg_feature_vector(words, model, num_features, index2word_set):
    # function to average all word vectors in a given paragraph
    featureVec = np.ones((num_features,), dtype="float32")
    # print(featureVec)
    nwords = 0
    # list containing names of words in the vocabulary
    index2word_set = set(model.wv.index2word)  # this is moved as input param for performance reasons
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1
            featureVec = np.add(featureVec, model[word])
    print(featureVec)
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

i = 0
while i < len(sent1):
    sentence_1_avg_vector = avg_feature_vector(sent1[i], model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word))
    print(sentence_1_avg_vector)
    sentence_2_avg_vector = avg_feature_vector(sent2[i], model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word))
    print(sentence_2_avg_vector)
    sen1_sen2_similarity = 1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)
    print(sen1_sen2_similarity)
    i += 1
The output this code gives:
[ 1. 1. .... 1. 1.]
[ 1. 1. .... 1. 1.]
0.999999898245
[ 1. 1. .... 1. 1.]
[ 1. 1. .... 1. 1.]
0.999999898245
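One possible reading of this output: in the posted function, featureVec starts as a vector of ones, and it is returned unchanged whenever no word of the sentence is found in the vocabulary (nwords stays 0). Two identical all-ones vectors then have a cosine similarity of approximately 1. The sketch below illustrates the averaging step starting from zeros and returning None when no word is known. It is an illustration under assumptions, not the asker's method: average_vector, cosine_similarity, and toy_vectors are hypothetical names, and a plain dict of made-up 3-dimensional vectors stands in for a trained word2vec lookup.

```python
import math

def average_vector(words, vectors, num_features):
    """Average the vectors of the words that are present in `vectors`.

    Returns None when no word is found, so that "no known words" cannot be
    mistaken for a real sentence vector.
    """
    total = [0.0] * num_features  # accumulate from zeros, not ones
    n_found = 0
    for word in words:
        if word in vectors:
            total = [t + float(v) for t, v in zip(total, vectors[word])]
            n_found += 1
    if n_found == 0:
        return None
    return [t / n_found for t in total]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (assumption): stand-ins for trained embeddings.
toy_vectors = {
    "invest": [1.0, 0.0, 0.0],
    "share": [0.0, 1.0, 0.0],
    "market": [0.0, 0.0, 1.0],
}

v1 = average_vector(["invest", "share", "market"], toy_vectors, 3)
v2 = average_vector(["invest", "share", "unknown"], toy_vectors, 3)
v3 = average_vector(["unknown"], toy_vectors, 3)  # no known word

if v1 is not None and v2 is not None:
    print(cosine_similarity(v1, v2))
print(v3)
```

Note that num_features has to match the dimensionality of the stored vectors. In the posted code the model is trained with size=100 while the function is called with num_features=300, so that mismatch may also be worth checking.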
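Do you want to compute the vector representation of your sentences by looking up and averaging precomputed word2vec vectors, or do you want to compute the vectors from scratch? Your code looks like you are attempting the latter... but I don't think you can learn any useful embeddings from two sentences. People usually use millions of words. – Tobias
Maybe this will help. – alvas
These are not actually just two sentences. My dataset contains 8 lakh+ (800,000+) rows of sentences. For convenience, I have included only some sample data here to convey the idea... –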
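On the sample data specifically: gensim's min_count parameter drops every word whose total frequency is below the threshold, so with only the four posted sentences and min_count=5 no word would survive. The sketch below mimics just that frequency filter with the standard library; it does not train anything, and surviving_vocabulary is a hypothetical helper name used only for illustration.

```python
from collections import Counter

sent1 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market', 'india'],
         ['What', 'story', 'Kohinoor', 'KohiNoor', 'Diamond']]
sent2 = [['What', 'step', 'step', 'guide', 'invest', 'share', 'market'],
         ['What', 'would', 'happen', 'Indian', 'government', 'stole',
          'Kohinoor', 'KohiNoor', 'diamond', 'back']]
sentences = sent1 + sent2

def surviving_vocabulary(sentences, min_count):
    """Return the words whose total frequency is at least `min_count`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

vocab_5 = surviving_vocabulary(sentences, 5)  # threshold used in the post
vocab_2 = surviving_vocabulary(sentences, 2)  # a lower threshold, for comparison

print(sorted(vocab_5))
print(sorted(vocab_2))
```

Tokens are compared case-sensitively here, so 'Diamond' and 'diamond' (or 'Kohinoor' and 'KohiNoor') count as different words. On the full 800,000+ row dataset the counts would of course be very different, so this only describes the snippet as posted.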