Predicting the middle word with word2vec

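I have the predict_output_word method from the official GitHub repository. It only accepts word2vec models trained with skip-gram, and it tries to predict the middle word by summing the vectors of all the input words and dividing that sum by the number of input words. It then takes the output and applies a softmax (exponentiating the scores and dividing by their sum) to get a probability for each candidate word, and returns the most probable ones. Is there a better way to approach this so that I get better words? This gives very bad results for shorter sentences. The code below is the code from GitHub.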

import warnings

from numpy import dot, exp, sum as np_sum
from gensim import matutils


def predict_output_word(model, context_words_list, topn=10):
    """Report the probability distribution of the center word given the context words as input to the trained model."""
    if not model.negative:
        raise RuntimeError("We have currently only implemented predict_output_word "
                           "for the negative sampling scheme, so you need to have "
                           "run word2vec with negative > 0 for this to work.")

    if not hasattr(model.wv, 'syn0') or not hasattr(model, 'syn1neg'):
        raise RuntimeError("Parameters required for predicting the output words not found.")

    # Keep only the context words that are in the model's vocabulary
    word_vocabs = [model.wv.vocab[w] for w in context_words_list if w in model.wv.vocab]
    if not word_vocabs:
        warnings.warn("All the input context words are out-of-vocabulary for the current model.")
        return None

    word2_indices = [word.index for word in word_vocabs]

    # Sum the input vectors of all in-vocabulary context words
    l1 = np_sum(model.wv.syn0[word2_indices], axis=0)

    # Turn the sum into a mean if the model was configured with cbow_mean
    if word2_indices and model.cbow_mean:
        l1 /= len(word2_indices)

    # Propagate hidden -> output and take softmax to get probabilities
    prob_values = exp(dot(l1, model.syn1neg.T))
    prob_values /= np_sum(prob_values)
    top_indices = matutils.argsort(prob_values, topn=topn, reverse=True)

    # Return the most probable output words with their probabilities
    return [(model.wv.index2word[index1], prob_values[index1]) for index1 in top_indices]
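
To make the computation above concrete, here is a minimal pure-Python sketch of the same steps on a toy example: average the context words' input vectors, take the dot product with each word's output weights, and apply a softmax. The vocabulary and the `syn0` / `syn1neg` values are made up and only stand in for a trained model, and `predict_middle` is a hypothetical name. Subtracting the maximum score before exponentiating is an added precaution against overflow that the code above does not take.

```python
import math

# Hypothetical toy model: 4 words, 2-dimensional vectors (all values are made up).
vocab = ["cat", "sat", "on", "mat"]
syn0 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]      # input (word) vectors
syn1neg = [[0.2, 0.1], [0.9, 0.8], [0.1, 0.3], [0.4, 0.4]]   # output-layer weights


def predict_middle(context, topn=2):
    """Average the context vectors, score every vocabulary word, softmax, return top-N."""
    indices = [vocab.index(w) for w in context if w in vocab]
    if not indices:
        return None  # every context word is out of vocabulary
    dim = len(syn0[0])
    # Mean of the context words' input vectors (the cbow_mean case).
    l1 = [sum(syn0[i][d] for i in indices) / len(indices) for d in range(dim)]
    # One score per vocabulary word: dot product with that word's output weights.
    scores = [sum(a * b for a, b in zip(l1, row)) for row in syn1neg]
    # Softmax; subtracting the max avoids overflow in exp() without changing the result.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]


print(predict_middle(["cat", "on"]))
```

With only one or two context words, `l1` is dominated by those few vectors, which may be part of why short inputs give unstable predictions.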


2017-07-14 devc

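Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. We cannot effectively help you until you post your MCVE code and accurately describe the problem. We should be able to paste your posted code into a text file and reproduce the problem you described. In particular, provide a small data set that gives you trouble. Without one, it is not clear whether the problem is the algorithm, the intensity of training, or a lack of reliable data. – Prune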

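The word2vec algorithm trains word vectors by trying to predict words, and those word vectors can then be useful for other purposes. But it is unlikely to be the ideal algorithm if word prediction is your real goal.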

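Most word2vec implementations don't even offer a specific interface for individual word predictions. In gensim, predict_output_word() was only added recently. It only works for some modes. It doesn't treat the window quite the same way as training does: there is no effective weighting by distance. It is also fairly expensive, since it essentially checks the model's prediction for every word in the vocabulary and then reports the top N. (The 'prediction' that happens during training is 'sparse' and much more efficient: it runs just enough of the model to nudge it to do better on a single example.)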
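
The missing weighting by distance can be illustrated with a small sketch. During training, word2vec implementations typically shrink the window by a random amount for each target word, so with a maximum window of `w` a context word at distance `d` ends up being used with probability `(w - d + 1) / w`. The helpers below, `distance_weights` and `distance_weighted_mean`, are hypothetical and not part of gensim. They only show how one could apply those weights when combining context vectors before the softmax step, and the vectors are made up.

```python
def distance_weights(distances, window):
    """Expected inclusion probability of a context word at each distance
    when the effective window is drawn uniformly from 1..window."""
    return [max(window - d + 1, 0) / window for d in distances]


def distance_weighted_mean(vectors, distances, window):
    """Weighted average of context vectors, with nearer words counting more."""
    weights = distance_weights(distances, window)
    total = sum(weights)
    if total == 0:
        return None  # every context word lies outside the window
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors)) / total
        for i in range(dim)
    ]


# Made-up 2-d vectors for context words at distances 1, 2 and 5 from the gap.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(distance_weights([1, 2, 5], window=5))
print(distance_weighted_mean(vectors, [1, 2, 5], window=5))
```

Whether this actually improves predictions on a given model is an open question; it only brings the inference-time combination closer to what the model saw in training.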

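If word prediction is your real goal, you may get better results from other methods, including simply computing a big lookup table of how often each word appears near other words or near other n-grams.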
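
As a rough illustration of the lookup-table idea, the sketch below counts which words occur between each (left neighbour, right neighbour) pair in a tiny made-up corpus and predicts the middle word from those counts. The function names `build_table` and `predict_middle_word` are hypothetical, and a real table would need a far larger corpus and some fallback for unseen pairs.

```python
from collections import Counter, defaultdict


def build_table(sentences):
    """Map each (left word, right word) pair to counts of the words seen between them."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i in range(1, len(tokens) - 1):
            table[(tokens[i - 1], tokens[i + 1])][tokens[i]] += 1
    return table


def predict_middle_word(table, left, right, topn=3):
    """Return the most frequent middle words for a left/right pair, or [] if unseen."""
    if (left, right) not in table:
        return []
    return table[(left, right)].most_common(topn)


# Tiny made-up corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat lay on the mat".split(),
]
table = build_table(corpus)
print(predict_middle_word(table, "sat", "the"))
print(predict_middle_word(table, "cat", "on"))
```

Unlike the word2vec approach, this uses exact neighbour matches, so it tends to work well on frequent patterns and returns nothing for pairs it has never seen.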


2017-07-14 16:30:05 gojomo
