Gensim Doc2Vec model only generates a limited number of vectors

I am using the gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (the comments in the code explain what my problem is):

import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()

# creating a list of tagged documents 
training_docs = [] 

# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences) 
for index, doc in enumerate(all_docs): 
    # 'doc' is in unicode format and I have already preprocessed it 
    training_docs.append(TaggedDocument(doc.split(), str(index+1))) 

# at this point, I have 53 strings in my 'training_docs' list 

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores) 

# now that I print the vectors, I only have 10 vectors while I should have 53 vectors for the 53 documents that I have in my training_docs list. 
print(len(model.docvecs)) 
# output: 10

Am I making a mistake, or is there some other parameter I should set?

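UPDATE: I experimented with the tags parameter of TaggedDocument. When I changed it to a mix of text and numbers, such as Doc1, Doc2, ..., the number of generated vectors changed, but it still did not match the number of feature vectors I expected.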

Asked 2017-08-02 by Pedram

Look at the actual tags it has discovered in your corpus:

print(model.docvecs.offset2doctag)

Do you see a pattern?

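The tags property of each document should be a list of tags, not a single tag. If you supply a plain string holding an integer, it is treated as a list of its characters, so the model only learns the tags '0', '1', ..., '9'. That is why exactly 10 vectors appear.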
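The effect can be reproduced without gensim. The following is a minimal sketch, assuming the tags are consumed by iterating over whatever was passed as `tags` (which is what the behavior described above implies). The name `tags_seen` is made up for illustration.

```python
# Simulate what happens when each document's tags value is a bare string
# such as "1", "2", ..., "53" instead of a list like ["1"].
tags_seen = set()

for index in range(53):
    tags = str(index + 1)      # a bare string, as in the question
    for tag in tags:           # iterating a string yields single characters
        tags_seen.add(tag)

print(sorted(tags_seen))
print(len(tags_seen))
```

Every number from 1 to 53 is built from the ten digit characters, so only ten distinct tags are ever seen, no matter how many documents there are.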

You can replace str(index+1) with [str(index+1)] and get the behavior you expect.

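However, since your document IDs are just ascending integers, you can also use plain Python ints as your doctags. This saves some memory by avoiding the creation of a lookup dict from string tag to array slot (int). To do this, replace str(index+1) with [index]. (This starts the doc IDs from 0, which is slightly more Pythonic and also avoids wasting an unused slot 0 in the raw array that holds the trained vectors.)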
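Both corrections can be sketched as below. To keep the snippet runnable without gensim, `TaggedDocument` here is a stand-in namedtuple with the same two fields (`words`, `tags`); in real code it would be imported from `gensim.models.doc2vec`. The corpus `all_docs` and the helper `unique_tags` are hypothetical, added only for illustration.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: same fields, words and tags).
TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical corpus of 53 short documents.
all_docs = ["placeholder text number %d" % i for i in range(53)]

# Option 1: keep string tags, but wrap each one in a list.
docs_str_tags = [TaggedDocument(doc.split(), [str(index + 1)])
                 for index, doc in enumerate(all_docs)]

# Option 2: use plain int tags starting at 0.
docs_int_tags = [TaggedDocument(doc.split(), [index])
                 for index, doc in enumerate(all_docs)]

def unique_tags(docs):
    """Collect every distinct tag by iterating over each document's tags."""
    return {tag for doc in docs for tag in doc.tags}

print(len(unique_tags(docs_str_tags)))
print(len(unique_tags(docs_int_tags)))
```

With either option, each document contributes exactly one distinct tag, so the number of tags matches the number of documents.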

Answered 2017-08-03 01:29:10 by gojomo
