How do I use TaggedDocument in gensim?

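I have two directories whose text files I want to read and label, but I don't know how to do this with TaggedDocument. I thought it would work as TaggedDocument([strings], [labels]), but that apparently doesn't work. This is my code: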

from gensim import models 
from gensim.models.doc2vec import TaggedDocument 
import utilities as util 
import os 
from sklearn import svm 
from nltk.tokenize import sent_tokenize 
CogPath = "./FixedCog/" 
NotCogPath = "./FixedNotCog/" 
SamplePath ="./Sample/" 
docs = [] 
tags = [] 
CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')] 
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')] 
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')] 
for doc in CogList: 
    str = open(CogPath+doc,'r').read().decode("utf-8") 
    docs.append(str) 
    print docs 
    tags.append(doc) 
    print "###########" 
    print tags 
    print "!!!!!!!!!!!" 
for doc in NotCogList: 
    str = open(NotCogPath+doc,'r').read().decode("utf-8") 
    docs.append(str) 
    tags.append(doc) 
for doc in SampleList: 
    str = open(SamplePath + doc, 'r').read().decode("utf-8") 
    docs.append(str) 
    tags.append(doc) 

T = TaggedDocument(docs,tags) 

model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)

And this is the error I get:

Traceback (most recent call last): 
    File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module> 
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50) 
    File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__ 
    self.build_vocab(documents, trim_rule=trim_rule) 
    File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab 
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey 
    File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab 
    if isinstance(document.words, string_types): 
AttributeError: 'list' object has no attribute 'words'


2017-07-16 Farhood

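Separate from your main question: a final 'min_alpha' equal to the starting 'alpha' means your training isn't doing proper stochastic gradient descent. Also, 'min_count=1' is rarely helpful in Word2Vec/Doc2Vec training; keeping such rare words tends to make training take longer and to interfere with the quality of the remaining word-vecs/doc-vecs. – gojomo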

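About 'min_alpha': I copied it from some sample code, which was followed by this code: 'for epoch in range(10): model.train(docs) model.alpha -= 0.002 # decrease the learning rate model.min_alpha = model.alpha # fix the learning rate, no decay'. About 'min_count': my dataset is very limited, and some words are not very frequent but carry a lot of weight in the meaning. I have also filtered out most stop words and frequent everyday words. – Farhood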

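That's a bad sample. If you pass your corpus to Doc2Vec when creating the instance, it does all the training passes automatically and manages the learning rate from 'alpha' down to 'min_alpha' on its own, so you should not call 'train()' yourself. (If you do, as you are doing, without any other specifics, the latest gensim versions will throw an error, because this is such a common mistake.) Calling 'train()' yourself, or more than once, is a rare expert option, and most users shouldn't change the default 'alpha'/'min_alpha' either. – gojomo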
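To make the learning-rate point concrete, here is a simplified sketch of a linear decay from 'alpha' to 'min_alpha'. It is an illustration only, not gensim's actual implementation: the helper name 'linear_alpha_schedule' and the once-per-epoch granularity are assumptions made for the example.

```python
def linear_alpha_schedule(alpha, min_alpha, epochs):
    """Return one learning rate per epoch, decaying linearly from alpha to min_alpha.

    Hypothetical helper for illustration; gensim adjusts the rate internally
    and more finely than once per epoch.
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * i for i in range(epochs)]


# With min_alpha equal to alpha (as in the question), the rate never decays.
flat = linear_alpha_schedule(0.025, 0.025, 5)

# With a smaller min_alpha, the rate shrinks steadily across epochs.
decaying = linear_alpha_schedule(0.025, 0.0001, 5)

print(flat)
print(decaying)
```

The first schedule stays constant for every epoch, which is the situation described in the first comment; the second one falls steadily toward the smaller final value.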

So, I experimented a bit and found this on GitHub:

class TaggedDocument(namedtuple('TaggedDocument', 'words tags')): 
    """ 
    A single document, made up of `words` (a list of unicode string tokens) 
    and `tags` (a list of tokens). Tags may be one or more unicode string 
    tokens, but typical practice (which will also be most memory-efficient) is 
    for the tags list to include a unique integer id as the only tag. 

    Replaces "sentence as a list of words" from Word2Vec.

So I decided to change the way I use TaggedDocument by creating one TaggedDocument per document. The important point is that you must pass the tags as a list.

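for doc in CogList: 
    # Read the file and split it into a list of word tokens 
    text = open(CogPath+doc,'r').read().decode("utf-8") 
    words = text.split() 
    # Tags must be a list; here the file name is the only tag 
    T = TaggedDocument(words,[doc]) 
    docs.append(T)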
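The difference between the two approaches can be reproduced without gensim. The sketch below uses a stand-in namedtuple with the same 'words' and 'tags' fields as the class quoted above, plus a hypothetical 'collect_words' function that, like the 'scan_vocab' frame in the traceback, reads '.words' from every item of the corpus. The sample texts and file names are made up.

```python
from collections import namedtuple

# Stand-in with the same fields as the gensim class quoted above (not gensim itself).
TaggedDocument = namedtuple('TaggedDocument', 'words tags')


def collect_words(corpus):
    """Hypothetical consumer: reads .words from each item of the corpus."""
    vocab = set()
    for document in corpus:
        vocab.update(document.words)
    return vocab


texts = ["the cat sat", "dogs bark loudly"]  # made-up document texts
names = ["a.txt", "b.txt"]                   # made-up file names used as tags

# Original approach: one TaggedDocument wrapping ALL texts and ALL tags.
# Iterating over it yields its two fields (plain lists), which have no .words.
wrong = TaggedDocument(texts, names)
try:
    collect_words(wrong)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Fixed approach: one TaggedDocument per document, tags passed as a list.
corpus = [TaggedDocument(text.split(), [name]) for text, name in zip(texts, names)]
vocab = collect_words(corpus)

print(error_message)
print(sorted(vocab))
```

The first call fails with the same kind of AttributeError as in the question, because the corpus items are plain lists; the second succeeds because every item exposes both 'words' and 'tags'.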


2017-07-16 07:21:18 Farhood

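Yes: 'Doc2Vec' expects the corpus to be an iterable collection in which each individual item (document) is shaped like a 'TaggedDocument'. (That is, it has a 'words' list and a 'tags' list.) – gojomo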
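Putting this together, a corpus of that shape could be built in Python 3 roughly as follows. This is a sketch under assumptions: the helper name 'read_tagged_docs' is hypothetical, whitespace splitting stands in for real tokenization, and the stand-in namedtuple is used only when gensim is not installed.

```python
import os
import tempfile
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback stand-in with the same two fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple('TaggedDocument', 'words tags')


def read_tagged_docs(directory):
    """Hypothetical helper: one TaggedDocument per .txt file, tagged with its file name."""
    corpus = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(directory, name), encoding='utf-8') as handle:
            words = handle.read().split()  # naive whitespace tokenization
        corpus.append(TaggedDocument(words, [name]))  # tags must be a list
    return corpus


# Small demo on a temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.txt'), 'w', encoding='utf-8') as f:
        f.write('hello world')
    with open(os.path.join(tmp, 'notes.md'), 'w', encoding='utf-8') as f:
        f.write('not a txt file, so it is skipped')
    corpus = read_tagged_docs(tmp)

print(corpus)
```

The lists returned for each directory can be concatenated and passed to 'Doc2Vec' as the corpus. Note that recent gensim versions name the dimensionality parameter 'vector_size' rather than 'size', so check the documentation for the version you have installed.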
