Using Mahout to train LDA and retrieve its topics

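I've been trying out Apache Mahout. There is plenty of information on how to generate topic models with LDA, but very little on how to do the same thing with the new CVB LDA algorithm. What I want to do is generate topic probabilities similar to those from the original ldatopic.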

Any information or examples of how to do this would be greatly appreciated!

Thanks!

UPDATE:

OK, I've figured out a fair bit of this, but it's still incomplete, so any help would be great!

Source

2012-07-25 toofarsideways

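OK, so I still don't know how to output the topics, but I have figured out how to run cvb and get what I think are the document vectors. I haven't had any luck dumping them, though, so help here would still be appreciated!

Oh, and don't forget to set the following values on the master, otherwise none of this will work: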

export MAHOUT_HOME=/home/sgeadmin/mahout 
export HADOOP_HOME=/usr/lib/hadoop 
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk 
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
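
Since a missing variable only shows up later as a confusing failure, a small check before running anything might help. This is just a sketch: check_mahout_env is a hypothetical helper name, not something Mahout or Hadoop provides, and it only tests that the four variables above are non-empty, not that the paths exist.

```shell
# Sketch: report any of the four variables above that is unset or empty.
# check_mahout_env is a hypothetical helper, not part of Mahout or Hadoop.
check_mahout_env() {
    missing=0
    for name in MAHOUT_HOME HADOOP_HOME JAVA_HOME HADOOP_CONF_DIR; do
        # Indirect lookup via eval keeps this usable in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Returns 0 when everything is set, 1 otherwise.
    return "$missing"
}
```

Calling check_mahout_env on the master before the first mahout command would print the names of any variables that still need to be exported.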


So first upload the files using starcluster's put (obviously, skip this if you aren't using starcluster :)):

starcluster put mycluster text_train /home/sgeadmin/ 
starcluster put mycluster text_test /home/sgeadmin/

Then we need to add them to Hadoop's file system, HDFS (don't forget -hadoop starcluster):

dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster

Then call Mahout's seqdirectory to turn the text into sequence files:

$MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow

Then call Mahout's seq2sparse to turn them into vectors:

$MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

Finally call cvb. I believe the -dt flag says where the inferred topics should go, but since I haven't been able to dump them yet, I can't confirm this.

$MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states

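The -k flag is the number of topics, and the -nt flag is the size of the dictionary, which you can compute by counting the number of entries in dictionary.file-0 inside the vectors directory (in this case under /user/sgeadmin/text_vec). -x is the number of iterations.

If anyone knows how to get the document-topic probabilities from here, help would be much appreciated!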
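
One possible way to get the number for -nt is to dump the dictionary to text with mahout seqdumper and count the records. The sketch below only does the counting step, and it assumes that the dump prints each record on its own line starting with "Key: ". count_dict_entries and dict_dump.txt are hypothetical names used for illustration.

```shell
# Count dictionary records in a text dump produced by `mahout seqdumper`.
# Assumption: every record is printed on its own line starting with "Key: ".
# count_dict_entries is a hypothetical helper, not a Mahout command.
count_dict_entries() {
    grep -c '^Key: ' "$1"
}

# Possible usage on the cluster (not run here; dict_dump.txt is a placeholder,
# and seqdumper flag names differ between Mahout versions, so check
# `mahout seqdumper --help` first):
#   $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/dictionary.file-0 -o dict_dump.txt
#   count_dict_entries dict_dump.txt
```

The pattern is anchored on "Key: " with the colon so that a header line such as "Key class: ..." is not counted as an entry.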

Source

2012-07-28 22:29:59 toofarsideways

After completing the above process, you can obtain an output of the computed topics using another Mahout utility, LDAPrintTopics.java, by passing the following options:

--dict (-d) dict                        Dictionary to read in, in the same format as one created by org.apache.mahout.utils.vectors.lucene.Driver
--output (-o) output                    Output directory to write top words
--words (-w) words                      Number of words to print
--input (-i) input                      Path to an LDA output (a state)
--dictionaryType (-dt) dictionaryType   The dictionary file type (text|sequencefile)

Source

2013-02-01 13:16:43 gangireddy

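The document-topic distributions are stored in sequence file format under the directory you specified with -dt or --doc_topic_output when you ran mahout cvb. In your case, this directory would be /user/sgeadmin/text_cvb_document.

To dump the contents of these sequence files to a text file, you can use the mahout vectordump utility, like this:

mahout vectordump -i /path/to/doc_topic_seq_input -o /path/to/doc_topic_text_out -p true -c csv

Where: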

-i Path to input directory containing document-topic distribution in sequence file format. 
-o Path to output file that will contain your document-topic distribution in text format. 
-p Key values will be displayed if this parameter is used. 
-c Output the Vector as CSV, otherwise it substitutes in the terms for vector cell entries
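
Once the distributions are in a text file, picking the most probable topic for each document can be sketched with a little awk. This assumes each dumped line looks like "doc_key<TAB>p0,p1,...", which is my reading of what -p true -c csv produces; check your own output first. top_topic_per_doc is a hypothetical helper name, not a Mahout command.

```shell
# Print "doc_key<TAB>best_topic<TAB>probability" for each dumped document.
# Assumption: each input line is "doc_key<TAB>p0,p1,...,pK-1".
# Lines starting with "#" or without a tab-separated vector are skipped.
# top_topic_per_doc is a hypothetical helper, not a Mahout command.
top_topic_per_doc() {
    awk -F '\t' '
        /^#/ || NF < 2 { next }
        {
            n = split($2, p, ",")
            best = 1
            for (i = 2; i <= n; i++) {
                # "+ 0" forces a numeric comparison instead of a string one
                if (p[i] + 0 > p[best] + 0) best = i
            }
            # awk arrays start at 1, so subtract 1 to report zero-based topic ids
            printf "%s\t%d\t%s\n", $1, best - 1, p[best]
        }
    ' "$1"
}
```

For example, top_topic_per_doc /path/to/doc_topic_text_out would list one line per document. When two topics tie, the lower-numbered one is kept.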

Source

2013-11-19 23:42:53 Wesam
