Inference with Labeled LDA/pLDA [Topic Modeling Toolbox]

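I have been trying to get code working that trains a Labeled LDA model and runs inference with pLDA, using the TMT toolbox (Stanford NLP group). I have gone through the examples provided at the following links: http://nlp.stanford.edu/software/tmt/tmt-0.3/ http://nlp.stanford.edu/software/tmt/tmt-0.4/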

Here is the code I am trying to use for Labeled LDA inference:

val modelPath = file("llda-cvb0-59ea15c7-31-61406081-75faccf7"); 

val model = LoadCVB0LabeledLDA(modelPath);

val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1); 

val text = { 
    source ~>        // read from the source file 
    Column(4) ~>       // select column containing text 
    TokenizeWith(model.tokenizer.get)  //tokenize with model's tokenizer 
} 

val labels = { 
    source ~>        // read from the source file 
    Column(2) ~>       // take column two, the year 
    TokenizeWith(WhitespaceTokenizer())  
} 

val outputPath = file(modelPath, source.meta[java.io.File].getName.replaceAll(".csv","")); 

val dataset = LabeledLDADataset(text,labels,model.termIndex,model.topicIndex); 

val perDocTopicDistributions = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset); 

val perDocTermTopicDistributions =EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions); 

TSVFile(outputPath+"-word-topic-distributions.tsv").write({ 
    for ((terms,(dId,dists)) <- text.iterator zip perDocTermTopicDistributions.iterator) yield { 
    require(terms.id == dId); 
    (terms.id, 
    for ((term,dist) <- (terms.value zip dists)) yield { 
     term + " " + dist.activeIterator.map({ 
     case (topic,prob) => model.topicIndex.get.get(topic) + ":" + prob 
     }).mkString(" "); 
    }); 
    } 
});

Error

found   : scalanlp.collection.LazyIterable[(String, Array[Double])]
required: Iterable[(String, scalala.collection.sparse.SparseArray[Double])]
EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

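I know this is a type mismatch error, but I don't know how to resolve it in Scala. Basically, I don't understand how I should extract (1) the per-document topic distribution and (2) the per-document label distribution from the output of the inference command.

Please help. The same applies to pLDA: I get as far as the inference command and then I am stuck.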

Source

2012-07-28 Rohit Jain

Scala's type system is more complex than Java's, and understanding it will make you a better programmer. The problem lies here:

val perDocTermTopicDistributions =EstimateLabeledLDAPerWordTopicDistributions(model, dataset, perDocTopicDistributions);

because one of model, dataset, or perDocTopicDistributions (judging by the compiler message, perDocTopicDistributions) has the type:

scalanlp.collection.LazyIterable[(String, Array[Double])]

while EstimateLabeledLDAPerWordTopicDistributions.apply expects an

Iterable[(String, scalala.collection.sparse.SparseArray[Double])]

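The best way to investigate this kind of error is to look at the ScalaDoc (for example, the one for TMT: http://nlp.stanford.edu/software/tmt/tmt-0.4/api/#package). If you cannot easily find where the problem is, make the types of your variables explicit in the code, like this: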

val perDocTopicDistributions:LazyIterable[(String, Array[Double])] = InferCVB0LabeledLDADocumentTopicDistributions(model, dataset)
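The same technique can be tried out with plain Scala collections. The sketch below does not use TMT at all: `infer`, `estimate`, and `convert` are hypothetical stand-ins chosen only to reproduce a mismatch of the same shape (an inner `Array` where another collection type is expected).

```scala
object AnnotateTypes {
  // Hypothetical stand-in for an inference step that returns dense arrays.
  def infer(ids: Seq[String]): Seq[(String, Array[Double])] =
    ids.map(id => (id, Array(0.25, 0.75)))

  // Hypothetical stand-in for a later step that expects Vector, not Array.
  def estimate(dists: Iterable[(String, Vector[Double])]): Int =
    dists.size

  // Converts the inner Array so the element type matches what estimate expects.
  def convert(dists: Seq[(String, Array[Double])]): Seq[(String, Vector[Double])] =
    dists.map { case (id, values) => (id, values.toVector) }

  def main(args: Array[String]): Unit = {
    // The annotation makes the compiler confirm what infer really returns.
    val dists: Seq[(String, Array[Double])] = infer(Seq("doc1", "doc2"))

    // estimate(dists) would not compile: Array[Double] is not Vector[Double].
    println(estimate(convert(dists)))
  }
}
```

With the annotation in place, a wrong guess about the type is reported on the annotated line itself, which tells you which expression needs converting.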

If we look together at the ScalaDoc for edu.stanford.nlp.tmt.stage:

def EstimateLabeledLDAPerWordTopicDistributions(model: edu.stanford.nlp.tmt.model.llda.LabeledLDA[_, _, _], dataset: Iterable[LabeledLDADocumentParams], perDocTopicDistributions: Iterable[(String, SparseArray[Double])]): LazyIterable[(String, Array[SparseArray[Double]])]

def InferCVB0LabeledLDADocumentTopicDistributions(model: CVB0LabeledLDA, dataset: Iterable[LabeledLDADocumentParams]): LazyIterable[(String, Array[Double])]

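It should now be clear that the return value of InferCVB0LabeledLDADocumentTopicDistributions cannot be fed directly into EstimateLabeledLDAPerWordTopicDistributions.

I have never used Stanford NLP, but this is how the API is designed to work, so you just need to convert scalanlp.collection.LazyIterable[(String, Array[Double])] into Iterable[(String, scalala.collection.sparse.SparseArray[Double])] before calling the function.

If you look at the ScalaDoc to see how to do this conversion, it is fairly simple. In the stage package, inside package.scala, I can read import scalanlp.collection.LazyIterable;

So I know where to look. At http://www.scalanlp.org/docs/core/data/#scalanlp.collection.LazyIterable you have a toIterable method that turns a LazyIterable into an Iterable, but you still have to convert the inner array into a SparseArray.

Again, I look into package.scala for the stage package inside TMT, and I see: import scalala.collection.sparse.SparseArray; so I look up the scalala documentation: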

http://www.scalanlp.org/docs/scalala/0.4.1-SNAPSHOT/#scalala.collection.sparse.SparseArray

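The constructors look complicated to me, which suggests I should look for a factory method on the companion object. It turns out that the method I am looking for is there and, as usual in Scala, it is called apply.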

def apply[T](values: T*)(implicit arg0: ClassManifest[T], arg1: DefaultArrayValue[T]): SparseArray[T]

Using this, you can write a function with the following signature:

def f: Array[Double] => SparseArray[Double]
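As a rough illustration of what such a function could look like, here is a sketch built around a toy class. `TinySparse` is a hypothetical stand-in, not scalala's `SparseArray`: it only mimics the varargs `apply(values: T*)` shape quoted above and leaves out the implicit parameters. With scalala on the classpath, the body of `f` would presumably be `SparseArray(values: _*)`, but that is an assumption and has not been tested here.

```scala
// Toy sparse container: stores only the nonzero entries, keyed by index.
final class TinySparse private (val length: Int, val nonZero: Map[Int, Double]) {
  def apply(i: Int): Double = nonZero.getOrElse(i, 0.0)
}

object TinySparse {
  // Factory method: TinySparse(0.0, 0.5) is shorthand for TinySparse.apply(0.0, 0.5).
  def apply(values: Double*): TinySparse = {
    val nz = values.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }.toMap
    new TinySparse(values.length, nz)
  }
}

object TinySparseDemo {
  // Same shape as the signature above, with TinySparse in place of SparseArray[Double].
  // `values: _*` passes an existing array where varargs are expected.
  def f: Array[Double] => TinySparse = values => TinySparse(values: _*)

  def main(args: Array[String]): Unit = {
    val sparse = f(Array(0.0, 0.5, 0.0, 0.5))
    println(sparse.length + " entries, " + sparse.nonZero.size + " nonzero")
  }
}
```

The `: _*` ascription is the part that matters: it is what lets an `Array[Double]` be handed to a factory declared with a repeated parameter.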

Once that is done, you can turn the result of InferCVB0LabeledLDADocumentTopicDistributions into a non-lazy iterable of sparse arrays with one line of code:

result.toIterable.map { case (name, values) => (name, f(values)) }
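The whole conversion step can be sketched end to end with the standard library only. Everything here is a stand-in: a collection view plays the role of the `LazyIterable` returned by inference, `ToySparse` (a plain `Map` from index to value) plays the role of `SparseArray[Double]`, and `toList` is used to force evaluation where the line above uses `toIterable`. None of these names come from TMT or scalala.

```scala
object ConvertDistributions {
  // Hypothetical stand-in for SparseArray[Double]: index -> nonzero value.
  type ToySparse = Map[Int, Double]

  // Plays the role of f: Array[Double] => SparseArray[Double].
  val f: Array[Double] => ToySparse =
    values => values.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }.toMap

  // Forces the (possibly lazy) input, then converts each dense array.
  def convert(result: Iterable[(String, Array[Double])]): List[(String, ToySparse)] =
    result.toList.map { case (name, values) => (name, f(values)) }

  def main(args: Array[String]): Unit = {
    // A lazy view stands in for the LazyIterable returned by inference.
    val result = List("doc1" -> Array(0.0, 1.0), "doc2" -> Array(0.5, 0.5)).view
    convert(result).foreach { case (name, sparse) => println(name + " " + sparse) }
  }
}
```

In the real script, the converted value would then be passed as the third argument of EstimateLabeledLDAPerWordTopicDistributions in place of the raw inference result.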

Source

2012-08-03 09:57:20 Edmondo1984
