How to map words back onto LDA results in a PySpark DataFrame

I have a PySpark DataFrame holding LDA results, like this:

topicIndices.filter("topic > 3").show(10, truncate=True) 
+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    4| [27, 56, 29, 46, 6]|[0.01826416604834...|
|    5| [63, 4, 36, 31, 21]|[0.01900143131755...|
|    6|[40, 60, 16, 36, 50]|[0.01915052744093...|
|    7|   [5, 59, 4, 8, 29]|[0.05513279495368...|
|    8| [52, 17, 10, 46, 2]|[0.01903217569516...|
|    9|     [0, 1, 3, 7, 6]|[0.13563252276342...|
+-----+--------------------+--------------------+

I want to replace the term indices with the actual words so that I can inspect the topics. What I tried is:

topics = topicIndices \
    .rdd \
    .map(lambda x: vocabList[y] for y in x[1].zip(x[2]))

but I get the error:

NameError: name 'x' is not defined

What am I doing wrong here?

In fact, this is meant to be the Python version of this Scala code:

val topics = topicIndices.map { case (terms, termWeights) => 
       terms.map(vocabList(_)).zip(termWeights) 
      }

taken from this Databricks post.


2017-11-17 user299791

The body of your lambda expression should go inside parentheses, i.e.:

topics = topicIndices \
    .rdd \
    .map(lambda x: (vocabList[y] for y in x[1].zip(x[2])))
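A minimal pure-Python sketch of why the parentheses matter, with no Spark involved. The names `vocab` and `call` are hypothetical stand-ins (for `vocabList` and for any one-argument function such as `RDD.map`). Without parentheses, Python reads the argument as a generator expression whose elements are lambdas, so `x` is looked up outside the lambda and the `NameError` from the question appears.

```python
# Hypothetical stand-ins: `vocab` for vocabList, `call` for any function
# that takes a single argument (as RDD.map does).
vocab = ["a", "b", "c"]

def call(arg):
    return arg

def without_parens():
    # Parsed as: call((lambda x: vocab[y]) for y in x[1])
    # The outermost iterable x[1] is evaluated right away, outside any
    # lambda, so x is looked up as an ordinary (undefined) name.
    try:
        call(lambda x: vocab[y] for y in x[1])
    except NameError as exc:
        return type(exc).__name__
    return None

# With parentheses, the whole generator expression is the lambda body.
with_parens = call(lambda x: (vocab[y] for y in x[1]))

error_name = without_parens()
words = list(with_parens((0, [2, 0])))
```

Here `error_name` holds the name of the exception raised by the unparenthesized form, and `words` holds the vocabulary entries looked up by the parenthesized one.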

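UPDATE (after the comments): You are apparently trying to use the PySpark zip, but that one takes an RDD as its argument, not a list. My guess (since you have not provided an example of your desired result, let alone of vocabList itself) is that you need the standard Python zip function, which serves a different purpose:

topics = topicIndices \
    .rdd \
    .map(lambda x: [(vocabList[i], w) for i, w in zip(x[1], x[2])])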
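As a hedged illustration of what the Scala snippet in the question does per row, here is a pure-Python sketch that runs without Spark. It assumes `vocabList` is a plain Python list of words indexed by term index; `vocab_list`, `rows`, and `terms_with_weights` are hypothetical names, and the sample rows are invented for the example.

```python
# Hypothetical stand-ins for the real data: vocab_list plays the role of
# vocabList, and each row mimics (topic, termIndices, termWeights).
vocab_list = ["alpha", "beta", "gamma", "delta"]
rows = [
    (0, [2, 0], [0.5, 0.25]),
    (1, [3, 1], [0.4, 0.1]),
]

def terms_with_weights(row):
    """Pair the word for each term index with its weight."""
    _, term_indices, term_weights = row
    words = [vocab_list[i] for i in term_indices]
    # Built-in zip pairs two ordinary sequences element by element.
    return list(zip(words, term_weights))

topics = [terms_with_weights(r) for r in rows]
```

Under the same assumption, the function could presumably be passed straight to Spark as `topicIndices.rdd.map(terms_with_weights)`, since a Row unpacks like a tuple. Returning a list rather than a generator keeps the result serializable.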


2017-11-18 09:24:39 desertnaut

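Hmm, this way I get AttributeError: 'list' object has no attribute 'zip' – user299791

@user299791 The error reported in your OP was caused by a mistake in standard Python usage on your part, hence my answer; arguably, you are now closer to your real problem; ['zip'](http://spark.apache.org/docs/2.1.1/api/python/pyspark.html#pyspark.RDD.zip) applies to RDDs, not to lists like 'x[1]' & 'x[2]'. Please accept this answer and open a new question that includes **an example of the desired result**. – desertnaut

You have seen the edited question, where I report the Scala code I am trying to port, haven't you? – user299791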
