Pyspark mllib LDA error: Object cannot be cast to java.util.List

I am currently trying to run LDA on a Spark cluster. I have an RDD that looks like this:
>>> myRdd.take(2)
[(218603, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]), (95680, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])]
But calling
model = LDA.train(myRdd, k=5, seed=42)
gives the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5874.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5874.0): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.List
I am not sure how to interpret this error, since nothing on my side seems obviously wrong, so any advice would be appreciated; the documentation for LDA in mllib is fairly sparse.
I obtained the RDD from the following process, where the DataFrame document_instances has columns "doc_label" and "terms":
hashingTF = HashingTF(inputCol="terms", outputCol="term_frequencies", numFeatures=10)
tf_matrix = hashingTF.transform(document_instances)
myRdd = tf_matrix.select("doc_label", "term_frequencies").rdd
Using this RDD directly gives the same error shown above. Now, this uses HashingTF from pyspark.ml.feature, so I suspected there might be a conflict caused by the difference between mllib vectors and ml vectors, but mapping each row through the Vectors.fromML() function gives the same error, as do all of the following variants:
myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, old_row.term_frequencies.toArray().tolist()))

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, old_row.term_frequencies.toArray()))

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, Vectors.fromML(old_row.term_frequencies)))

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, old_row.term_frequencies))