In Spark, how do I convert a DataFrame with a SparseVector column to an RDD[Vector]?

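Following this example, I computed TF-IDF weights for some documents. Now I want to use RowMatrix to compute document similarities, but I can't get the data into the right format. What I have is a DataFrame whose rows have two columns, of types (String, SparseVector). I need to convert it to an RDD[Vector], which I thought would be as simple as: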

features.map(row => row.getAs[SparseVector](1)).rdd()

But I get this error:

<console>:58: error: Unable to find encoder for type stored in a 
Dataset. Primitive types (Int, String, etc) and Product types (case 
classes) are supported by importing spark.implicits._ Support for 
serializing other types will be added in future releases.

Importing spark.implicits._ makes no difference.

So what is going on here? I'm surprised that Spark doesn't know how to encode its own vector data type.


2017-10-11 Josh Hansen

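Just convert to an RDD before the map. Dataset.map needs an implicit Encoder for its result type, while RDD.map does not.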

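import org.apache.spark.ml.linalg._

// Toy DataFrame with one (Int, Vector) row; toDF relies on spark.implicits._ (pre-imported in spark-shell)
val df = Seq((1, Vectors.sparse(1, Array(), Array()))).toDF

// Switch to the RDD API first, so the map needs no Dataset encoder
df.rdd.map(row => row.getAs[Vector](1))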
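
To connect this back to the RowMatrix goal in the question, here is a minimal sketch rather than a definitive recipe. It assumes the TF-IDF vectors come from the spark.ml package (org.apache.spark.ml.linalg), whereas RowMatrix lives in spark.mllib and expects org.apache.spark.mllib.linalg.Vector, so each vector is converted with Vectors.fromML. The local SparkSession, the column names and the two sample rows are hypothetical stand-ins for the asker's `features` DataFrame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-rdd-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's `features` DataFrame: (String, SparseVector) rows.
val features = Seq(
  ("doc1", MLVectors.sparse(3, Array(0, 2), Array(1.0, 0.5))),
  ("doc2", MLVectors.sparse(3, Array(1), Array(2.0)))
).toDF("id", "tfidf")

// Drop to the RDD API first (no encoder needed), then convert each
// spark.ml vector to its spark.mllib counterpart for RowMatrix.
val rows: RDD[OldVector] =
  features.rdd.map(row => OldVectors.fromML(row.getAs[MLVector](1)))

// One matrix row per document, one column per term.
val matrix = new RowMatrix(rows)
println(s"${matrix.numRows()} x ${matrix.numCols()}")
```

If the vectors were produced by the older spark.mllib feature transformers instead, the fromML step would not be needed and getAs could target the mllib Vector type directly.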


2017-10-11 22:07:52 user8371915
