How to apply SVD to a TF-IDF DataFrame in PySpark

I have applied the PySpark TF-IDF functions and obtained the following result. How can I apply SVD to this TF-IDF DataFrame?

| features | 
|----------| 
| (35,[7,9,11,12,19,26,33],[1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003,1.6094379124341003,1.6094379124341003,1.6094379124341003]) | 
| (35,[0,2,4,5,6,11,22],[0.9162907318741551,0.9162907318741551,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003]) |

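So the DataFrame has a single column (features) whose rows are SparseVectors.

Now I want to build an IndexedRowMatrix from this DataFrame so that I can run the SVD function described here: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=svd#pyspark.mllib.linalg.distributed.IndexedRowMatrix.computeSVD

I tried the following, but it did not work: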

mat = RowMatrix(tfidfData.rdd.map(lambda x: x.features)) 

TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

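I used RowMatrix because building it does not require me to supply tuples, but I cannot even construct a RowMatrix. An IndexedRowMatrix would be harder still.

So how can I produce an IndexedRowMatrix from a TF-IDF DataFrame in PySpark?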


2017-09-20 Abdullah

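I was able to solve it. As the error message says, RowMatrix will not accept a pyspark.ml.linalg.SparseVector, so I converted the vector to a pyspark.mllib.linalg vector. Note the difference between ml and mllib. Here is the snippet that converts the TF-IDF output to a RowMatrix, on which the computeSVD method can then be called.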

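from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
# rawFeatures is the vector column in my DataFrame; substitute your own column name (e.g. features)
mat = RowMatrix(df.rdd.map(lambda v: Vectors.dense(v.rawFeatures.toArray())))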

I converted the rows to dense vectors, but with a few extra lines of code you can convert each ml.linalg.SparseVector to an mllib.linalg.SparseVector instead.


2017-09-20 21:36:52 Abdullah
