PySpark: converting an RDD to a RowMatrix

I have an RDD of the form (id1, id2, score). The top(5) rows look like:

[(41955624, 42044497, 3.913625989045223e-06), 
(41955624, 42039940, 0.0001018890937469129), 
(41955624, 42037797, 7.901647831291928e-05), 
(41955624, 42011137, -0.00016191403038589588), 
(41955624, 42006663, -0.0005302800991148567)]

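I want to compute the similarity between the id2 members based on the scores. I would like to use RowMatrix.columnSimilarities, but I first need to convert the RDD to a RowMatrix. I want the matrix to be structured as id1 x id2, i.e., with the row ids taken from id1 and the column ids taken from id2.

If my data were small, I could convert it to a PySpark DataFrame and then use pivot, like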

rdd_df.groupBy("id1").pivot("id2").sum("score")

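But that borks with more than 10,000 distinct id2 values, and I have far more than that.

The naive rdd_Mat = la.RowMatrix(rdd) brings the data in as a 3-column matrix, which is not what I want.

Many thanks.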

2017-08-10 efreeman

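The structure of your data is closer to that of a CoordinateMatrix, which is basically a wrapper around an RDD of (row, column, value) tuples. Because of this, you can easily create a CoordinateMatrix from your existing RDD.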

from pyspark.mllib.linalg.distributed import CoordinateMatrix 

cmat=CoordinateMatrix(yourRDD)

Also, since you originally asked for a RowMatrix, PySpark provides an easy way to convert between matrix types:

rmat=cmat.toRowMatrix()

which gives you the RowMatrix you wanted.
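As a rough end-to-end sketch of the steps above, the function below wraps the RDD in a CoordinateMatrix, converts it to a RowMatrix, and calls columnSimilarities to get cosine similarities between the columns. The function name id2_column_similarities and the parameter name triples are made up for illustration. It also assumes the ids are already small, non-negative integer indices, because the matrix dimensions are inferred from the largest index seen when they are not given explicitly.

```python
def id2_column_similarities(triples):
    """Cosine similarities between the id2 columns of an (id1, id2, score) RDD.

    Hypothetical helper: `triples` is assumed to be a PySpark RDD of
    (int, int, float) tuples whose ids are already compact 0-based indices.
    """
    # Imported here so that defining the function does not itself require Spark.
    from pyspark.mllib.linalg.distributed import CoordinateMatrix

    cmat = CoordinateMatrix(triples)   # rows come from id1, columns from id2
    rmat = cmat.toRowMatrix()          # row indices are dropped, columns are kept
    sims = rmat.columnSimilarities()   # upper-triangular CoordinateMatrix
    # Unpack each MatrixEntry into a plain (col_a, col_b, similarity) tuple.
    return sims.entries.map(lambda e: (e.i, e.j, e.value))
```

Calling id2_column_similarities(yourRDD).take(5) would then return a few (column, column, similarity) tuples. Since the result is upper triangular, each pair of columns appears at most once.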

2017-08-10 23:28:58 DavidWayne

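Thank you. I found that I had to add an intermediate step that translates the ids to consecutive integers, to avoid creating a matrix with 40 million columns. – efreeman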

You're welcome. If this answer solved your problem, please consider accepting it by clicking the check mark. No obligation. – DavidWayne
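The intermediate step efreeman describes could look roughly like the sketch below, written in plain Python on the five sample rows from the question so that the idea is easy to follow. The helper names build_index and reindex are hypothetical. On a real RDD the same mapping would more likely be built with distinct().zipWithIndex() and applied through a join or a broadcast lookup, which this sketch does not show.

```python
def build_index(ids):
    """Assign each distinct id a consecutive integer 0..n-1 (sorted for determinism)."""
    return {raw: k for k, raw in enumerate(sorted(set(ids)))}


def reindex(triples):
    """Replace raw (id1, id2) values with compact row and column indices."""
    triples = list(triples)
    rows = build_index(t[0] for t in triples)   # raw id1 -> row index
    cols = build_index(t[1] for t in triples)   # raw id2 -> column index
    entries = [(rows[i], cols[j], score) for i, j, score in triples]
    return entries, rows, cols


sample = [
    (41955624, 42044497, 3.913625989045223e-06),
    (41955624, 42039940, 0.0001018890937469129),
    (41955624, 42037797, 7.901647831291928e-05),
    (41955624, 42011137, -0.00016191403038589588),
    (41955624, 42006663, -0.0005302800991148567),
]

entries, rows, cols = reindex(sample)
# The matrix is now 1 x 5 instead of spanning tens of millions of columns.
print(len(rows), len(cols))
```

Keeping the rows and cols dictionaries around makes it possible to translate similarity results back to the original ids afterwards.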
