pySpark column similarities problem

tl;dr How do I use pySpark to compare the similarity of rows?

I have a numpy array, and I want to compare every row with every other row for similarity.

print (pdArray) 
#[[ 0. 1. 0. ..., 0. 0. 0.] 
# [ 0. 0. 3. ..., 0. 0. 0.] 
# [ 0. 0. 0. ..., 0. 0. 7.] 
# ..., 
# [ 5. 0. 0. ..., 0. 1. 0.] 
# [ 0. 6. 0. ..., 0. 0. 3.] 
# [ 0. 0. 0. ..., 2. 0. 0.]]

Using scikit-learn I can compute the cosine similarities as follows...

pyspark.__version__ 
# '2.2.0' 

from sklearn.metrics.pairwise import cosine_similarity 
similarities = cosine_similarity(pdArray) 

similarities.shape 
# (475, 475) 

print(similarities) 
array([[ 1.00000000e+00, 1.52204908e-03, 8.71545594e-02, ..., 
      3.97681174e-04, 7.02593036e-04, 9.90472253e-04], 
     [ 1.52204908e-03, 1.00000000e+00, 3.96760121e-04, ..., 
      4.04724413e-03, 3.65324300e-03, 5.63519735e-04], 
     [ 8.71545594e-02, 3.96760121e-04, 1.00000000e+00, ..., 
      2.62367141e-04, 1.87878869e-03, 8.63876439e-06], 
     ..., 
     [ 3.97681174e-04, 4.04724413e-03, 2.62367141e-04, ..., 
      1.00000000e+00, 8.05217639e-01, 2.69724702e-03], 
     [ 7.02593036e-04, 3.65324300e-03, 1.87878869e-03, ..., 
      8.05217639e-01, 1.00000000e+00, 3.00229809e-03], 
     [ 9.90472253e-04, 5.63519735e-04, 8.63876439e-06, ..., 
      2.69724702e-03, 3.00229809e-03, 1.00000000e+00]])

Since I want to scale up to sets much larger than my original (475-row) matrix, I am looking at using Spark via pySpark.

from pyspark.mllib.linalg.distributed import RowMatrix 

#load data into spark 
tempSpark = sc.parallelize(pdArray) 
mat = RowMatrix(tempSpark) 

# Calculate exact similarities 
exact = mat.columnSimilarities() 

exact.entries.first() 
# MatrixEntry(128, 211, 0.004969676943490767) 

# Now when I get the data out I do the following...
import numpy as np

# Convert to a RowMatrix (uses `exact` from above; `approx` was never defined).
rowMat = exact.toRowMatrix()
t_3 = rowMat.rows.collect()
a_3 = np.array([(x.toArray()) for x in t_3])
a_3.shape 
# (488, 749)

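As you can see, the shape of the data is a) no longer square (which it should be) and b) its dimensions do not match the original number of rows. It does partly match the number of features in each row (len(pdArray[0]) = 749), but I don't know where the 488 comes from.

The presence of 749 makes me think I need to transpose my data first. Is that correct?

Finally, if that is the case, why are the dimensions not (749, 749)?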


2017-08-07 Chris Arthur

How many rows of sparse vectors does rowMat.rows.collect() show for this? – Suresh

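First, the columnSimilarities method only returns the off-diagonal entries of the upper triangular portion of the similarity matrix. Because the 1s along the diagonal are missing, the resulting similarity matrix may contain entire rows of 0s.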
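To make the first point concrete, here is a minimal pure-Python sketch of how a full, symmetric similarity matrix could be rebuilt from upper-triangular entries. The helper name `full_similarity` and the sample entries are hypothetical, and the (i, j, value) tuples only mimic the MatrixEntry objects; this does not call Spark.

```python
def full_similarity(entries, n):
    """Build a dense n x n similarity matrix from upper-triangular entries.

    `entries` is an iterable of (i, j, value) tuples with i < j, standing in
    for the MatrixEntry objects that columnSimilarities() returns.
    """
    sim = [[0.0] * n for _ in range(n)]
    for i, j, value in entries:
        sim[i][j] = value  # upper triangle, as returned
        sim[j][i] = value  # mirror into the lower triangle
    for k in range(n):
        sim[k][k] = 1.0    # assumption: every column is non-zero, so self-similarity is 1
    return sim


# Hypothetical entries for a matrix with 4 columns.
entries = [(0, 1, 0.5), (0, 2, 0.25), (1, 2, 0.75)]
sim = full_similarity(entries, 4)
```

Note that with i < j the last index never appears as a row index, so before mirroring the last row is always empty, and any column with no non-zero similarity to a later column leaves an empty row as well.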

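Second, a pyspark RowMatrix has no meaningful row indices. So when converting from a CoordinateMatrix to a RowMatrix, the i value in each MatrixEntry is essentially mapped to whatever is convenient (probably some incrementing index). What is likely happening is that rows of all 0s are simply ignored, so the matrix is squashed vertically when it is converted to a RowMatrix.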
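The squashing effect can be imitated without Spark. The sketch below is a simplified simulation of the behaviour described above, not Spark's actual implementation: it groups hypothetical (i, j, value) entries by row index and then throws the index away. The function name `to_unindexed_rows` and the sorted output order are assumptions made for illustration.

```python
from collections import defaultdict


def to_unindexed_rows(entries, num_cols):
    """Group (i, j, value) entries by row i, then discard the row index."""
    by_row = defaultdict(dict)
    for i, j, value in entries:
        by_row[i][j] = value
    rows = []
    for i in sorted(by_row):  # only rows that own at least one entry survive
        row = [0.0] * num_cols
        for j, value in by_row[i].items():
            row[j] = value
        rows.append(row)
    return rows


# Hypothetical 5 x 5 similarity matrix with entries only in rows 0 and 2.
entries = [(0, 3, 0.2), (2, 3, 0.9), (2, 4, 0.1)]
rows = to_unindexed_rows(entries, 5)
print(len(rows), len(rows[0]))
```

Here the result has 2 rows and 5 columns even though the logical matrix is 5 x 5, which is the same kind of mismatch as a (488, 749) result from a 749 x 749 similarity matrix.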

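It probably makes sense to check the dimensions of the similarity matrix immediately after computing it with the columnSimilarities method. You can do this with the numRows() and numCols() methods.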

print(exact.numRows(),exact.numCols())

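Beyond that, it does sound like you need to transpose your matrix to get the correct vector similarities. Also, if for some reason you need the output in a RowMatrix-like form, you could try an IndexedRowMatrix, which does have meaningful row indices and preserves the row indices of the original CoordinateMatrix on conversion.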
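Why transposing matters can be seen on a tiny example. This is a pure-Python sketch with hypothetical helper names (`column_cosine_similarities`, `transpose`) and made-up data; it compares columns, as columnSimilarities does, and shows that comparing rows (samples) requires feeding in the transposed matrix.

```python
import math


def column_cosine_similarities(matrix):
    """Cosine similarity between every pair of columns of a list-of-rows matrix."""
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(x * x for x in col)) for col in columns]
    n = len(columns)
    sim = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if norms[a] and norms[b]:  # leave all-zero columns at 0.0
                dot = sum(x * y for x, y in zip(columns[a], columns[b]))
                sim[a][b] = dot / (norms[a] * norms[b])
    return sim


def transpose(matrix):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical data: 2 rows (samples) x 3 columns (features).
data = [[1.0, 0.0, 2.0],
        [0.0, 3.0, 0.0]]

by_feature = column_cosine_similarities(data)             # 3 x 3: features compared
by_sample = column_cosine_similarities(transpose(data))   # 2 x 2: samples compared
```

In pySpark terms, one way to apply this would be to transpose the numpy array before parallelizing it (for example `sc.parallelize(pdArray.T)`), and to call `exact.toIndexedRowMatrix()` instead of `toRowMatrix()` so that row indices are kept; treat both as suggestions to verify against your Spark version rather than tested code.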


2017-08-07 15:01:01 DavidWayne
