2016-09-18 67 views
0

pyspark: expand a DenseVector into a tuple in an RDD

I have the following RDD, where each record is a tuple of (bigint, vector):

myRDD.take(5) 

[(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])), 
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])), 
(0, DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0])), 
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])), 
(1, DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432]))] 

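How can I expand the dense vector so that its elements become part of the outer tuple? That is, I would like the above to become: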

[(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432), 
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432), 
(0, 5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0), 
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432), 
(1, 9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432)] 

Thanks!

+1

Hint: 'Vector' is iterable. Everything else is just basic Python (argument unpacking may be useful, but is not required). – zero323

+0

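Thanks zero323! I tried newRDD = myRDD.map(lambda x: (x[0], tuple(x[1]))), which does expand the DenseVector into a tuple, but I still end up with a tuple nested inside a tuple, like: (1, (1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432)). Any hints on turning this nested tuple into a single flat tuple? Thanks! – Edamame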

Answer

1

Well, since pyspark.ml.linalg.DenseVector (or the mllib one) is iterable (it provides the __len__ and __getitem__ methods), you can treat it like any other Python collection, for example:

def as_tuple(kv): 
    """ 
    >>> as_tuple((1, DenseVector([9.25, 1.0, 0.31, 0.31, 162.37]))) 
    (1, 9.25, 1.0, 0.31, 0.31, 162.37) 
    """ 
    k, v = kv 
    # Use *v.toArray() instead if you also want to support sparse vectors.
    return (k, *v) 
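
To apply this to every record, the function can be passed to map. The sketch below mimics that without a Spark installation: a plain list stands in for the RDD and plain lists stand in for DenseVector. Both stand-ins are assumptions made for illustration; the unpacking only relies on the vector being iterable. The (k, *v) syntax needs Python 3.5 or later.

```python
def as_tuple(kv):
    # Split the record into the key and the vector, then flatten.
    k, v = kv
    return (k, *v)

# Stand-ins (assumptions): a list instead of an RDD,
# lists instead of DenseVector instances.
records = [
    (1, [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432]),
    (0, [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0]),
]

# On the RDD from the question this would be: myRDD.map(as_tuple)
flat = list(map(as_tuple, records))
print(flat)
```

Each resulting record is a single flat tuple of seven elements, with the key first.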

For Python 2, replace:

(k, *v) 

with:

from itertools import chain 

tuple(chain([k], v)) 

or:

(k,) + tuple(v) 
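
As a quick check that both alternatives build the same flat tuple, here is a small sketch. The values of k and v are sample data, and a plain list stands in for the vector (an assumption; any iterable behaves the same way here).

```python
from itertools import chain

k = 0
v = [5.0, 20.0, 0.3444]  # stand-in for a DenseVector (assumption)

# Chain the key, wrapped in a one-element list, with the vector elements.
with_chain = tuple(chain([k], v))

# Concatenate a one-element tuple with the vector converted to a tuple.
with_concat = (k,) + tuple(v)

print(with_chain)
print(with_concat)
```

Both forms also work on Python 3, so they are a reasonable choice when the code has to run on either version.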

If you want the values converted to Python (not NumPy) scalars, use the following instead of v:

v.toArray().tolist() 

+0

'k, v = kv' is destructuring (unpacking). You could use 'kv[0]' and 'kv[1]' instead, but I find unpacking more elegant and easier to read. – zero323