2017-06-21

pyspark MLlib: exclude a column's value from a row

I am trying to create an RDD of LabeledPoint from a DataFrame so that I can use it later with MLlib.

The code below works fine when the my_target column is the first column in sparkDF. But if my_target is not the first column, how do I modify this code to exclude my_target and build a correct LabeledPoint?

import pyspark.mllib.classification as clf

# row[0] is assumed to be 'my_target'; row[1:] takes the remaining columns as features
labeledData = sparkDF.rdd.map(lambda row: clf.LabeledPoint(row['my_target'], row[1:]))

logRegr = clf.LogisticRegressionWithSGD.train(labeledData)

That is, row[1:] excludes the value in the first column; what do I do if I instead want to exclude the value in the Nth column? Thanks!
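One way to avoid hardcoding the label's position at all is to drop it by name. This is only a sketch (the helper name `split_label` is made up for illustration, and it relies on dicts preserving insertion order, as they do in Python 3.7+); in pyspark you would call `row.asDict()` on each `Row` before applying it:

```python
# Hypothetical helper: split a row-dict into (label, features) by column name,
# so the label column can sit anywhere in the row.
def split_label(row_dict, target="my_target"):
    label = row_dict[target]
    # keep every other column, preserving the original column order
    features = [v for k, v in row_dict.items() if k != target]
    return label, features

# Plain-dict stand-in for what row.asDict() would return:
example = {"a": 1.0, "my_target": 5.0, "b": 2.0}
split_label(example)  # (5.0, [1.0, 2.0])
```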

Answer

>>> a = [(1,21,31,41),(2,22,32,42),(3,23,33,43),(4,24,34,44),(5,25,35,45)] 
>>> df = spark.createDataFrame(a,["foo","bar","baz","bat"]) 
>>> df.show() 
+---+---+---+---+ 
|foo|bar|baz|bat| 
+---+---+---+---+ 
| 1| 21| 31| 41| 
| 2| 22| 32| 42| 
| 3| 23| 33| 43| 
| 4| 24| 34| 44| 
| 5| 25| 35| 45| 
+---+---+---+---+ 

>>> from pyspark.mllib.regression import LabeledPoint 
>>> N = 2 
# N is the index of the column you want to exclude (here the third; indexing starts at 0) 
>>> labeledData = df.rdd.map(lambda row: LabeledPoint(row['foo'], row[:N]+row[N+1:])) 
# row[:N] + row[N+1:] is just a concatenation in which index N is excluded from both slices 

>>> labeledData.collect() 
[LabeledPoint(1.0, [1.0,21.0,41.0]), LabeledPoint(2.0, [2.0,22.0,42.0]), LabeledPoint(3.0, [3.0,23.0,43.0]), LabeledPoint(4.0, [4.0,24.0,44.0]), LabeledPoint(5.0, [5.0,25.0,45.0])]
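The slicing trick above is plain Python sequence concatenation, so it can be checked without a Spark session. A minimal sketch (the `drop_nth` name and the sample tuple are made up here; pyspark `Row` objects support the same slicing):

```python
def drop_nth(row, n):
    """Return the row with the value at index n removed."""
    return row[:n] + row[n + 1:]

row = (1, 21, 31, 41)
drop_nth(row, 2)  # (1, 21, 41) -- the third value is gone
drop_nth(row, 0)  # (21, 31, 41) -- equivalent to row[1:]
```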