PySpark: How do I add one more column to a DataFrame?

I am working with a DataFrame that initially has two columns, id and colA.

+---+-----+ 
|id |colA | 
+---+-----+ 
| 1 | 5 | 
| 2 | 9 | 
| 3 | 3 | 
| 4 | 1 | 
+---+-----+

I need to merge this DataFrame with one more column, colB. I know that colB fits exactly at the end of the DataFrame; I just need some way to join them together.

+-----+ 
|colB | 
+-----+ 
| 5 | 
| 9 | 
| 3 | 
| 1 | 
+-----+

From these, I need to obtain a new DataFrame like the one below:

+---+-----+-----+ 
|id |colA |colB | 
+---+-----+-----+ 
| 1 | 5 | 8 | 
| 2 | 9 | 7 | 
| 3 | 3 | 0 | 
| 4 | 1 | 6 | 
+---+-----+-----+

This is the PySpark code that creates the first DataFrame:

l=[(1,5),(2,9), (3,3), (4,1)] 
names=["id","colA"] 
db=sqlContext.createDataFrame(l,names) 
db.show()

How can I do this? Can anyone help me? Thanks.

Source

2017-10-18 Thaise

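You cannot add an arbitrary column to a DataFrame in Spark. See the extensive answer here: https://stackoverflow.com/questions/33681487/how-do-i-add-a-new-column-to-a-spark-dataframe-using-pyspark – desertnaut

Possible duplicate of [How do I add a new column to a Spark DataFrame (using PySpark)?](https://stackoverflow.com/questions/33681487/how-do-i-add-a-new-column-to-a-spark-dataframe-using-pyspark) – desertnaut

I did it! I solved it by adding a temporary column holding the row index to each DataFrame, joining on it, and then dropping it.

Code: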

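from pyspark.sql import Row
from pyspark.sql.window import Window
# rowNumber was renamed row_number (available since Spark 1.6)
from pyspark.sql.functions import row_number, monotonically_increasing_id

# row_number() needs an ordered window; ordering by
# monotonically_increasing_id() is one common choice for this
w = Window.orderBy(monotonically_increasing_id())

l = [(1, 5), (2, 9), (3, 3), (4, 1)]
names = ["id", "colA"]
db = sqlContext.createDataFrame(l, names)
db.show()

# Build the single-column DataFrame holding colB
l = [5, 9, 3, 1]
rdd = sc.parallelize(l).map(lambda x: Row(x))
test_df = rdd.toDF()
test_df2 = test_df.selectExpr("_1 as colB")
dbB = test_df2.select("colB")

# Add a temporary row index to both DataFrames
db = db.withColumn("columnindex", row_number().over(w))
dbB = dbB.withColumn("columnindex", row_number().over(w))

# Join on the index, then drop it from both sides
testdf_out = db.join(dbB, db.columnindex == dbB.columnindex, 'inner').drop(db.columnindex).drop(dbB.columnindex)
testdf_out.show()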

Source

2017-10-18 13:23:02 Thaise

You can use monotonically_increasing_id directly to create a temporary index for each DataFrame and join them. Check that as well. – Suresh
