val df1 = sc.parallelize(Seq(
("a1",10,"ACTIVE","ds1"),
("a1",20,"ACTIVE","ds1"),
("a2",50,"ACTIVE","ds1"),
("a3",60,"ACTIVE","ds1"))
).toDF("c1","c2","c3","c4")
val df2 = sc.parallelize(Seq(
("a1",10,"ACTIVE","ds2"),
("a1",20,"ACTIVE","ds2"),
("a1",30,"ACTIVE","ds2"),
("a1",40,"ACTIVE","ds2"),
("a4",20,"ACTIVE","ds2"))
).toDF("c1","c2","c3","c5")
df1.show()
// +---+---+------+---+
// | c1| c2|    c3| c4|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds1|
// | a1| 20|ACTIVE|ds1|
// | a2| 50|ACTIVE|ds1|
// | a3| 60|ACTIVE|ds1|
// +---+---+------+---+
df2.show()
// +---+---+------+---+
// | c1| c2|    c3| c5|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds2|
// | a1| 20|ACTIVE|ds2|
// | a1| 30|ACTIVE|ds2|
// | a1| 40|ACTIVE|ds2|
// | a4| 20|ACTIVE|ds2|
// +---+---+------+---+
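My requirement: I need to join the two DataFrames. The output DataFrame should contain all records from df1, plus those records from df2 that are not already in df1 but whose "c1" value matches one in df1. The records I pull from df2 should have column "c3" updated to "INACTIVE". How can I join two DataFrames and change the column for the missing values?
In this example, the only matching value of "c1" is a1. So I need to pull the records with c2 = 30 and c2 = 40 from df2 and mark them INACTIVE.
Here is the expected output.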
df_output.show()
// +---+---+--------+---+
// | c1| c2|      c3| c4|
// +---+---+--------+---+
// | a1| 10|  ACTIVE|ds1|
// | a1| 20|  ACTIVE|ds1|
// | a2| 50|  ACTIVE|ds1|
// | a3| 60|  ACTIVE|ds1|
// | a1| 30|INACTIVE|ds1|
// | a1| 40|INACTIVE|ds1|
// +---+---+--------+---+
Can anyone help me do this?
For the INACTIVE records, does the c4 value change from ds2 to ds1? – Pushkr
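One possible approach, offered as a sketch rather than a definitive answer: keep the df2 rows whose "c1" appears in df1 (a semi join), drop those whose ("c1", "c2") pair already exists in df1 (an anti join), overwrite "c3", and union the result onto df1. It assumes Spark 2.0 or later (for the "left_anti" join type and `union`), a spark-shell session with df1 and df2 defined as in the question, and that the expected output is literal, so "c4" is set to "ds1" for the pulled rows. The names `df1Keys` and `pulled` are illustrative only.

```scala
import org.apache.spark.sql.functions.{col, lit}

// Distinct c1 values present in df1; these are the "matching" keys.
val df1Keys = df1.select("c1").distinct()

val pulled = df2
  // Keep only df2 rows whose c1 also occurs in df1 (drops a4).
  .join(df1Keys, Seq("c1"), "left_semi")
  // Drop rows whose (c1, c2) pair is already in df1 (drops a1/10 and a1/20).
  .join(df1.select("c1", "c2"), Seq("c1", "c2"), "left_anti")
  .select(
    col("c1"),
    col("c2"),
    lit("INACTIVE").as("c3"),
    // Assumption: c4 is "ds1" as in the expected output.
    // Use col("c5").as("c4") instead to keep "ds2".
    lit("ds1").as("c4")
  )

// union matches columns by position, so pulled must be ordered c1, c2, c3, c4.
val df_output = df1.union(pulled)
df_output.show()
```

With the sample data this should yield the four df1 rows plus a1/30 and a1/40 marked INACTIVE. Row order from `show()` is not guaranteed unless an explicit `orderBy` is added.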