
How do I merge rows from different DataFrames together? For example, I have this first DataFrame in Scala:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+

This DataFrame has the years 2012, 1997, and 2015, and we have another DataFrame like this:

+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|  BMW|    3|          No comment|     |
|1997|   VW|  GTI|                 get|     |
|2015|   MB| C200|                good| null|
+----+-----+-----+--------------------+-----+

It also has 2012, 1997, and 2015. How can we merge the rows for the same year together? Thanks.

The output should look like this:

+----+-----+-----+--------------------+-----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |  BMW|    3|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |   VW|  GTI|                 get|     |
|2015|Chevy| Volt|                null| null|   MB| C200|                good| null|
+----+-----+-----+--------------------+-----+-----+-----+--------------------+-----+

Answer


You can get the table you want with a simple join. Something like:

val joined = df1.join(df2, df1("year") === df2("year")) 

I loaded your input and, for example, I see the following:

scala> df1.show 
... 
year make model comment 
2012 Tesla S  No comment 
1997 Ford E350 Go get one now 
2015 Chevy Volt null 

scala> df2.show 
... 
year make model comment 
2012 BMW 3  No comment 
1997 VW GTI get 
2015 MB C200 good 
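For context, the two inputs could be built in the spark-shell roughly like this; the Car case class and the literal rows are only a sketch reconstructed from the tables above, not the asker's actual loading code:

import sqlContext.implicits._

// Illustrative case class matching the sample columns (reconstructed, not from the post)
case class Car(year: String, make: String, model: String, comment: String)

val df1 = Seq(
  Car("2012", "Tesla", "S", "No comment"),
  Car("1997", "Ford", "E350", "Go get one now"),
  Car("2015", "Chevy", "Volt", null)
).toDF()

val df2 = Seq(
  Car("2012", "BMW", "3", "No comment"),
  Car("1997", "VW", "GTI", "get"),
  Car("2015", "MB", "C200", "good")
).toDF()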

When I run the join, I get:

scala> val joined = df1.join(df2, df1("year") === df2("year")) 
joined: org.apache.spark.sql.DataFrame = [year: string, make: string, model: string, comment: string, year: string, make: string, model: string, comment: string] 

scala> joined.show 
... 
year make model comment  year make model comment 
2012 Tesla S  No comment  2012 BMW 3  No comment 
2015 Chevy Volt null   2015 MB C200 good 
1997 Ford E350 Go get one now 1997 VW GTI get 

One thing to note is that your column names may be ambiguous, since they are named the same in both DataFrames (so you may want to rename them to make operations on the resulting DataFrame easier to write).
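For example, here is a minimal sketch of renaming the overlapping columns before the join; the new names are just illustrative, and joining on a shared column name requires Spark 1.4 or later:

// Give df2's overlapping columns distinct names so the joined result is unambiguous
val df2Renamed = df2
  .withColumnRenamed("make", "make2")
  .withColumnRenamed("model", "model2")
  .withColumnRenamed("comment", "comment2")

// Joining on the shared column name also keeps a single year column in the result
val joinedRenamed = df1.join(df2Renamed, "year")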


Does Spark have inner joins, left joins, right joins, and full joins? Thanks –
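For reference, DataFrame.join also accepts a join type as a third argument, so all of those are supported; a quick sketch using the same df1 and df2 as above:

// "inner" is the default; other join types are passed as strings
val innerJoined = df1.join(df2, df1("year") === df2("year"), "inner")
val leftJoined  = df1.join(df2, df1("year") === df2("year"), "left_outer")
val rightJoined = df1.join(df2, df1("year") === df2("year"), "right_outer")
val fullJoined  = df1.join(df2, df1("year") === df2("year"), "outer")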