2017-08-11

I have two DataFrames and need to do a groupBy

Dataframe1 contains key/value pairs:

+------+-----------------+
| Key  | Value           |
+------+-----------------+
| key1 | Column1         |
+------+-----------------+
| key2 | Column2         |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+

Second DataFrame:

This is the actual DataFrame on which I need to apply the groupBy operation:

+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A       | A1      | XYZ     | 100    |
+---------+---------+---------+--------+
| A       | A1      | XYZ     | 100    |
+---------+---------+---------+--------+
| A       | A2      | XYZ     | 10     |
+---------+---------+---------+--------+
| A       | A3      | PQR     | 100    |
+---------+---------+---------+--------+
| B       | B1      | XYZ     | 200    |
+---------+---------+---------+--------+
| B       | B2      | PQR     | 280    |
+---------+---------+---------+--------+
| B       | B3      | XYZ     | 20     |
+---------+---------+---------+--------+

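Dataframe1 holds the key/value pairs. Given a key from dataframe1, the code has to look up the corresponding value and run the groupBy on dataframe2 with it. In pseudocode:

Dframe = df.groupBy($"key").sum("amount").show()

Expected output: three DataFrames, one per key in dataframe1.

d1 = df.groupBy($"key1").sum("amount").show()

which must resolve to: df.groupBy($"column1").sum("amount").show()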

+---+-----+ 
| A | 310 | 
+---+-----+ 
| B | 500 | 
+---+-----+ 

Code:

d2 = df.groupBy($"key2").sum("amount").show()

which must resolve to: df.groupBy($"column2").sum("amount").show()

DataFrame:

+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10  |
+----+-----+

Code:

d3 = df.groupBy($"key3").sum("amount").show()

DataFrame:

+---+-----+-----+
| A | XYZ | 210 |
+---+-----+-----+
| A | PQR | 100 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+

In the future, if I add more keys, it should produce a DataFrame for each of them as well. Can someone help me?

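Answer

2

Given the key/value DataFrame as (I would suggest not building this one as a DataFrame from the source data, for the reason given below)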

+----+---------------+
|Key |Value          |
+----+---------------+
|key1|Column1        |
|key2|Column2        |
|key3|Column1,Column3|
+----+---------------+

and the actual DataFrame as

+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A      |A1     |XYZ    |100   |
|A      |A1     |XYZ    |100   |
|A      |A2     |XYZ    |10    |
|A      |A3     |PQR    |100   |
|B      |B1     |XYZ    |200   |
|B      |B2     |PQR    |280   |
|B      |B3     |XYZ    |20    |
+-------+-------+-------+------+
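To reproduce the two tables above locally, something like the following sketch should work. The object name `SampleData` and its two methods are made up for illustration, and it assumes a `SparkSession` is available (as it is in spark-shell, where it is called `spark`).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, only here to rebuild the sample tables shown above.
object SampleData {
  // Key/value lookup table: each value is a comma-separated list of grouping columns.
  def keys(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("key1", "Column1"),
      ("key2", "Column2"),
      ("key3", "Column1,Column3")
    ).toDF("Key", "Value")
  }

  // The actual data to aggregate.
  def data(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("A", "A1", "XYZ", 100),
      ("A", "A1", "XYZ", 100),
      ("A", "A2", "XYZ", 10),
      ("A", "A3", "PQR", 100),
      ("B", "B1", "XYZ", 200),
      ("B", "B2", "PQR", 280),
      ("B", "B3", "XYZ", 20)
    ).toDF("Column1", "Column2", "Column3", "Amount")
  }
}
```

With these, `val df1 = SampleData.keys(spark)` and `val df2 = SampleData.data(spark)` give the `df1` and `df2` used in the rest of this answer.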

I would suggest converting the first DataFrame into an array of key/value pairs through its RDD:

val maps = df1.rdd.map(row => row(0) -> row(1)).collect() 

Then loop over the pairs:

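import org.apache.spark.sql.functions._ 
for (kv <- maps) { 
    // kv._2 is the comma-separated list of grouping columns, e.g. "Column1,Column3"
    // col("Amount") is used instead of $"Amount" so that spark.implicits._ is not required
    df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum(col("Amount"))).show(false) 
    // You can store the results in separate DataFrames or write them to files or a database 
} 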

You should get the following output:

+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B      |500        |
|A      |310        |
+-------+-----------+

+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2     |10         |
|B2     |280        |
|B1     |200        |
|B3     |20         |
|A3     |100        |
|A1     |200        |
+-------+-----------+

+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B      |PQR    |280        |
|B      |XYZ    |220        |
|A      |PQR    |100        |
|A      |XYZ    |210        |
+-------+-------+-----------+
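
If you would rather keep the results than just print them, one option is to collect them into a Map from key to aggregated DataFrame. This is only a sketch: `GroupByKeys` and `groupByKeys` are illustrative names, and it assumes the key table has string columns named `Key` and `Value` and the data has a numeric `Amount` column, as in the tables above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical helper: one aggregated DataFrame per key.
object GroupByKeys {
  // Assumes `keys` has string columns "Key" and "Value",
  // and `data` has a numeric "Amount" column.
  def groupByKeys(keys: DataFrame, data: DataFrame): Map[String, DataFrame] =
    keys.select("Key", "Value").collect().map { row =>
      // "Column1,Column3" -> Array(col("Column1"), col("Column3"))
      val groupCols = row.getString(1).split(",").map(_.trim).map(col)
      row.getString(0) -> data.groupBy(groupCols: _*).agg(sum("Amount"))
    }.toMap
}
```

For example, `GroupByKeys.groupByKeys(df1, df2)("key3").show(false)` would print the Column1/Column3 totals. Because the key table is collected to the driver, this only makes sense while the number of keys stays small; adding a new key to the table needs no code change.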
+0

Thanks for the reply, this is what I was looking for. Can I union all the DataFrames? – prapthi

+1

What would the columns of the unioned DataFrame be? For a union, all the DataFrames must have the same number of columns. –

+0

Is it possible to do a unionAll for just Column1 and Column2? – prapthi
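
One possible way to do that, as a sketch rather than a definitive answer: since both the Column1 and the Column2 results have exactly two columns, they can be stacked once they are reshaped to a common schema. The names `UnionGrouped`, `GroupColumn`, `GroupValue` and `TotalAmount` are made up for illustration. The sketch uses `union`, which is available from Spark 2.0 on (`unionAll` is the older name) and matches columns by position, not by name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sum}

// Hypothetical helper: groups by each column separately, then stacks the results.
object UnionGrouped {
  // `groupCols` must be non-empty (reduce fails on an empty Seq).
  // Every partial result is reshaped to (GroupColumn, GroupValue, TotalAmount)
  // so that all of them share the same schema before the union.
  def unionGrouped(data: DataFrame, groupCols: Seq[String]): DataFrame =
    groupCols.map { name =>
      data.groupBy(col(name))
        .agg(sum("Amount").as("TotalAmount"))
        .select(
          lit(name).as("GroupColumn"),               // which column the row was grouped by
          col(name).cast("string").as("GroupValue"), // the group's value in that column
          col("TotalAmount")
        )
    }.reduce(_ union _)
}
```

Calling `UnionGrouped.unionGrouped(df2, Seq("Column1", "Column2")).show(false)` would then give a single DataFrame with one row per group. The extra `GroupColumn` column is a design choice: without it, rows coming from the Column1 grouping and the Column2 grouping could not be told apart after the union.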