2017-05-30 89 views
Transpose in Spark

I have a DataFrame that looks like this:

+---------+------------+-----------------+--------+
|       ID|     reg_num|          reg_typ|reg_code|
+---------+------------+-----------------+--------+
|523528690|134886307000|Chamber of Commer|   14246|
|523528690| 2015/369956|Government Gazett|   14225|
|523528690|   997253630| Tax Registration|   14259|
|523528691|   997253633|          Tax Doc|   14250|
|523528691|   997253634|         Tax File|   14251|
|523528691|   997253635|         Tax Data|   14252|
|523528691|   997253636|      Tax Monitor|   14253|
+---------+------------+-----------------+--------+

Now I am trying to produce output in this format:

+---------+------------+-----------------+--------+------------+-----------+---------+---------+
|       ID|     reg_num|          reg_typ|reg_code|       reg_1|      reg_2|    reg_3|    reg_4|
+---------+------------+-----------------+--------+------------+-----------+---------+---------+
|523528690|134886307000|Chamber of Commer|   14246|134886307000|2015/369956|997253630|     null|
|523528690| 2015/369956|Government Gazett|   14225|134886307000|2015/369956|997253630|     null|
|523528690|   997253630| Tax Registration|   14259|134886307000|2015/369956|997253630|     null|
|523528691|   997253633|          Tax Doc|   14250|   997253633|  997253634|997253635|997253636|
|523528691|   997253634|         Tax File|   14251|   997253633|  997253634|997253635|997253636|
|523528691|   997253635|         Tax Data|   14252|   997253633|  997253634|997253635|997253636|
|523528691|   997253636|      Tax Monitor|   14253|   997253633|  997253634|997253635|997253636|
+---------+------------+-----------------+--------+------------+-----------+---------+---------+

I have looked at built-in functions such as pivot, but they do not seem to fit my case.

I am using Spark version 1.6 and Scala version 2.10.5.

Any help is appreciated!

@eliasah The solution solved the problem and works as required. Thanks :) – Svk

Glad to hear it! – eliasah

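@eliasah Just one question: when I run this on a large dataset, the reg_1, ..., reg_4 columns are not arranged in the same order as the original DataFrame, in that the first reg_num does not correspond to reg_1. Is that because the window function uses an order by clause? – Svk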

Answer

Pivot is the way to go, but the logic behind it is not obvious:

import org.apache.spark.sql.expressions.Window
// Needed for row_number and first used below
import org.apache.spark.sql.functions._

val df = Seq(
    (523528690, "134886307000", "Chamber of Commer", 14246), 
    (523528690, "2015/369956", "Government Gazett", 14225), 
    (523528690, "997253630", "Tax Registration", 14259), 
    (523528691, "997253633", "Tax Doc", 14250), 
    (523528691, "997253634", "Tax File", 14251), 
    (523528691, "997253635", "Tax Data", 14252), 
    (523528691, "997253636", "Tax Monitor", 14253)).toDF("id", "reg_num", "reg_type", "reg_code") 

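// Window used to number the rows of each id.
// reg_num is a string column, so this ordering is lexicographic.
val w = Window.partitionBy("id").orderBy("reg_num")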
df.show 
// +---------+------------+-----------------+--------+
// |       id|     reg_num|         reg_type|reg_code|
// +---------+------------+-----------------+--------+
// |523528690|134886307000|Chamber of Commer|   14246|
// |523528690| 2015/369956|Government Gazett|   14225|
// |523528690|   997253630| Tax Registration|   14259|
// |523528691|   997253633|          Tax Doc|   14250|
// |523528691|   997253634|         Tax File|   14251|
// |523528691|   997253635|         Tax Data|   14252|
// |523528691|   997253636|      Tax Monitor|   14253|
// +---------+------------+-----------------+--------+


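// 1. Number the rows of each id (rn = 1, 2, 3, ...).
// 2. Pivot on rn, so each row number becomes a column holding that row's reg_num.
// 3. Join the one-row-per-id result back onto the original rows.
val pivoted = df
  .withColumn("rn", row_number().over(w))
  .groupBy("id")
  .pivot("rn")
  .agg(first("reg_num"))

val df2 = df.join(pivoted, Seq("id"))
df2.show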
// +---------+------------+-----------------+--------+------------+-----------+---------+---------+
// |       id|     reg_num|         reg_type|reg_code|           1|          2|        3|        4|
// +---------+------------+-----------------+--------+------------+-----------+---------+---------+
// |523528690|134886307000|Chamber of Commer|   14246|134886307000|2015/369956|997253630|     null|
// |523528690| 2015/369956|Government Gazett|   14225|134886307000|2015/369956|997253630|     null|
// |523528690|   997253630| Tax Registration|   14259|134886307000|2015/369956|997253630|     null|
// |523528691|   997253633|          Tax Doc|   14250|   997253633|  997253634|997253635|997253636|
// |523528691|   997253634|         Tax File|   14251|   997253633|  997253634|997253635|997253636|
// |523528691|   997253635|         Tax Data|   14252|   997253633|  997253634|997253635|997253636|
// |523528691|   997253636|      Tax Monitor|   14253|   997253633|  997253634|997253635|997253636|
// +---------+------------+-----------------+--------+------------+-----------+---------+---------+
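
A follow-up sketch, not part of the original answer. The comments on the question report that on a larger dataset reg_1, ..., reg_4 did not follow the original row order. That is expected with the code above, because the window sorts by reg_num, which is a string column. One possible workaround, assuming the DataFrame's current row order is the order you want to keep, is to tag each row with monotonically_increasing_id() and order the window by that tag instead. Spark does not guarantee row order in general, so a real ordering column is safer if you have one. The names row_id, indexed, wOrig, byPosition, renamed and result are hypothetical; df is the DataFrame defined in the answer, and the snippet is meant to be pasted into spark-shell on Spark 1.6. It also renames the pivot columns from 1, 2, ... to reg_1, reg_2, ... as requested in the question.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: the current row order of df is the order to preserve.
// row_id is a hypothetical helper column that increases with row position.
val indexed = df.withColumn("row_id", monotonically_increasing_id())

// Number rows per id by their original position rather than by reg_num
val wOrig = Window.partitionBy("id").orderBy("row_id")

// One row per id, with columns "1", "2", ... holding the reg_num at that position
val byPosition = indexed
  .withColumn("rn", row_number().over(wOrig))
  .groupBy("id")
  .pivot("rn")
  .agg(first("reg_num"))

// Rename every pivot column "n" to "reg_n", leaving the join key untouched
val renamed = byPosition.columns.foldLeft(byPosition) { (acc, c) =>
  if (c == "id") acc else acc.withColumnRenamed(c, s"reg_$c")
}

// Attach the per-id columns to every original row
val result = df.join(renamed, Seq("id"))
result.show
```

For the sample data the original order happens to match the string order of reg_num, so the values should be the same as in df2 and only the column names change. The difference only shows up when the two orders disagree.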