2016-10-01 37 views
0

如何轉置行使用RDD或不透視的數據幀列...星火:行至Colimns(移調的種類或樞軸)

SessionId,date,orig, dest, legind, nbr 

1 9/20/16,abc0,xyz0,o,1 
1 9/20/16,abc1,xyz1,o,2 
1 9/20/16,abc2,xyz2,i,3 
1 9/20/16,abc3,xyz3,i,4 

所以我要生成像新的模式:

SessionId,date,orig1, orig2, orig3, orig4, dest1, dest2, dest3,dest4 

1,9/20/16,abc0,abc1,null, null, xyz0,xyz1, null, null 

邏輯是,如果:

  • NBR是1和legind = O然後orig1值(從第1行讀取)...

  • NBR是3和legind =我再DEST1值(從3行獲取)

那麼怎麼行轉來列...

任何想法將是巨大的讚賞。

試着用下面的選項,但它只是壓扁的所有單列..

val keys = List("SessionId"); 
val selectFirstValueOfNoneGroupedColumns = 
    df.columns 
    .filterNot(keys.toSet) 
    .map(_ -> "first").toMap 
val grouped = 
    df.groupBy(keys.head, keys.tail: _*) 
    .agg(selectFirstValueOfNoneGroupedColumns).show() 
+0

Weclome到SO。請花一些時間在[help](http://stackoverflow.com/help/how-to-ask)上獲得如何成熟的純文字,你可以問問題和格式整齊 –

+0

是的,它很簡單,它已經有了答案!按照我給的鏈接。 –

+0

@ TheArchetypalPaul:你提供的鏈接,它是不同的問題和方法。 – Ankur

回答

1

如果使用pivot功能是比較簡單的。首先讓我們創建一個數據集就像一個在你的問題:

import org.apache.spark.sql.functions.{concat, first, lit, when} 

val df = Seq(
    ("1", "9/20/16", "abc0", "xyz0", "o", "1"), 
    ("1", "9/20/16", "abc1", "xyz1", "o", "2"), 
    ("1", "9/20/16", "abc2", "xyz2", "i", "3"), 
    ("1", "9/20/16", "abc3", "xyz3", "i", "4") 
).toDF("SessionId", "date", "orig", "dest", "legind", "nbr") 

然後定義並附加輔助列:

// This will be the column name 
val key = when($"legind" === "o", concat(lit("orig"), $"nbr")) 
      .when($"legind" === "i", concat(lit("dest"), $"nbr")) 

// This will be the value 
val value = when($"legind" === "o", $"orig")  // If o take origin 
       .when($"legind" === "i", $"dest") // If i take dest 

val withKV = df.withColumn("key", key).withColumn("value", value) 

這將導致DataFrame這樣的:

+---------+-------+----+----+------+---+-----+-----+ 
|SessionId| date|orig|dest|legind|nbr| key|value| 
+---------+-------+----+----+------+---+-----+-----+ 
|  1|9/20/16|abc0|xyz0|  o| 1|orig1| abc0| 
|  1|9/20/16|abc1|xyz1|  o| 2|orig2| abc1| 
|  1|9/20/16|abc2|xyz2|  i| 3|dest3| xyz2| 
|  1|9/20/16|abc3|xyz3|  i| 4|dest4| xyz3| 
+---------+-------+----+----+------+---+-----+-----+ 

接下來讓我們定義一個可能級別的列表:

val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y")) 

最後pivot

val result = withKV 
    .groupBy($"sessionId", $"date") 
    .pivot("key", levels) 
    .agg(first($"value", true)).show 

而且result是:

+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+ 
|sessionId| date|orig1|orig2|orig3|orig4|dest1|dest2|dest3|dest4| 
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+ 
|  1|9/20/16| abc0| abc1| null| null| null| null| xyz2| xyz3| 
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+