Convert a Spark DataFrame groupBy into a sequence of DataFrames

I am working with a Spark DataFrame (in Scala), and what I want to do is group by a column and get the different groups back as a sequence of DataFrames.

So it would look something like

df.groupBy("col").toSeq -> Seq[DataFrame] 

Even better would be a mapping from each key to its DataFrame:

df.groupBy("col").toSeq -> Map[key, DataFrame] 

This seems like an obvious thing to want, but I can't seem to figure out how it might work.

Answer


This is one way you can do it. Here is a simple example:

import spark.implicits._

// create a DataFrame with dummy data
val data = spark.sparkContext.parallelize(Seq(
  (29, "City 2", 72),
  (28, "City 3", 48),
  (28, "City 2", 19),
  (27, "City 2", 16),
  (28, "City 1", 84),
  (28, "City 4", 72),
  (29, "City 4", 39),
  (27, "City 3", 42),
  (26, "City 3", 68),
  (27, "City 1", 89),
  (27, "City 4", 104),
  (26, "City 2", 19),
  (29, "City 3", 27)
)).toDF("week", "city", "sale")


// get the list of distinct cities
val city = data.select("city").distinct.collect().flatMap(_.toSeq)

// pair each city with the rows that belong to it
// this returns Array[(Any, DataFrame)] as (city, DataFrame)
val result = city.map(c => c -> data.where($"city" === c))

// print all the DataFrames
result.foreach { case (c, df) =>
  println(s"Dataframe with $c")
  df.show()
}

The output looks like this:

Dataframe with City 1

+----+------+----+ 
|week| city|sale| 
+----+------+----+ 
| 28|City 1| 84| 
| 27|City 1| 89| 
+----+------+----+ 

Dataframe with City 3

+----+------+----+ 
|week| city|sale| 
+----+------+----+ 
| 28|City 3| 48| 
| 27|City 3| 42| 
| 26|City 3| 68| 
| 29|City 3| 27| 
+----+------+----+ 

Dataframe with City 4

+----+------+----+ 
|week| city|sale| 
+----+------+----+ 
| 28|City 4| 72| 
| 29|City 4| 39| 
| 27|City 4| 104| 
+----+------+----+ 

Dataframe with City 2

+----+------+----+ 
|week| city|sale| 
+----+------+----+ 
| 29|City 2| 72| 
| 28|City 2| 19| 
| 27|City 2| 16| 
| 26|City 2| 19| 
+----+------+----+ 
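The question also asked for the groups keyed by their value. Since result is already a collection of (key, DataFrame) pairs, it can be collected into a Map; a minimal sketch, reusing result from above:

val byCity = result.toMap
// look up a single group's DataFrame by its key
byCity("City 1").show()

Keep in mind that each DataFrame here is just a filtered view of the original data, so every show() (or other action) runs its own Spark job.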

You can also group the data with partitionBy and write it straight to the output:

dataframe.write.partitionBy("col").saveAsTable("outputpath") 

This creates output grouped by "col", one partition per key. Hope this helps!
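Note that saveAsTable expects a table name rather than a file path. If the goal is partitioned files on disk, here is a hedged sketch using the Parquet writer, where the output path is purely illustrative:

// assumption: the output path below is an example, not from the original post
data.write
  .partitionBy("city")
  .mode("overwrite")
  .parquet("/tmp/sales_by_city")

This produces one subdirectory per distinct city value, which readers can later prune when filtering on that column.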


Thanks - this is perfect. The suggested duplicate also answered my question, but I have accepted your answer. – thebigdog


Thanks a lot for accepting, @thebigdog :) –
