Reduce a Spark DataFrame to omit empty cells

I have a DataFrame like this:
val df = sc.parallelize(List((1, 2012, 3, 5), (2, 2012, 4, 7), (1,2013, 1, 3), (2, 2013, 9, 5))).toDF("id", "year", "propA", "propB")
Using code inspired by Pivot Spark Dataframe:
import org.apache.spark.sql.functions._
import sq.implicits._   // sq is the SQLContext used below
val years = List("2012", "2013")
val numYears = years.length - 1
var query2 = "select id, "
for (i <- 0 to numYears - 1) {
  query2 += "case when year = " + years(i) + " then propA else 0 end as propA" + years(i) + ", "
  query2 += "case when year = " + years(i) + " then propB else 0 end as propB" + years(i) + ", "
}
query2 += "case when year = " + years.last + " then propA else 0 end as " + "propA" + years.last + ", "
query2 += "case when year = " + years.last + " then propB else 0 end as " + "propB" + years.last + " from myTable"
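As a side note, the loop plus the special-cased last element can be collapsed into a single expression. This is a sketch that assumes the same `years` list and produces the same SQL string:

```scala
// Sketch: build the same pivot query without the loop/last-element split.
val years = List("2012", "2013")
val selectExprs = years.flatMap { y =>
  Seq(s"case when year = $y then propA else 0 end as propA$y",
      s"case when year = $y then propB else 0 end as propB$y")
}
val query2 = "select id, " + selectExprs.mkString(", ") + " from myTable"
```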
df.registerTempTable("myTable")
val myDF1 = sq.sql(query2)
I managed to get:
+---+---------+---------+---------+---------+
| id|propA2012|propB2012|propA2013|propB2013|
+---+---------+---------+---------+---------+
|  1|        3|        5|        0|        0|
|  2|        4|        7|        0|        0|
|  1|        0|        0|        1|        3|
|  2|        0|        0|        9|        5|
+---+---------+---------+---------+---------+
which I managed to reduce to:

id propA2012 propB2012 propA2013 propB2013
 1         3         5         1         3
 2         4         7         9         5

using:
val df2 = myDF1.groupBy("id").agg(
"propA2012" -> "sum",
"propA2013" -> "sum",
"propB2013" -> "sum",
"propB2012" -> "sum")
Is there a way to iterate over all the columns instead of specifying each column name?
Maybe [.groupBy](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.groupBy)? –
If I use myDF1.groupBy("id") I don't get a DataFrame as a result, but grouped data, and I don't know how to manage it... if you can produce a snippet I will accept your answer – user299791
Yes, you get [GroupedData](https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html#pyspark.sql.GroupedData), so you need to aggregate –
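Building on that comment: one way to aggregate without listing each column by hand is to derive the aggregate expressions from the DataFrame's own schema. This is a sketch, assuming the Scala DataFrame API (Spark 1.5+) and that "id" is the only grouping key; it needs a running Spark context, so it is illustrative rather than tested:

```scala
import org.apache.spark.sql.functions.sum

// Build a sum aggregate for every column except the grouping key,
// taking the names from the DataFrame's schema instead of hard-coding them.
val aggCols = myDF1.columns.filter(_ != "id").map(c => sum(c).as(c))

// agg takes one Column plus varargs, so split the array into head and tail.
val df2 = myDF1.groupBy("id").agg(aggCols.head, aggCols.tail: _*)
```

The result should match the hand-written agg above, regardless of how many propA/propB year columns the pivot produced.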