DataFrame in PySpark - how to apply aggregate functions to two columns?

I'm working with a DataFrame in PySpark. I have a table like Table 1 below, and I need to produce Table 2, where:

num_category - the number of different categories for each id
sum(count) - the sum of the third column of Table 1 for each id.

Example:

Table 1

id |category | count 

1 | 4 | 1 
1 | 3 | 2 
1 | 1 | 2 
2 | 2 | 1 
2 | 1 | 1

Table 2

id |num_category| sum(count) 

1 | 3  | 5 
2 | 2  | 2

I tried:

table1 = data.groupBy("id","category").agg(count("*")) 
cat = table1.groupBy("id").agg(count("*")) 
count = table1.groupBy("id").agg(func.sum("count")) 
table2 = cat.join(count, cat.id == count.id)

Error:

 1 table1 = data.groupBy("id","category").agg(count("*")) 
---> 2 cat = table1.groupBy("id").agg(count("*")) 
     count = table1.groupBy("id").agg(func.sum("count")) 
     table2 = cat.join(count, cat.id == count.id) 
TypeError: 'DataFrame' object is not callable

Source

2017-07-28 Thaise

You can aggregate several columns at once on the grouped data:

data.groupby('id').agg({'category':'count','count':'sum'}).withColumnRenamed('count(category)',"num_category").show() 
+---+-------+--------+ 
| id|num_cat|sum(cnt)| 
+---+-------+--------+ 
| 1|  3|  5| 
| 2|  2|  2| 
+---+-------+--------+

Source

2017-07-28 16:10:46 Suresh

It's perfect! Thanks! – Thaise
