
Dropping duplicates in a Spark DataFrame of array structs by the last item of the array struct

So my table looks like this:

customer_1|place|customer_2|item          |count
------------------------------------------------
    a     | NY  |    b     |(2010,304,310)|  34
    a     | NY  |    b     |(2024,201,310)|  21
    a     | NY  |    b     |(2010,304,312)|  76
    c     | NY  |    x     |(2010,304,310)|  11
    a     | NY  |    b     |(453,131,235) |  10

I tried the following, but it does not eliminate the duplicates, because the original array is still there (as it should be; I need it for the end result).

val df = df_one
  .withColumn("vs", struct(col("item").getItem(size(col("item")) - 1), col("item"), col("count")))
  .groupBy(col("customer_1"), col("place"), col("customer_2"))
  .agg(max("vs").alias("vs"))
  .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

I want to group by the customer_1, place and customer_2 columns and return only the array structs that are unique by their last item (-1), keeping the one with the highest count. Any ideas?
Expected output:

customer_1|place|customer_2|item          |count
------------------------------------------------
    a     | NY  |    b     |(2010,304,312)|  76
    a     | NY  |    b     |(2010,304,310)|  34
    a     | NY  |    b     |(453,131,235) |  10
    c     | NY  |    x     |(2010,304,310)|  11

Answer


Given a dataframe with this schema:

root 
|-- customer_1: string (nullable = true) 
|-- place: string (nullable = true) 
|-- customer_2: string (nullable = true) 
|-- item: array (nullable = true) 
| |-- element: integer (containsNull = false) 
|-- count: string (nullable = true) 

You can apply the concat function to create a temp column for checking duplicate rows, as below; the temp column is dropped again once the deduplication is done:

import org.apache.spark.sql.functions._

// Build a key from the grouping columns plus the last element of the item array,
// then drop rows that share the same key.
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1)))
  .dropDuplicates("temp")
  .drop("temp")
You should get the following output:

+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item            |count|
+----------+-----+----------+----------------+-----+
|a         |NY   |b         |[2010, 304, 312]|76   |
|c         |NY   |x         |[2010, 304, 310]|11   |
|a         |NY   |b         |[453, 131, 235] |10   |
|a         |NY   |b         |[2010, 304, 310]|34   |
+----------+-----+----------+----------------+-----+
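
Note that dropDuplicates keeps whichever row Spark happens to encounter first for each key, so the row with the highest count is not guaranteed to survive. If that guarantee matters, one option is a window function; a minimal sketch, assuming the same column names as above (count is a string in the schema, hence the cast):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows within each (customer_1, place, customer_2, last item) group
// by descending count, then keep only the top-ranked row per group.
val w = Window
  .partitionBy($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1))
  .orderBy($"count".cast("int").desc)

df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")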

Struct

Given a dataframe with this schema:

root 
|-- customer_1: string (nullable = true) 
|-- place: string (nullable = true) 
|-- customer_2: string (nullable = true) 
|-- item: struct (nullable = true) 
| |-- _1: integer (nullable = false) 
| |-- _2: integer (nullable = false) 
| |-- _3: integer (nullable = false) 
|-- count: string (nullable = true) 

We can still do the same as above, with a slight change in how the third item is fetched from the struct:
import org.apache.spark.sql.functions._

df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item._3"))
  .dropDuplicates("temp")
  .drop("temp")
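
For completeness, here is a self-contained way to try the struct variant, assuming the sample rows from the question (Scala tuples become struct fields _1, _2, _3 when converted to a DataFrame):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("dedup").getOrCreate()
import spark.implicits._

// Sample data matching the struct schema from the question.
val df = Seq(
  ("a", "NY", "b", (2010, 304, 310), "34"),
  ("a", "NY", "b", (2024, 201, 310), "21"),
  ("a", "NY", "b", (2010, 304, 312), "76"),
  ("c", "NY", "x", (2010, 304, 310), "11"),
  ("a", "NY", "b", (453, 131, 235), "10")
).toDF("customer_1", "place", "customer_2", "item", "count")

df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item._3"))
  .dropDuplicates("temp")
  .drop("temp")
  .show(false)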

Hope the answer is helpful.