
Dropping duplicates in a Spark DataFrame of array structs by the last item of the array struct

So my table looks like this:

customer_1|place|customer_2|item          |count
------------------------------------------------
    a     | NY  |    b     |(2010,304,310)|  34
    a     | NY  |    b     |(2024,201,310)|  21
    a     | NY  |    b     |(2010,304,312)|  76
    c     | NY  |    x     |(2010,304,310)|  11
    a     | NY  |    b     |(453,131,235) |  10

I tried the following, but it does not eliminate the duplicates, because the original array is still there (as it should be; I need it for the end result).

val df = df_one
  .withColumn("vs", struct(col("item").getItem(size(col("item")) - 1), col("item"), col("count")))
  .groupBy(col("customer_1"), col("place"), col("customer_2"))
  .agg(max("vs").alias("vs"))
  .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))

I want to group by the customer_1, place and customer_2 columns and return only the array structs that are unique by their last item (-1), keeping the one with the highest count. Any ideas?
Expected output:

customer_1|place|customer_2|item          |count
------------------------------------------------
    a     | NY  |    b     |(2010,304,312)|  76
    a     | NY  |    b     |(2010,304,310)|  34
    a     | NY  |    b     |(453,131,235) |  10
    c     | NY  |    x     |(2010,304,310)|  11

Answer


Given a dataframe with this schema:

root 
|-- customer_1: string (nullable = true) 
|-- place: string (nullable = true) 
|-- customer_2: string (nullable = true) 
|-- item: array (nullable = true) 
| |-- element: integer (containsNull = false) 
|-- count: string (nullable = true) 

You can apply the concat function to create a temp column for checking duplicate rows, as below; the temp column is dropped again once the deduplication is done:

import org.apache.spark.sql.functions._

// Build a key from the grouping columns plus the last element of the item array,
// then drop rows that share the same key.
df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1)))
  .dropDuplicates("temp")
  .drop("temp")
You should get the following output:

+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item            |count|
+----------+-----+----------+----------------+-----+
|a         |NY   |b         |[2010, 304, 312]|76   |
|c         |NY   |x         |[2010, 304, 310]|11   |
|a         |NY   |b         |[453, 131, 235] |10   |
|a         |NY   |b         |[2010, 304, 310]|34   |
+----------+-----+----------+----------------+-----+
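
Note that dropDuplicates keeps whichever row Spark happens to encounter first for each key, so the row with the highest count is not guaranteed to survive. If that guarantee matters, one option is a window function; a minimal sketch, assuming the same column names as above (count is a string in the schema, hence the cast):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows within each (customer_1, place, customer_2, last item) group
// by descending count, then keep only the top-ranked row per group.
val w = Window
  .partitionBy($"customer_1", $"place", $"customer_2", $"item"(size($"item") - 1))
  .orderBy($"count".cast("int").desc)

df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")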

Struct

Given a dataframe with this schema:

root 
|-- customer_1: string (nullable = true) 
|-- place: string (nullable = true) 
|-- customer_2: string (nullable = true) 
|-- item: struct (nullable = true) 
| |-- _1: integer (nullable = false) 
| |-- _2: integer (nullable = false) 
| |-- _3: integer (nullable = false) 
|-- count: string (nullable = true) 

We can still do the same as above, with a slight change in how the third item is fetched from the struct:
import org.apache.spark.sql.functions._

df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item._3"))
  .dropDuplicates("temp")
  .drop("temp")
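
For completeness, here is a self-contained way to try the struct variant, assuming the sample rows from the question (Scala tuples become struct fields _1, _2, _3 when converted to a DataFrame):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("dedup").getOrCreate()
import spark.implicits._

// Sample data matching the struct schema from the question.
val df = Seq(
  ("a", "NY", "b", (2010, 304, 310), "34"),
  ("a", "NY", "b", (2024, 201, 310), "21"),
  ("a", "NY", "b", (2010, 304, 312), "76"),
  ("c", "NY", "x", (2010, 304, 310), "11"),
  ("a", "NY", "b", (453, 131, 235), "10")
).toDF("customer_1", "place", "customer_2", "item", "count")

df.withColumn("temp", concat($"customer_1", $"place", $"customer_2", $"item._3"))
  .dropDuplicates("temp")
  .drop("temp")
  .show(false)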

Hope the answer is helpful.