What you have already tried should get you most of the way to a solution.
Your data looks like this:
val df = sc.parallelize(Array(
(1, "Shan", 101),
(2, "Shan", 101),
(3, "John", 102),
(4, "Michel", 103)
)).toDF("id","name","number")
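You were right to group and count. If you first collect the names that occur more than once, like this
import org.apache.spark.sql.functions.col
val repeatedNames = df.groupBy("name").count.where(col("count")>1).withColumnRenamed("name","repeated").drop("count")
then you can join back against df afterwards to get complete rows, with all their columns, split the way you want: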
val repeated = df.join(repeatedNames, repeatedNames("repeated")===df("name")).drop("repeated")
val distinct = df.except(repeated)
repeated.show()
+---+----+------+
| id|name|number|
+---+----+------+
|  1|Shan|   101|
|  2|Shan|   101|
+---+----+------+
distinct.show()
+---+------+------+
| id|  name|number|
+---+------+------+
|  4|Michel|   103|
|  3|  John|   102|
+---+------+------+
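If you would rather avoid the join and the except, a window function is another way to get the same split. This is only a sketch and not part of the approach above: it assumes a local SparkSession (in spark-shell, spark already exists), and name_count, repeatedRows and distinctRows are names I made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("repeated-names").getOrCreate()
import spark.implicits._

// Same sample data as above.
val df = Seq(
  (1, "Shan", 101),
  (2, "Shan", 101),
  (3, "John", 102),
  (4, "Michel", 103)
).toDF("id", "name", "number")

// Attach to every row the number of rows that share its name.
// "name_count" is a hypothetical helper column, dropped again below.
val byName = Window.partitionBy("name")
val counted = df.withColumn("name_count", count(lit(1)).over(byName))

// Rows whose name occurs more than once, with all original columns.
val repeatedRows = counted.where(col("name_count") > 1).drop("name_count")
// Rows whose name occurs exactly once.
val distinctRows = counted.where(col("name_count") === 1).drop("name_count")

repeatedRows.show()
distinctRows.show()
```

Because the count is attached to each row instead of collapsing the group, every column survives without joining back. To treat rows as duplicates by number instead of name, partition the window by "number".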
Hope this helps.
I tried that too, but it only returns the counts: df.groupBy("number").count().select("*").where("count > 1"). I need all the duplicate rows with all their columns.