This can be done with a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

val df = sc.parallelize(Seq(
  ("cat", 8, "01:00"), ("cat", 5, "02:00"), ("cat", 6, "03:00"),
  ("dog", 2, "02:00"), ("dog", 4, "04:00"), ("dog", 3, "06:00"),
  ("rat", 7, "01:00"), ("rat", 4, "03:00"), ("rat", 9, "05:00")
)).toDF("animal", "value", "time")
df.show
+------+-----+-----+
|animal|value| time|
+------+-----+-----+
| cat| 8|01:00|
| cat| 5|02:00|
| cat| 6|03:00|
| dog| 2|02:00|
| dog| 4|04:00|
| dog| 3|06:00|
| rat| 7|01:00|
| rat| 4|03:00|
| rat| 9|05:00|
+------+-----+-----+
I've added a time column to illustrate the orderBy.
val w1 = Window.partitionBy($"animal").orderBy($"time")
val previous_value = lag($"value", 1).over(w1)
val df1 = df.withColumn("previous", previous_value)
df1.show
+------+-----+-----+--------+
|animal|value| time|previous|
+------+-----+-----+--------+
| dog| 2|02:00| null|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| null|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| null|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
+------+-----+-----+--------+
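Once the lagged column exists, it combines with ordinary column arithmetic. As an illustration (the delta column is my own addition, not part of the original answer):

val df_delta = df1.withColumn("delta", $"value" - $"previous")

Note that the null previous values propagate through the subtraction, which is one reason you may want to fill them.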
If you want to replace those nulls with 0:
val df2 = df1.na.fill(0)
df2.show
+------+-----+-----+--------+
|animal|value| time|previous|
+------+-----+-----+--------+
| dog| 2|02:00| 0|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| 0|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| 0|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
+------+-----+-----+--------+
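Keep in mind that na.fill(0) replaces nulls in every numeric column of the DataFrame. If you only want to touch the new column, the column-scoped overload does that:

val df2 = df1.na.fill(0, Seq("previous"))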
I haven't had time to try it, but for now: 1) don't rely on the DataFrame's ordering; add an explicit index column instead, and 2) try repartitioning by animal and then using mapPartitions to do your row offset. It may not be pretty (a rough sketch follows).
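A rough, untested sketch of what that comment suggests, assuming the same three-column df as above (the partitioning scheme and variable names are mine):

import org.apache.spark.HashPartitioner

// Key rows by animal and repartition so all rows for one animal land in one partition.
val byAnimal = df.rdd
  .map(r => (r.getString(0), (r.getInt(1), r.getString(2))))   // (animal, (value, time))
  .partitionBy(new HashPartitioner(df.rdd.partitions.length))

// Within each partition, group by animal, sort each group by time, and pair
// every row with the previous row's value (0 for the first row of each animal).
val withPrevious = byAnimal.mapPartitions { rows =>
  rows.toSeq
    .groupBy { case (animal, _) => animal }
    .iterator
    .flatMap { case (animal, group) =>
      val sorted   = group.map { case (_, vt) => vt }.sortBy { case (_, time) => time }
      val previous = None +: sorted.map { case (value, _) => Some(value) }.init
      sorted.zip(previous).map { case ((value, time), prev) =>
        (animal, value, time, prev.getOrElse(0))
      }
    }
}

val df3 = withPrevious.toDF("animal", "value", "time", "previous")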