Apache Spark: updating a row in an RDD or Dataset based on another row

I would like to know how to update some rows based on another row.

For example, I have some data like:

Id | username | ratings | city
-------------------------------- 
1, philip, 2.0, montreal, ... 
2, john, 4.0, montreal, ... 
3, charles, 2.0, texas, ...

I want to update the users in the same city so that they share the same group Id (either 1 or 2):

Id | username | ratings | city
-------------------------------- 
1, philip, 2.0, montreal, ... 
1, john, 4.0, montreal, ... 
3, charles, 2.0, texas, ...

How can I achieve this in my RDD or Dataset?

So, just for completeness: if Id is a String, will dense rank not work?

For example:

Id | username | ratings | city
-------------------------------- 
a, philip, 2.0, montreal, ... 
b, john, 4.0, montreal, ... 
c, charles, 2.0, texas, ...

So that the result looks like this:

grade | username | ratings | city
-------------------------------- 
a, philip, 2.0, montreal, ... 
a, john, 4.0, montreal, ... 
c, charles, 2.0, texas, ...

Source

2016-10-14 Adetiloye Philip Kehinde

Try:

import org.apache.spark.sql.functions.monotonically_increasing_id
df.select("city").distinct.withColumn("id", monotonically_increasing_id()).join(df.drop("id"), Seq("city"))

Source

2016-10-14 16:57:33

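A clean way to do this is to use dense_rank() from the Window functions. It enumerates the unique values of the window's ordering column, assigning them consecutive integers. Because city is a String column, the values are numbered in alphabetical order.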

import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // enables the $"city" column syntax

val df = spark.createDataFrame(Seq(
    (1, "philip", 2.0, "montreal"),
    (2, "john", 4.0, "montreal"),
    (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

// dense_rank numbers the distinct cities 1, 2, ... without gaps (rank would give texas 3)
val w = Window.orderBy($"city")
df.withColumn("id", dense_rank().over(w)).show()

+---+--------+------+--------+
| id|username|rating|    city|
+---+--------+------+--------+
|  1|  philip|   2.0|montreal|
|  1|    john|   4.0|montreal|
|  2| charles|   2.0|   texas|
+---+--------+------+--------+

Source

2016-10-14 17:02:00 mtoto

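I'm afraid this isn't distributed, but it is probably fine here, so upvote. –

@mtoto Thanks for your solution, but just to ask: if 'id' is a String, will dense rank not work? –

This approach does not take the existing "id" column into account; it simply assigns a unique key to each unique value of the "city" column. – mtoto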
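The thread raises two open points: a window ordered over the whole dataset is not distributed, and the String Id case. Neither answer above covers this, so what follows is only a minimal sketch of one way to address both. It assumes that "the same Id per city" can mean "the smallest existing Id in that city", and it uses min as a window aggregate over a window partitioned by city. The names users, byCity and grouped, and the local SparkSession setup, are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-id-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data from the question, with String Ids.
val users = Seq(
  ("a", "philip", 2.0, "montreal"),
  ("b", "john", 4.0, "montreal"),
  ("c", "charles", 2.0, "texas")
).toDF("Id", "username", "rating", "city")

// Partitioning by city groups each city's rows together instead of
// ordering the whole dataset inside a single partition.
val byCity = Window.partitionBy($"city")

// Assumption: every row takes the smallest Id found in its own city.
val grouped = users.withColumn("Id", min($"Id").over(byCity))

grouped.orderBy($"username").show()
```

Under these assumptions philip and john both end up with Id a, while charles keeps c. The same expression applies to numeric Ids, where it would produce 1, 1, 3 for the first example in the question.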
