Usage of groupBy in Spark

I'm new to Spark and learning Python. A quick question: in other languages like SQL, we can simply group a table by specified columns and then perform further operations on them, like sum, count, etc. How do we do that in Spark?

I have data with the following schema:

[name:"ABC", city:"New York", money:"50"]
[name:"DEF", city:"London", money:"10"]
[name:"ABC", city:"New York", money:"30"]
[name:"XYZ", city:"London", money:"20"]
[name:"XYZ", city:"London", money:"100"]
[name:"DEF", city:"London", money:"200"]

Let's say I want to group this by city and then sum the money for each name. Something like:

New York ABC 80
London DEF 210
London XYZ 120

Answer


You can use SQL:

>>> sc.parallelize([ 
... {"name": "ABC", "city": "New York", "money":"50"}, 
... {"name": "DEF", "city": "London", "money":"10"}, 
... {"name": "ABC", "city": "New York", "money":"30"}, 
... {"name": "XYZ", "city": "London", "money":"20"}, 
... {"name": "XYZ", "city": "London", "money":"100"}, 
... {"name": "DEF", "city": "London", "money":"200"}, 
... ]).toDF().registerTempTable("df") 

>>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total 
... FROM df GROUP BY name, city""") 

Thanks for the reply. Knowing how to execute SQL statements in Spark will make life much easier.


You can do this the Python way as well (or use the SQL version @LostInOverflow posted):

grouped = df.groupby('city', 'name').sum('money') 

It looks like your money column is a string, so you'll need to cast it to int first (or load it that way to begin with):

df = df.withColumn('money', df['money'].cast('int')) 

And remember that DataFrames are immutable, so both of these require you to assign the result to an object (even if it's just back to df again), and then use show if you want to see the results.
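
For example, putting the two steps together (a minimal sketch, assuming df already holds the data above):

# Cast returns a new DataFrame, so reassign it.
df = df.withColumn('money', df['money'].cast('int'))
# Aggregation also returns a new DataFrame; the sum column is named sum(money).
grouped = df.groupby('city', 'name').sum('money')
# show() is an action that actually prints the result.
grouped.show()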

Edit: I should add that you need to create a DataFrame first. For simple data it's almost the same as the SQL version posted, but you assign it to a DataFrame object instead of registering it as a table:

df = sc.parallelize([ 
    {"name": "ABC", "city": "New York", "money": "50"}, 
    {"name": "DEF", "city": "London", "money": "10"}, 
    {"name": "ABC", "city": "New York", "money": "30"}, 
    {"name": "XYZ", "city": "London", "money": "20"}, 
    {"name": "XYZ", "city": "London", "money": "100"}, 
    {"name": "DEF", "city": "London", "money": "200"}, 
]).toDF() 
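
As an aside, on newer Spark versions (2.x and later) you'd more commonly build the DataFrame through a SparkSession; here's a minimal sketch, assuming a session object named spark:

# Assumption: `spark` is a SparkSession (Spark 2.x+). Rows are given as
# tuples with an explicit column list instead of an RDD of dicts.
df = spark.createDataFrame(
    [("ABC", "New York", "50"),
     ("DEF", "London", "10"),
     ("ABC", "New York", "30"),
     ("XYZ", "London", "20"),
     ("XYZ", "London", "100"),
     ("DEF", "London", "200")],
    ["name", "city", "money"])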

Thanks Jeff, I really did make the mistake of loading money as strings, but now I know what to do when they aren't ints. Thanks for your help!