2017-09-15 77 views

How to create a UDF that creates a new column and modifies an existing column

I have a DataFrame like this:

id | color 
---| ----- 
1 | red-dark 
2 | green-light 
3 | red-light 
4 | blue-sky 
5 | green-dark 

I want to create a UDF so that my DataFrame becomes:

id | color | shade 
---| ----- | ----- 
1 | red | dark 
2 | green | light 
3 | red | light 
4 | blue | sky 
5 | green | dark 

I wrote a UDF for this:

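from pyspark.sql.functions import udf 
from pyspark.sql.types import StringType 

def my_function(data_str): 
    # Splits on "-" and re-joins with ",", e.g. "red-dark" -> "red,dark" 
    return ",".join(data_str.split("-")) 

my_function_udf = udf(my_function, StringType()) 

# Apply the UDF 
df = df.withColumn("shade", my_function_udf(df['color'])) 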

However, this does not transform the DataFrame the way I want. Instead, it turns it into:

id | color       | shade 
---| ----------- | ----- 
1 | red-dark | red,dark 
2 | green-light | green,light 
3 | red-light | red,light 
4 | blue-sky | blue,sky 
5 | green-dark | green,dark 
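
To see why the result looks like this, it can help to run the body of the UDF as plain Python, outside Spark. The sketch below reuses my_function from the question; the variable names parts and joined are only for illustration.

```python
def my_function(data_str):
    # split("-") yields the two parts; ",".join(...) glues them back into one string
    return ",".join(data_str.split("-"))


parts = "red-dark".split("-")    # the two pieces the question actually wants
joined = my_function("red-dark")  # what the UDF returns instead

print(parts)   # ['red', 'dark']
print(joined)  # red,dark
```

The UDF returns a single string per row, and withColumn("shade", ...) only writes the column named shade, so the original color column is left untouched.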

How can I transform the DataFrame the way I want in PySpark?

I tried the approach from the suggested question:

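from pyspark.sql.functions import udf 
from pyspark.sql.types import ArrayType, StringType, StructField, StructType 

# Each row becomes an array holding a single (color, shade) struct 
schema = ArrayType(StructType([ 
    StructField("color", StringType(), False), 
    StructField("shade", StringType(), False) 
])) 

color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))], 
    schema 
) 

df = df.withColumn("colorshade", color_shade_udf(df['color'])) 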

#Gives the following 

id | color       | colorshade 
---| ----------- | ---------- 
1 | red-dark | [{"color":"red","shade":"dark"}] 
2 | green-light | [{"color":"green","shade":"light"}] 
3 | red-light | [{"color":"red","shade":"light"}] 
4 | blue-sky | [{"color":"blue","shade":"sky"}] 
5 | green-dark | [{"color":"green","shade":"dark"}] 

I feel like I am getting closer.


@火花衛生學習 Now just do another '.withColumn("color", "colorshade.color")' + a similar one for shade + 'dropColumn("colorshade")'
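
The steps suggested in this comment could be sketched roughly as follows. This is one possible reading, not the commenter's exact code: it returns a single struct instead of an array of structs so that the fields can be addressed as colorshade.color, it wraps the field paths in col(...) because withColumn expects a Column, and it uses drop, which is the DataFrame method for removing a column. The names split_color_shade and with_color_and_shade are hypothetical, and the sketch assumes every color value is a non-null string.

```python
def split_color_shade(value):
    """Split 'red-dark' into ('red', 'dark'); plain Python, no Spark needed."""
    color, _, shade = value.partition("-")
    return (color, shade)


def with_color_and_shade(df):
    """Return df with `color` overwritten and a new `shade` column added."""
    # Imported lazily so the helper above can be tried without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    # A single struct (not an array of structs) keeps field access simple.
    schema = StructType([
        StructField("color", StringType(), True),
        StructField("shade", StringType(), True),
    ])
    color_shade_udf = udf(split_color_shade, schema)

    return (
        df.withColumn("colorshade", color_shade_udf(col("color")))
          .withColumn("color", col("colorshade.color"))   # overwrite existing column
          .withColumn("shade", col("colorshade.shade"))   # add the new column
          .drop("colorshade")                             # remove the helper column
    )
```

Calling with_color_and_shade(df) on the DataFrame from the question should then leave the columns id, color and shade.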

Answers


You can use the built-in function split():

from pyspark.sql.functions import split, col 

df.withColumn("arr", split(df.color, "\\-")) \ 
    .select("id", 
      col("arr")[0].alias("color"), 
      col("arr")[1].alias("shade")) \ 
    .drop("arr") \ 
    .show() 
+---+-----+-----+ 
| id|color|shade| 
+---+-----+-----+ 
|  1|  red| dark| 
|  2|green|light| 
|  3|  red|light| 
|  4| blue|  sky| 
|  5|green| dark| 
+---+-----+-----+
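
The select in the answer above returns only the columns it lists. If the DataFrame has other columns that should be kept, a withColumn variant may be more convenient. The following is a minimal sketch under that assumption; split_color_column is a hypothetical name, and the plain "-" pattern is used in place of "\\-" since a hyphen has no special meaning outside a character class in a regular expression.

```python
def split_color_column(df):
    """Overwrite `color` with the part before the hyphen and add `shade`."""
    # Imported lazily so this function can be defined without a Spark install.
    from pyspark.sql.functions import col, split

    parts = split(col("color"), "-")

    # `shade` is derived first, while `color` still holds the original
    # "red-dark" style value; overwriting `color` first would lose the shade.
    return (
        df.withColumn("shade", parts.getItem(1))
          .withColumn("color", parts.getItem(0))
    )
```

Because withColumn replaces an existing column in place and appends a new one at the end, the resulting column order should be id, color, shade, followed by nothing else for the DataFrame in the question.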