UDF: how to create a new column and modify an existing column
I have a DataFrame like this:
id | color
---| -----
1 | red-dark
2 | green-light
3 | red-light
4 | blue-sky
5 | green-dark
I want to create a UDF so that my DataFrame becomes:
id | color | shade
---| ----- | -----
1 | red | dark
2 | green | light
3 | red | light
4 | blue | sky
5 | green | dark
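I wrote this UDF for it:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_function(data_str):
    # Returns one string with "-" replaced by ",", e.g. "red-dark" -> "red,dark"
    return ",".join(data_str.split("-"))

my_function_udf = udf(my_function, StringType())
# Apply the UDF
df = df.withColumn("shade", my_function_udf(df['color']))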
However, this does not transform the DataFrame the way I want. Instead, it turns it into:
id | color | shade
---| ---------- | -----
1 | red-dark | red,dark
2 | green-light | green,light
3 | red-light | red,light
4 | blue-sky | blue,sky
5 | green-dark | green,dark
How can I transform the DataFrame the way I want in PySpark?
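One possible way to get the layout shown above without a Python UDF is Spark's built-in `split` function, taking element 0 for `color` and element 1 for `shade`. This is only a minimal sketch, assuming every value contains exactly one `-` and that the part before it should overwrite `color`. The helper name `split_color_column` and its parameters are made up for illustration, and the import sits inside the function only so the snippet can be loaded without a Spark installation.

```python
def split_color_column(df, source="color", sep="-"):
    """Return df with `source` cut down to the text before `sep`, plus a new `shade` column.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame that
    has a string column named `source`.
    """
    # Deferred import: lets this snippet load even where pyspark is not installed.
    from pyspark.sql import functions as F

    # F.split treats `sep` as a regular expression; a plain "-" matches literally.
    parts = F.split(F.col(source), sep)

    return (
        df
        # Add `shade` first, while `source` still holds the full "red-dark" value.
        .withColumn("shade", parts.getItem(1))
        # Then overwrite `source` with the part before the separator.
        .withColumn(source, parts.getItem(0))
    )
```

Calling `df = split_color_column(df)` on the DataFrame from the question should leave `id` untouched, reduce `color` to values such as `red`, and add `shade` with values such as `dark`. The order of the two `withColumn` calls matters: if `color` were overwritten first, there would be nothing left after the separator to put into `shade`.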
Edit: I tried the approach from the suggested question:
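from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The UDF returns a one-element array holding a (color, shade) struct
schema = ArrayType(StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
]))
color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))],
    schema
)
df = df.withColumn("colorshade", color_shade_udf(df['color']))
# Gives the following: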
id | color | colorshade
---| ---------- | -----
1 | red-dark | [{"color":"red","shade":"dark"}]
2 | green-light | [{"color":"green","shade":"light"}]
3 | red-light | [{"color":"red","shade":"light"}]
4 | blue-sky | [{"color":"blue","shade":"sky"}]
5 | green-dark | [{"color":"green","shade":"dark"}]
I think I'm getting closer.
@火花衛生學習 Now you just need another `.withColumn("color", col("colorshade.color"))`, the same for shade, and then `.drop("colorshade")`.
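The steps from the comment could be sketched roughly as follows. This variant assumes the UDF returns a single struct rather than a one-element array, so that `colorshade.color` resolves to a plain string, and it uses `DataFrame.drop` to remove the helper column. The names `split_color` and `add_color_and_shade` are hypothetical, and the PySpark imports are deferred into the function only so the pure-Python part can be tried without Spark.

```python
def split_color(value):
    """Split 'red-dark' into ('red', 'dark'); shade is None when there is no '-'."""
    if value is None:
        return (None, None)
    # partition splits at the first "-" only, so "a-b-c" becomes ("a", "b-c").
    color, _, shade = value.partition("-")
    return (color, shade or None)


def add_color_and_shade(df):
    """Hypothetical helper: df is assumed to have a string column named `color`."""
    # Deferred imports: let this snippet load even where pyspark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType

    # A single struct (not an array of structs), with nullable fields because
    # split_color may return None for either part.
    schema = StructType([
        StructField("color", StringType(), True),
        StructField("shade", StringType(), True),
    ])
    color_shade_udf = F.udf(split_color, schema)

    return (
        df
        .withColumn("colorshade", color_shade_udf(F.col("color")))
        .withColumn("color", F.col("colorshade.color"))  # overwrite existing column
        .withColumn("shade", F.col("colorshade.shade"))  # add the new column
        .drop("colorshade")                              # remove the helper column
    )
```

Keeping the splitting logic in a plain function makes it easy to check on its own, for example `split_color("red-dark")` returns `("red", "dark")`, before wrapping it in a UDF.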