How to create a UDF that creates a new column and modifies an existing column

I have a DataFrame like this:

id | color 
---| ----- 
1 | red-dark 
2 | green-light 
3 | red-light 
4 | blue-sky 
5 | green-dark

I want to create a UDF so that my DataFrame becomes:

id | color | shade 
---| ----- | ----- 
1 | red | dark 
2 | green | light 
3 | red | light 
4 | blue | sky 
5 | green | dark

I wrote a UDF for this:

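from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_function(data_str):
    # Replaces "-" with ",", e.g. "red-dark" -> "red,dark"
    return ",".join(data_str.split("-"))

my_function_udf = udf(my_function, StringType())

# apply the UDF
df = df.withColumn("shade", my_function_udf(df['color']))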

However, this does not transform the DataFrame the way I want. Instead, it turns it into:

id | color  | shade 
---| ---------- | ----- 
1 | red-dark | red,dark 
2 | green-light | green,light
3 | red-light | red,light 
4 | blue-sky | blue,sky 
5 | green-dark | green,dark
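
This output follows from the function itself: it returns a single comma-joined string, and withColumn adds exactly one column under the given name, so `color` is left untouched. A minimal plain-Python check of what the UDF body computes for one value (no Spark needed):

```python
def my_function(data_str):
    # Same body as the UDF above: split on "-" and re-join with ","
    return ",".join(data_str.split("-"))

# One input string yields one output string, so a single call can only
# fill a single StringType column.
result = my_function("red-dark")
print(result)  # red,dark
```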

How can I transform the DataFrame the way I want in PySpark?

I tried the approach from the suggested question:

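from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# One-element array holding a (color, shade) struct
schema = ArrayType(StructType([
    StructField("color", StringType(), False),
    StructField("shade", StringType(), False)
]))

color_shade_udf = udf(
    lambda s: [tuple(s.split("-"))],
    schema
)

df = df.withColumn("colorshade", color_shade_udf(df['color']))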

#Gives the following 

id | color  | colorshade 
---| ---------- | ----- 
1 | red-dark | [{"color":"red","shade":"dark"}] 
2 | green-light | [{"color":"green","shade":"light"}]
3 | red-light | [{"color":"red","shade":"light"}] 
4 | blue-sky | [{"color":"blue","shade":"sky"}] 
5 | green-dark | [{"color":"green","shade":"dark"}]

I feel like I'm getting closer.

Source

2017-09-15 spark-health-learn

@spark-health-learn Now just do another '.withColumn("color", "colorshade.color")', plus a similar one for shade, plus 'dropColumn("colorshade")'.
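
The comment's suggestion can be sketched roughly as follows. This is one possible reading, not the commenter's exact code. It rests on three assumptions: the UDF returns a plain StructType rather than the ArrayType wrapper tried in the question (otherwise the field would first need indexing, as in `colorshade[0]`), withColumn is given a Column built with `col(...)` rather than a bare string, and the column is removed with the DataFrame method `drop`. The names `split_color` and `add_color_and_shade` are hypothetical, and the sketch assumes `color` is never null.

```python
def split_color(value):
    """Return (color, shade) for a string such as 'red-dark'."""
    color, _, shade = value.partition("-")
    return color, shade


def add_color_and_shade(df):
    """Overwrite `color` and add `shade` on a DataFrame with a 'color' column."""
    # Imported here so split_color can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    # A single struct per row (assumption: no ArrayType wrapper).
    schema = StructType([
        StructField("color", StringType(), False),
        StructField("shade", StringType(), False),
    ])
    color_shade_udf = udf(split_color, schema)

    return (
        df.withColumn("colorshade", color_shade_udf(col("color")))
          .withColumn("color", col("colorshade.color"))   # replaces the old column
          .withColumn("shade", col("colorshade.shade"))   # adds the new column
          .drop("colorshade")                             # remove the helper struct
    )
```

Calling `add_color_and_shade(df)` on the DataFrame from the question should give the three columns `id`, `color` and `shade`.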

You can use the built-in function split():

from pyspark.sql.functions import split, col 

df.withColumn("arr", split(df.color, "\\-")) \ 
    .select("id", 
      col("arr")[0].alias("color"), 
      col("arr")[1].alias("shade")) \ 
    .drop("arr") \ 
    .show() 
+---+-----+-----+
| id|color|shade|
+---+-----+-----+
|  1|  red| dark|
|  2|green|light|
|  3|  red|light|
|  4| blue|  sky|
|  5|green| dark|
+---+-----+-----+

Source

2017-09-15 12:55:49 mtoto
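
A variation on the built-in split() approach overwrites `color` in place with withColumn instead of rebuilding the column list with select, so any other columns on the DataFrame are kept. This is a sketch under the assumption that every value contains exactly one hyphen. The names `SEPARATOR` and `split_color_in_place` are hypothetical.

```python
SEPARATOR = "-"  # used as the regex pattern for split(); a lone hyphen matches literally


def split_color_in_place(df):
    """Add `shade` and overwrite `color` on a DataFrame with a 'color' column."""
    # Imported here so the module can be loaded without a Spark installation.
    from pyspark.sql.functions import col, split

    parts = split(col("color"), SEPARATOR)
    return (
        # Order matters: derive `shade` while `color` still holds "red-dark".
        df.withColumn("shade", parts.getItem(1))
          .withColumn("color", parts.getItem(0))
    )
```

If the two withColumn calls were swapped, `color` would already be reduced to "red" by the time `shade` is computed, and the second element would no longer be there.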
