How to convert an array of strings to a vector in a Spark DataFrame in Python

I have a table test_tbl:

+-----------------+--------------+--------------+
| test_tbl.label  | test_tbl.f1  | test_tbl.f2  |
+-----------------+--------------+--------------+
| 0               | a            | b            |
| 1               | c            | d            |
+-----------------+--------------+--------------+

I want to combine columns f1 and f2 into a vector with the following pyspark code:

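from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# UDF that wraps an array column in a dense vector
arr_to_vector = udf(lambda a: Vectors.dense(a), VectorUDT())
df = sqlContext.sql("""SELECT label, array(f1, f2) AS features
                       FROM test_tbl""")
df_vector = df.select(df["label"],
                      arr_to_vector(df["features"]).alias("features"))
df_vector.show()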

Then I get the error: ValueError: setting an array element with a sequence.

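However, if I change the values of f1 and f2 in the table to numbers (although the columns' data type is still defined as string), for example: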

+-----------------+--------------+--------------+
| test_tbl.label  | test_tbl.f1  | test_tbl.f2  |
+-----------------+--------------+--------------+
| 0               | 0.1          | 0.2          |
| 1               | 0.3          | 0.4          |
+-----------------+--------------+--------------+

the error disappears and the UDF works fine.

Can anyone help?
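
One plausible reading of the behaviour described in the question is that a dense vector can only hold floating-point numbers, so every element of the array has to be parseable as a float. The sketch below is a plain-Python analogy, not Spark's actual implementation: `to_dense` is a hypothetical helper that stands in for the coercion, and the two input lists mirror the two versions of the table.

```python
def to_dense(values):
    """Coerce every element to float, roughly as a dense vector requires."""
    return [float(v) for v in values]


numeric_strings = ["0.1", "0.2"]   # strings that happen to hold numbers
category_strings = ["a", "b"]      # strings that are genuinely categorical

# Numeric-looking strings parse without trouble.
dense_ok = to_dense(numeric_strings)

# Categorical strings cannot be parsed, so the coercion raises ValueError.
try:
    to_dense(category_strings)
    failure = None
except ValueError as exc:
    failure = str(exc)

print(dense_ok)
print("conversion failed:", failure)
```

Under this reading, the values "a" to "d" have to be mapped to numbers before they can be placed in a vector.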

Source

2017-07-26 Jiayun Zhao

You could consider using StringIndexer to convert the categorical string values into numeric (float) indices.

https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer

from pyspark.ml.feature import StringIndexer 

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], 
    ["id", "category"]) 

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex") 
indexed = indexer.fit(df).transform(df) 
indexed.show()
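
To make the idea behind the example above concrete without a Spark session, here is a minimal plain-Python sketch of frequency-based label indexing, which is how StringIndexer assigns indices by default (the most frequent label gets index 0.0). The function name `fit_string_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch rather than something stated in the answer.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Same category column as in the DataFrame above.
categories = ["a", "b", "c", "a", "a", "c"]

mapping = fit_string_index(categories)
indexed = [mapping[c] for c in categories]

print(mapping)
print(indexed)
```

For the table in the question, one option would be to index f1 and f2 separately in this way and then combine the two numeric columns into a single vector column, for example with VectorAssembler from pyspark.ml.feature.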

Source

2018-02-27 22:12:38

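You should include an example that shows how this answers the question. If the link dies, your answer will no longer provide any information. – sorak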
