
PySpark - changing long type to array type (LongType)

I read a DataFrame from a CSV as follows:

df1= 
category value Reference value 
count   1  1 
n_timer   20  40,20 
frames   54  56 
timer   8  3,6,7 
pdf    99  100,101,22 
zip    10  10,11,12 

But the columns are read in as long type and string type, whereas I want both as ArrayType(LongType) so that I can intersect the two columns and get the output.

I want to read the DataFrame like below:

category value Reference value 

count  [1]  [1] 
n_timer  [20] [40,20] 
frames  [54] [56] 
timer  [8]  [3,6,7] 
pdf   [99] [100,101,22] 
zip   [10] [10,11,12] 

Please suggest a solution.
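For context, a read along the following lines would produce the types described in the question. This is only a sketch, since the question does not show the actual read call; the path data.csv is a placeholder.

# Hypothetical read; "data.csv" is a placeholder path 
from pyspark.sql import SparkSession 

spark = SparkSession.builder.getOrCreate() 
df1 = spark.read.csv("data.csv", header=True, inferSchema=True) 
# inferSchema makes "value" numeric, but the comma-separated 
# "Reference value" column can only be inferred as a string 
df1.printSchema() 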

Answers

# Check the code below 
from pyspark.sql import SparkSession 
from pyspark.sql.functions import split 

spark = SparkSession.builder.getOrCreate() 
df1 = spark.createDataFrame( 
    [("count", "1", "1"), ("n_timer", "20", "40,20"), ("frames", "54", "56"), 
     ("timer", "8", "3,6,7"), ("pdf", "99", "100,101,22"), ("zip", "10", "10,11,12")], 
    ["category", "value", "Reference_value"]) 
df1.show() 
# split() yields array<string>; the cast converts each element to long 
df1 = df1.withColumn("Reference_value", split("Reference_value", r",\s*").cast("array<long>")) 
df1 = df1.withColumn("value", split("value", r",\s*").cast("array<long>")) 
df1.show() 

Input df1= 
+--------+-----+---------------+ 
|category|value|Reference_value| 
+--------+-----+---------------+ 
|   count|    1|              1| 
| n_timer|   20|          40,20| 
|  frames|   54|             56| 
|   timer|    8|          3,6,7| 
|     pdf|   99|     100,101,22| 
|     zip|   10|       10,11,12| 
+--------+-----+---------------+ 

Output df1= 
+--------+-----+---------------+ 
|category|value|Reference_value| 
+--------+-----+---------------+ 
|   count|  [1]|            [1]| 
| n_timer| [20]|       [40, 20]| 
|  frames| [54]|           [56]| 
|   timer|  [8]|      [3, 6, 7]| 
|     pdf| [99]| [100, 101, 22]| 
|     zip| [10]|   [10, 11, 12]| 
+--------+-----+---------------+ 
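Since the stated goal in the question is to intersect the two columns, here is a minimal follow-up sketch, assuming Spark 2.4+ where array_intersect is available:

from pyspark.sql.functions import array_intersect 

# keep only the elements common to both array columns 
df2 = df1.withColumn("common", array_intersect("value", "Reference_value")) 
df2.show() 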

Encode it as a class with the value and Reference columns typed as arrays.

This is how to do it in Java: Dataset<sample> sampleDim = sqlContext.read().csv(filePath).as(Encoders.bean(sample.class));

You can use the same approach in Python.
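Python has no bean encoders, so the closest equivalent is declaring the schema explicitly and converting the delimited column afterwards. A minimal sketch, assuming a headered CSV at the placeholder path data.csv:

from pyspark.sql import SparkSession 
from pyspark.sql.functions import split 
from pyspark.sql.types import StructType, StructField, StringType 

spark = SparkSession.builder.getOrCreate() 
# CSV has no array type, so read both columns as strings first 
schema = StructType([ 
    StructField("category", StringType()), 
    StructField("value", StringType()), 
    StructField("Reference_value", StringType()), 
]) 
df = spark.read.csv("data.csv", header=True, schema=schema) 
# then split and cast to array<long>, as in the answer above 
df = df.withColumn("value", split("value", r",\s*").cast("array<long>")) 
df = df.withColumn("Reference_value", split("Reference_value", r",\s*").cast("array<long>")) 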