2017-04-26 121 views
1

How do I preserve types when converting a Scala Spark DF -> RDD?

I am trying to convert a DataFrame to an RDD. My DataFrame has typed columns, like this:

df.printSchema 
root 
|-- _c0: integer (nullable = true) 
|-- num_hits: integer (nullable = true) 
|-- session_name: string (nullable = true) 
|-- user_id: string (nullable = true) 

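When I convert it to an RDD using df.rdd, I get an RDD of type Array[org.apache.spark.sql.Row], but when I access the individual entries with rdd(0)(0), rdd(0)(1), etc., they all have type Any. How can I keep the DataFrame's typing when converting it to an RDD? In other words: how can I make the columns in my RDD have types Int, Int, String, String, so that they match the DataFrame?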

Answers

3

You just need to convert your DataFrame to a Dataset[(Int, Int, String, String)] first, like this:

scala> val df = Seq((1, 2, "a", "b")).toDF("_c0", "num_hits", "session_name", "user_id") 
df: org.apache.spark.sql.DataFrame = [_c0: int, num_hits: int ... 2 more fields] 

scala> df.printSchema 
root 
|-- _c0: integer (nullable = false) 
|-- num_hits: integer (nullable = false) 
|-- session_name: string (nullable = true) 
|-- user_id: string (nullable = true) 


scala> val rdd = df.as[(Int, Int, String, String)].rdd 
rdd: org.apache.spark.rdd.RDD[(Int, Int, String, String)] = MapPartitionsRDD[3] at rdd at <console>:25 

If _c0 or num_hits can be null, just change Int to java.lang.Integer.
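
A minimal sketch of the nullable case, assuming a local SparkSession. The object name NullableColumnsExample, the helper toTypedRdd, and the sample rows are made up for illustration and are not part of the original answer. Declaring the integer columns as java.lang.Integer lets a null travel through the conversion as a null reference, rather than having to fit into a primitive Int.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; names and data are illustrative assumptions.
object NullableColumnsExample {

  // Converts the DataFrame to a typed RDD whose integer columns may hold null.
  def toTypedRdd(
      spark: SparkSession,
      df: DataFrame
  ): RDD[(java.lang.Integer, java.lang.Integer, String, String)] = {
    import spark.implicits._ // brings the tuple encoder into scope
    df.as[(java.lang.Integer, java.lang.Integer, String, String)].rdd
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nullable-columns-example")
      .getOrCreate()
    import spark.implicits._

    // Sample data with a null in num_hits (assumed for illustration).
    val df = Seq[(java.lang.Integer, java.lang.Integer, String, String)](
      (Int.box(1), Int.box(2), "a", "b"),
      (Int.box(3), null, "c", "d")
    ).toDF("_c0", "num_hits", "session_name", "user_id")

    val rdd = toTypedRdd(spark, df)

    // Each element is a tuple; the second field is null for the second row.
    rdd.collect().foreach { case (c0, hits, session, user) =>
      println(s"$c0 $hits $session $user")
    }

    spark.stop()
  }
}
```

Code that reads the boxed fields still has to check for null before unboxing them.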

+0

That did it. Thanks! Is there a reason df.rdd doesn't pick up the types? – tSchema

+0

Because DataFrame doesn't know what types you want. as[(Int, Int, String, String)] basically just tells Spark that you want to convert each Row to (Int, Int, String, String). – zsxwing
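
To illustrate the point in the comment above, here is a hedged sketch of doing the same conversion by hand on the Row objects that df.rdd returns, using Row's typed getters (getInt, getString). The object name ManualRowMappingExample, the helper toTuples, and the sample data are assumptions for illustration; the as[...] route from the answer is shorter and is not replaced by this.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical example object; names and data are illustrative assumptions.
object ManualRowMappingExample {

  // Maps each untyped Row to a typed tuple by position.
  // getInt assumes the value is not null; a null column value would fail here.
  def toTuples(df: DataFrame): RDD[(Int, Int, String, String)] = {
    val rows: RDD[Row] = df.rdd // row(i) on these is typed Any
    rows.map { row =>
      (row.getInt(0), row.getInt(1), row.getString(2), row.getString(3))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("manual-row-mapping-example")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2, "a", "b"))
      .toDF("_c0", "num_hits", "session_name", "user_id")

    toTuples(df).collect().foreach(println)

    spark.stop()
  }
}
```

Either way, the target types have to be stated explicitly in the code, which is the reason df.rdd alone cannot supply them.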
