2016-04-27

Converting a Spark DataFrame to a Scala Map collection

I'm trying to find the best way to convert an entire Spark DataFrame to a Scala Map collection.

It's best illustrated as follows. Going from this (the Spark people.json example):

val df = sqlContext.read.json("examples/src/main/resources/people.json") 

df.show 
+----+-------+ 
| age| name| 
+----+-------+ 
|null|Michael| 
| 30| Andy| 
| 19| Justin| 
+----+-------+ 

to a Scala collection (a Map of Maps), represented like this:

val people = Map(
Map("age" -> null, "name" -> "Michael"), 
Map("age" -> 30, "name" -> "Andy"), 
Map("age" -> 19, "name" -> "Justin") 
) 

Answers


I don't think your question quite makes sense as written -- in your outermost Map I only see you trying to populate it with values, but a Map needs key/value pairs. That said:

val peopleArray = df.collect.map(r => Map(df.columns.zip(r.toSeq):_*)) 

will give you:

Array(
    Map("age" -> null, "name" -> "Michael"), 
    Map("age" -> 30, "name" -> "Andy"), 
    Map("age" -> 19, "name" -> "Justin") 
) 

At that point you could do:

val people = Map(peopleArray.map(p => (p.getOrElse("name", null), p)):_*) 

which will give you:

Map(
    ("Michael" -> Map("age" -> null, "name" -> "Michael")), 
    ("Andy" -> Map("age" -> 30, "name" -> "Andy")), 
    ("Justin" -> Map("age" -> 19, "name" -> "Justin")) 
) 

I'm guessing that's closer to what you really want. If you wanted the entries keyed by an arbitrary Long index instead, you can do:

val indexedPeople = Map(peopleArray.zipWithIndex.map(r => (r._2, r._1)):_*) 

which gives you:

Map(
    (0 -> Map("age" -> null, "name" -> "Michael")), 
    (1 -> Map("age" -> 30, "name" -> "Andy")), 
    (2 -> Map("age" -> 19, "name" -> "Justin")) 
) 
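The heart of the snippet above -- zipping the column names against each row's values -- can be sketched without Spark at all, using plain Scala collections as stand-ins for the collected Rows (the sample data below simply mirrors the table in the question):

```scala
// Stand-ins for df.columns and df.collect (illustrative data only, no Spark).
val columns: Array[String] = Array("age", "name")
val rows: Seq[Seq[Any]] = Seq(
  Seq(null, "Michael"),
  Seq(30, "Andy"),
  Seq(19, "Justin")
)

// Equivalent of: df.collect.map(r => Map(df.columns.zip(r.toSeq): _*))
val peopleArray: Seq[Map[String, Any]] =
  rows.map(r => Map(columns.zip(r): _*))

// Equivalent of keying the outer Map by the "name" field
val people: Map[Any, Map[String, Any]] =
  Map(peopleArray.map(p => (p.getOrElse("name", null), p)): _*)

println(people("Andy"))
```

The `Map(pairs: _*)` varargs constructor is what turns the zipped `(column, value)` pairs into a per-row Map; the same pattern builds the outer name-keyed Map.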
Works -- I'd actually missed that. I only needed a collection of Maps, so your first line was all I needed. Thanks –

Sweet -- accept my answer then? ;-) –


First get the schema list from the DataFrame, then get the RDD from the DataFrame and map over it:

val schemaList = dataframe.schema.map(_.name).zipWithIndex // (column name, index) pairs

dataframe.rdd.map(row => 
    // rec._1 is the column name, rec._2 its index 
    schemaList.map(rec => (rec._1, row(rec._2))).toMap 
).collect.foreach(println) 
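The schema/index pairing used here can likewise be sketched with plain collections (illustrative data standing in for the DataFrame's schema and rows):

```scala
// Stand-ins for dataframe.schema.map(_.name) and the RDD's rows (no Spark).
val schemaList: Seq[(String, Int)] = Seq("age", "name").zipWithIndex

val rows: Seq[Seq[Any]] = Seq(
  Seq(null, "Michael"),
  Seq(30, "Andy"),
  Seq(19, "Justin")
)

// Equivalent of: dataframe.rdd.map(row => schemaList.map(rec => (rec._1, row(rec._2))).toMap)
val maps: Seq[Map[String, Any]] =
  rows.map(row => schemaList.map { case (name, idx) => (name, row(idx)) }.toMap)

maps.foreach(println)
```

Unlike the accepted answer, this version builds each Map inside the (here simulated) `rdd.map`, so on a real cluster the per-row work happens on the executors before `collect`.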