2017-06-01

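How do I print the individual elements of a nested array of objects in Spark/Scala? How do I properly iterate over and print a Parquet file using Scala/Spark?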

{"id" : "1201", "name" : "satish", "age" : "25", "path":[{"x":1,"y":1},{"x":2,"y":2}]} 
{"id" : "1202", "name" : "krishna", "age" : "28", "path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]} 

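Specifically, I would like to iterate over each record and print its id, name, and age, followed by each item in path, then move on to the next record, and so forth. Assuming I have already read the Parquet file into a DataFrame, I want to do something like the following (pseudocode):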

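import org.apache.spark.sql.Row

// collect() brings the rows to the driver so the println output appears locally
dataframe.collect().foreach { row =>
  val id = row.getAs[String]("id")
  val name = row.getAs[String]("name")
  val age = row.getAs[String]("age")
  println(s"$id $name $age")
  // path is an array of structs, which Spark exposes as Seq[Row]
  row.getAs[Seq[Row]]("path").foreach { item =>
    // assumes x and y are stored as doubles
    val x = item.getAs[Double]("x")
    val y = item.getAs[Double]("y")
    println(s"$x $y")
  }
}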

I am not sure the above is the right way to go about it, but it should give you an idea of what I am trying to do.

Answers

1
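import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()

import spark.implicits._

// Read the sample records (one JSON object per line)
val data1 = spark.read.json("/home/sakoirala/IdeaProjects/SparkSolutions/src/main/resources/explode.json")

// explode produces one row per element of the path array
val result = data1.withColumn("path", explode($"path"))

// Pull the struct fields x and y out into their own columns
result.withColumn("x", result("path.x"))
  .withColumn("y", result("path.y"))
  .show()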

Output:

+---+----+-------+-----------+----+----+
|age|  id|   name|       path|   x|   y|
+---+----+-------+-----------+----+----+
| 25|1201| satish|  [1.0,1.0]| 1.0| 1.0|
| 25|1201| satish|  [2.0,2.0]| 2.0| 2.0|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
| 28|1202|krishna|[1.23,2.12]|1.23|2.12|
+---+----+-------+-----------+----+----+

0

You can do this entirely with the DataFrame API; there is no need to use map.

Here is how you can flatten your schema by projecting the fields you want to use:

val records = dataframe.select("id", "age", "path.x", "path.y")

You can then print the data with show:

records.show()
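
One thing to keep in mind with projection: when path is an array of structs, selecting "path.x" returns an array holding every x value for that record, not one row per element. If one printed line per path element is what you want, a rough sketch is to combine explode with the projection and then print on the driver. The object name FlattenPathSketch, the helper flattenPath, and the temporary column p are made up for illustration, and the sketch assumes Spark 2.2 or later (for reading JSON from a Dataset[String]) and that the sample records from the question are representative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical names throughout; a sketch, not the answerer's exact method.
object FlattenPathSketch {

  // One output row per element of the path array, with x and y as plain columns.
  def flattenPath(df: DataFrame): DataFrame =
    df.withColumn("p", explode(col("path")))
      .select(col("id"), col("name"), col("age"),
        col("p.x").as("x"), col("p.y").as("y"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("FlattenPathSketch")
      .getOrCreate()
    import spark.implicits._

    // Inline copies of the two sample records from the question
    val json = Seq(
      """{"id":"1201","name":"satish","age":"25","path":[{"x":1,"y":1},{"x":2,"y":2}]}""",
      """{"id":"1202","name":"krishna","age":"28","path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]}"""
    ).toDS()
    val dataframe = spark.read.json(json)

    // collect() pulls every row to the driver, so only use it for small results
    flattenPath(dataframe).collect().foreach { row =>
      val id = row.getAs[String]("id")
      val name = row.getAs[String]("name")
      val age = row.getAs[String]("age")
      // x and y are inferred as doubles for this sample data
      val x = row.getAs[Double]("x")
      val y = row.getAs[Double]("y")
      println(s"$id $name $age $x $y")
    }

    spark.stop()
  }
}
```

For a real Parquet file you would swap the inline JSON for spark.read.parquet with your own path, and on a large dataset you would probably prefer show() or a limited take(n) over collect().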