How to parse Wikipedia infobox JSON using Scala Spark

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Rajanna&rvsection=0

I am able to print the schema of the Wikipedia infobox JSON data returned by this URL:

scala> data.printSchema 
root 
|-- batchcomplete: string (nullable = true) 
|-- query: struct (nullable = true) 
| |-- pages: struct (nullable = true) 
| | |-- 28597189: struct (nullable = true) 
| | | |-- ns: long (nullable = true) 
| | | |-- pageid: long (nullable = true) 
| | | |-- revisions: array (nullable = true) 
| | | | |-- element: struct (containsNull = true) 
| | | | | |-- *: string (nullable = true)  
| | | | | |-- contentformat: string (nullable = true) 
| | | | | |-- contentmodel: string (nullable = true) 
| | | |-- title: string (nullable = true)

I want to extract the data stored under the key "*" (|-- *: string (nullable = true)). Please suggest a solution.

One problem is that in

pages: struct (nullable = true) 
    | | |-- 28597189: struct (nullable = true)

the number 28597189 is different for each title, so it cannot be hard-coded.

Source

2017-02-09 Krish

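First, we need to read the key (28597189) dynamically from the parsed JSON, and then use it to extract the data from the Spark DataFrame, as shown below: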

val keyName = dataFrame.selectExpr("query.pages.*").schema.fieldNames(0) 
println(s"Key Name : $keyName")

This gives you the key dynamically:

Key Name : 28597189

Then use it to extract the data:

import org.apache.spark.sql.functions.explode // needed for explode below
var revDf = dataFrame.select(explode(dataFrame(s"query.pages.$keyName.revisions")).as("revision")).select("revision.*") 
revDf.printSchema()

Output:

root 
|-- *: string (nullable = true) 
|-- contentformat: string (nullable = true) 
|-- contentmodel: string (nullable = true)
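
The two steps so far can be tried locally without calling the API. The following is a minimal sketch, assuming Spark 2.2 or later (for spark.read.json on a Dataset[String]) and a local master. The object name, the helper names and the inline sample JSON are made up for illustration; the sample only mimics the shape of the schema shown in the question.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.explode

object DynamicKeySketch {
  // Hypothetical sample that mimics the shape of the API response.
  val sampleJson: String =
    """{"batchcomplete":"","query":{"pages":{"28597189":{"pageid":28597189,"ns":0,"title":"Rajanna","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"{{Infobox person}}"}]}}}}"""

  // Assumption: a local single-threaded session is enough for trying this out.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("dynamic-key-sketch")
    .getOrCreate()

  // Parse one JSON document per string and let Spark infer the schema.
  def load(json: String): DataFrame =
    spark.read.json(spark.createDataset(Seq(json))(Encoders.STRING))

  // The page id is a field name of the "pages" struct, so read it from the schema.
  def firstPageKey(df: DataFrame): String =
    df.selectExpr("query.pages.*").schema.fieldNames(0)

  // One row per revision, with the revision struct flattened into columns.
  def revisions(df: DataFrame): DataFrame = {
    val key = firstPageKey(df)
    df.select(explode(df(s"query.pages.$key.revisions")).as("revision"))
      .select("revision.*")
  }

  def main(args: Array[String]): Unit = {
    val df = load(sampleJson)
    println(s"Key Name : ${firstPageKey(df)}")
    revisions(df).printSchema()
    spark.stop()
  }
}
```

Note that fieldNames(0) only looks at the first page, which matches a request for a single title as in the question.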

Next, we rename the "*" column to something easier to work with, such as star_column:

revDf = revDf.withColumnRenamed("*", "star_column") 
revDf.printSchema()

Output:

root 
|-- star_column: string (nullable = true) 
|-- contentformat: string (nullable = true) 
|-- contentmodel: string (nullable = true)

Once we have our final DataFrame, we call show:

revDf.show()

Output:

+--------------------+-------------+------------+
|         star_column|contentformat|contentmodel|
+--------------------+-------------+------------+
|{{EngvarB|date=Se...|  text/x-wiki|    wikitext|
+--------------------+-------------+------------+
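
If a response contains several pages, looking up a single field name is no longer enough. One possible alternative, which is not part of the approach above, is to supply an explicit schema that declares "pages" as a map, so that the page id becomes ordinary data instead of a column name. The sketch below assumes Spark 2.2 or later. The object name, the output column names page_key and title, the second page id 12345 and the sample JSON are all hypothetical; the field names of the schema are copied from the printSchema output in the question.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types._

object PagesAsMapSketch {
  // Hypothetical response with two pages; only the shape follows the question.
  val sampleJson: String =
    """{"query":{"pages":{"28597189":{"pageid":28597189,"ns":0,"title":"Rajanna","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"{{Infobox A}}"}]},"12345":{"pageid":12345,"ns":0,"title":"Sample","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"{{Infobox B}}"}]}}}}"""

  val revisionType: StructType = StructType(Seq(
    StructField("*", StringType),
    StructField("contentformat", StringType),
    StructField("contentmodel", StringType)))

  val pageType: StructType = StructType(Seq(
    StructField("ns", LongType),
    StructField("pageid", LongType),
    StructField("revisions", ArrayType(revisionType)),
    StructField("title", StringType)))

  // "pages" is declared as a map, so page ids are map keys rather than field names.
  val responseType: StructType = StructType(Seq(
    StructField("query", StructType(Seq(
      StructField("pages", MapType(StringType, pageType)))))))

  def revisions(spark: SparkSession, json: Seq[String]): DataFrame = {
    val raw = spark.read.schema(responseType)
      .json(spark.createDataset(json)(Encoders.STRING))
    raw.select(explode(col("query.pages"))) // exploding a map yields "key" and "value"
      .select(col("key").as("page_key"), col("value.title").as("title"),
        explode(col("value.revisions")).as("revision"))
      .select(col("page_key"), col("title"),
        col("revision").getField("*").as("star_column"), // avoids parsing "*" as a wildcard
        col("revision.contentformat"), col("revision.contentmodel"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("pages-as-map-sketch").getOrCreate()
    revisions(spark, Seq(sampleJson)).show()
    spark.stop()
  }
}
```

The trade-off is that the schema has to be written by hand, and any field not listed in it is silently dropped when the JSON is read.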

Source

2017-02-11 11:50:21
