2017-02-09 70 views

I want to extract data from the JSON that I get from the wiki API. How can I parse it using Scala and Spark?

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Rajanna&rvsection=0

I am able to print the schema of the wiki infobox JSON data correctly:

scala> data.printSchema 
root 
|-- batchcomplete: string (nullable = true) 
|-- query: struct (nullable = true) 
| |-- pages: struct (nullable = true) 
| | |-- 28597189: struct (nullable = true) 
| | | |-- ns: long (nullable = true) 
| | | |-- pageid: long (nullable = true) 
| | | |-- revisions: array (nullable = true) 
| | | | |-- element: struct (containsNull = true) 
| | | | | |-- *: string (nullable = true)  
| | | | | |-- contentformat: string (nullable = true) 
| | | | | |-- contentmodel: string (nullable = true) 
| | | |-- title: string (nullable = true) 

I want to extract the data under the key "*", that is, |-- *: string (nullable = true). Please suggest a solution.

One problem is that in

pages: struct (nullable = true) 
    | | |-- 28597189: struct (nullable = true) 

the number 28597189 is unique to each title, so it cannot be hard-coded.

Answer

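First, we need to read the key (28597189) from the parsed JSON dynamically, and then use it to extract the data from the Spark DataFrame, as shown below (dataFrame here is the DataFrame loaded from the API response, called data in the question):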

val keyName = dataFrame.selectExpr("query.pages.*").schema.fieldNames(0) 
println(s"Key Name : $keyName") 

This gives you the key dynamically:

Key Name : 28597189 
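
If the request names more than one title, query.pages should contain one field per page, and fieldNames(0) only returns the first of them. Here is a minimal sketch, assuming the schema has the query.pages layout printed in the question, that lists every page key by walking the schema instead of running a select. PageKeys and pageKeys are hypothetical names used for illustration, not part of Spark.

```scala
import org.apache.spark.sql.types.StructType

object PageKeys {
  // Hypothetical helper: returns the field names found under query.pages.
  // Assumes both "query" and "pages" are struct fields, as in the printed schema;
  // the casts will fail if the JSON has a different shape.
  def pageKeys(schema: StructType): Seq[String] = {
    val query = schema("query").dataType.asInstanceOf[StructType]
    val pages = query("pages").dataType.asInstanceOf[StructType]
    pages.fieldNames.toSeq
  }
}
```

With the DataFrame from the question this would be called as PageKeys.pageKeys(dataFrame.schema), and each returned key can be plugged into the extraction step that follows.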

Then use this key to extract the data:

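import org.apache.spark.sql.functions.explode 
// explode yields one row per element of the revisions array 
var revDf = dataFrame.select(explode(dataFrame(s"query.pages.$keyName.revisions")).as("revision")).select("revision.*") 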
revDf.printSchema() 

Output:

root 
|-- *: string (nullable = true) 
|-- contentformat: string (nullable = true) 
|-- contentmodel: string (nullable = true) 

Next, we rename the column * to something with a proper name, such as star_column:

revDf = revDf.withColumnRenamed("*", "star_column") 
revDf.printSchema() 

Output:

root 
|-- star_column: string (nullable = true) 
|-- contentformat: string (nullable = true) 
|-- contentmodel: string (nullable = true) 

Finally, once we have our final DataFrame, we call show:

revDf.show() 

Output:

+--------------------+-------------+------------+
|         star_column|contentformat|contentmodel|
+--------------------+-------------+------------+
|{{EngvarB|date=Se...|  text/x-wiki|    wikitext|
+--------------------+-------------+------------+
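
For reference, the steps above can be put together into one small program. This is only a sketch under some assumptions: the JSON string is a trimmed, made-up stand-in for the real API response, WikiRevisions and revisions are hypothetical names, and reading JSON from a Dataset[String] assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

object WikiRevisions {
  // Hypothetical helper: one row per revision, with the "*" column renamed.
  // Assumes the response contains exactly one page under query.pages.
  def revisions(df: DataFrame): DataFrame = {
    // The page id is not known in advance, so read it from the schema.
    val keyName = df.selectExpr("query.pages.*").schema.fieldNames(0)
    df.select(explode(df(s"query.pages.$keyName.revisions")).as("revision"))
      .select("revision.*")
      .withColumnRenamed("*", "star_column")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wiki-infobox")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample with the same shape as the schema in the question.
    val json =
      """{"batchcomplete":"","query":{"pages":{"28597189":{"pageid":28597189,"ns":0,"title":"Rajanna","revisions":[{"contentformat":"text/x-wiki","contentmodel":"wikitext","*":"{{Infobox film}}"}]}}}}"""

    val df = spark.read.json(Seq(json).toDS())
    val revDf = revisions(df)
    revDf.show()

    // Pull the wikitext of the first revision back to the driver as a plain String.
    val wikitext = revDf.select("star_column").first().getString(0)
    println(wikitext)

    spark.stop()
  }
}
```

Keeping the extraction in a function that takes a DataFrame means the same code can be applied to the real API response once it has been loaded, however that is done.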