2017-09-14 66 views
1

Check whether a nested JSON column exists in a DataFrame

I have some JSON records like the following:

{"Location": 
    {"filter": 
     {"name": "houston", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "florida", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "seattle"}, 
    } 
} 

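After reading the file with spark.read.json("myfile.json"), I want to filter out the rows that do not contain Disaster. In my example, the seattle row should be filtered out.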

I tried

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

but it gives me an error saying that the struct field Disaster does not exist.

So how can I do this?

Thanks

Answer

0

The JSON data appears to be corrupt, i.e. it cannot be turned into a valid DataFrame with spark.read.json("myfile.json").

A similar problem can be worked around by reading the file with the wholeTextFiles API and cleaning it up first:

// Read the whole file as one string, then collapse it to one JSON object per line
val rdd = sc.wholeTextFiles("myfile.json")
val json = rdd.flatMap(_._2.replace(":\n", ":").replace(",\n", "").replace("}\n", "}").replace(" ", "").replace("}{", "}\n{").split("\n"))

This should give you an RDD of valid JSON records:

{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"seattle"}}} 
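
The replace chain above can be tried on a plain Scala string, without Spark, to see what each step does. This is only a sketch: `NormalizeJson`, `normalize` and `raw` are hypothetical names, and the sample assumes every line ends directly in a newline with no trailing whitespace.

```scala
object NormalizeJson {
  // Hypothetical helper: the same replace chain as in the flatMap above,
  // applied to the full text of the file.
  def normalize(text: String): Array[String] =
    text
      .replace(":\n", ":")     // join a key with the value on the next line
      .replace(",\n", "")      // drop a comma at end of line (here, the trailing comma)
      .replace("}\n", "}")     // pull closing braces onto one line
      .replace(" ", "")        // remove indentation and every other space
      .replace("}{", "}\n{")   // put each top-level object on its own line
      .split("\n")

  // Sample input shaped like the file in the question (assumed, no trailing spaces).
  val raw: String =
    "{\"Location\":\n    {\"filter\":\n     {\"name\": \"houston\", \"Disaster\": \"hurricane\"},\n    }\n}\n" +
    "{\"Location\":\n    {\"filter\":\n     {\"name\": \"seattle\"},\n    }\n}\n"

  def main(args: Array[String]): Unit =
    normalize(raw).foreach(println)
}
```

Note that the chain removes every space, so it would also alter string values that contain spaces; it only suits data like the sample, where the values are single words.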

Now you can read the JSON RDD into a DataFrame:

val df = sqlContext.read.json(json) 

which should give you

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
|[[null,seattle]]     |
+---------------------+

schema

root
 |-- Location: struct (nullable = true)
 |    |-- filter: struct (nullable = true)
 |    |    |-- Disaster: string (nullable = true)
 |    |    |-- name: string (nullable = true)

Now that you have a valid DataFrame, you can apply the filter you wanted:

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

newTable

+---------------------+ 
|Location             |
+---------------------+ 
|[[hurricane,houston]]| 
|[[hurricane,florida]]| 
+---------------------+
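
Since the original error was about a missing struct field, it can also help to check whether the nested column resolves at all before filtering on it. The following is a minimal sketch, assuming a Spark 2.x `DataFrame`; `ColumnCheck`, `hasColumn` and `withDisaster` are hypothetical names and not part of the Spark API.

```scala
import scala.util.Try
import org.apache.spark.sql.DataFrame

object ColumnCheck {
  // Hypothetical helper: true when `path` (e.g. "Location.filter.Disaster")
  // resolves against the DataFrame's schema. df(path) throws an
  // AnalysisException when the column or nested field is missing.
  def hasColumn(df: DataFrame, path: String): Boolean =
    Try(df(path)).isSuccess

  // Keep only rows where the nested field is non-null. If the field is
  // absent from the schema, no row can have it, so return an empty result.
  def withDisaster(df: DataFrame): DataFrame =
    if (hasColumn(df, "Location.filter.Disaster"))
      df.filter(df("Location.filter.Disaster").isNotNull)
    else
      df.limit(0)
}
```

With the DataFrame above, `ColumnCheck.withDisaster(df)` should keep the houston and florida rows. Returning an empty DataFrame when the field is missing is one possible design choice; failing with a clearer error message would be another.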