Check whether a nested JSON column exists in a DataFrame

I have some JSON records like the following:

{"Location": 
    {"filter": 
     {"name": "houston", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "florida", "Disaster": "hurricane"}, 
    } 
} 
{"Location": 
    {"filter": 
     {"name": "seattle"}, 
    } 
}

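After reading the file with spark.read.json("myfile.json"), I want to filter out the rows that do not contain Disaster. In my example, the seattle row should be filtered out.

I tried

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

but it gives me an error saying that the struct field Disaster does not exist.

So how can I do this?

Thanks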

2017-09-14 Chen Fan

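Your JSON data appears to be malformed, i.e. it cannot be read into a valid DataFrame with spark.read.json("myfile.json").

A similar problem has been solved by reading the file with the wholeTextFiles API and cleaning up the text before parsing it: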

val rdd = sc.wholeTextFiles("myfile.json") 
val json = rdd.flatMap(_._2.replace(":\n", ":").replace(",\n", "").replace("}\n", "}").replace(" ", "").replace("}{", "}\n{").split("\n"))

This should give you an RDD of valid JSON strings, one per record:

{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}} 
{"Location":{"filter":{"name":"seattle"}}}
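
The chained replace calls above depend on the exact line endings of the file (for example, they expect no trailing space before a newline). A whitespace-agnostic variant could look like the sketch below. JsonCleanup and normalize are hypothetical names that are not part of Spark, and the code assumes that no string value contains whitespace, "}{" or ",}", which holds for the sample data but not for JSON in general.

```scala
object JsonCleanup {
  // Hypothetical helper, not part of Spark.
  // Turns a blob of pretty-printed, concatenated JSON objects into one
  // compact JSON string per object.
  // Assumption: no string value contains whitespace, "}{" or ",}".
  def normalize(raw: String): Seq[String] = {
    val compact = raw
      .replaceAll("\\s+", "") // drop newlines, indentation and all other whitespace
      .replace(",}", "}")     // drop the trailing comma before a closing brace
    if (compact.isEmpty) Seq.empty
    else compact.replace("}{", "}\n{").split("\n").toSeq // one object per element
  }
}
```

In spark-shell this could then be used as sc.wholeTextFiles("myfile.json").flatMap { case (_, content) => JsonCleanup.normalize(content) } in place of the replace chain.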

Now you can read the JSON RDD into a DataFrame:

val df = sqlContext.read.json(json)

This should give you

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
|[[null,seattle]]     |
+---------------------+

with the schema

root
 |-- Location: struct (nullable = true)
 |    |-- filter: struct (nullable = true)
 |    |    |-- Disaster: string (nullable = true)
 |    |    |-- name: string (nullable = true)

Now that you have a valid DataFrame, you can apply the filter you wanted:

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

newTable will be

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
+---------------------+
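
If the input might contain no record with a Disaster field at all, the inferred schema will not include it and the filter above fails with the kind of "no such struct field" error described in the question. One way to guard against that, sketched under the assumption that an empty result is acceptable in that case, is to test whether the column path resolves before filtering. ColumnCheck, hasColumn and keepRowsWith are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.DataFrame
import scala.util.Try

object ColumnCheck {
  // Hypothetical helper: true when the (possibly nested) column path
  // resolves against the DataFrame's schema. df(path) throws an
  // AnalysisException when the path cannot be resolved.
  def hasColumn(df: DataFrame, path: String): Boolean =
    Try(df(path)).isSuccess

  // Keeps only rows where the column is present and non-null.
  // Assumption: when the column is missing from the schema entirely,
  // no row can have it, so an empty DataFrame is returned.
  def keepRowsWith(df: DataFrame, path: String): DataFrame =
    if (hasColumn(df, path)) df.filter(df(path).isNotNull)
    else df.limit(0)
}
```

With the DataFrame from this answer, ColumnCheck.keepRowsWith(df, "Location.filter.Disaster") should keep the houston and florida rows and drop the seattle row.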

2017-09-15 02:18:14
