2017-04-03 72 views
0

Spark - Parsing a JSON file that contains extra text

My JSON file has many lines, and each line looks like this:

Mon Jan 20 00:00:00 -0800 2014, {"cl":"js","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36","ip":"76.4.253.137","cc":"US","rg":"NV","ct":"North Las Vegas","pc":"89084","mc":839,"bf":"402d6c3bdd18e5b5f6541a98a01ecc47d698420d","vst":"0e1c96ff-1f4a-4279-bfdc-ba3fe51c2a4e","lt":"Sun Jan 19 23:59:59 -0800 2014","hk":["memba","alyson stoner","memba them","member them","member them 80s","missy elliotts","www.tmzmembathem","80s memba then","missy elliott","mini"]}, 

/space added for clarity/

{"v":"1.1","pv":"7963ee21-0d09-4924-b315-ced4adad425f","r":"v3","t":"tmzdtcom","a":[{"i":15,"u":"ll-media.tmz.com/2012/10/03/100312-alyson-stoner-then-480w.jpg","w":523,"h":480,"x":503,"y":651,"lt":"none","af":false}],"rf":"http://www.zergnet.com/news/128786/stars-whove-changed-a-lot-since-you-last-saw-them","p":"www.tmz.com/photos/2007/12/20/740-memba-them/images/2012/10/03/100312-alyson-stoner-then-jpg/","fs":true,"tr":0.7,"ac":{},"vp":{"ii":false,"w":1915,"h":1102},"sc":{"w":1920,"h":1200,"d":1},"pid":239,"vid":1,"ss":"0.5"} 

I have tried the following:

Approach 1:

val value1 = sc.textFile(filename).map(_.substring(32)) 

val df = sqlContext.read.json(value1) 

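Here I wanted to skip the text at the start of each line (the 32-character timestamp prefix). In this case I only get the first JSON object of each line.

That is: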

{"cl":"js","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36","ip":"76.4.253.137","cc":"US","rg":"NV","ct":"North Las Vegas","pc":"89084","mc":839,"bf":"402d6c3bdd18e5b5f6541a98a01ecc47d698420d","vst":"0e1c96ff-1f4a-4279-bfdc-ba3fe51c2a4e","lt":"Sun Jan 19 23:59:59 -0800 2014","hk":["memba","alyson stoner","memba them","member them","member them 80s","missy elliotts","www.tmzmembathem","80s memba then","missy elliott","mini"]} 

Approach 2:

val df = sqlContext.read.json(sc.wholeTextFiles(filename).values) 

In this case the output is just a single corrupt record.

Could you tell me what is going wrong here and how to parse this kind of file?

Answers

1

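sqlContext.read.json only works on files whose complete JSON entries appear one per line, not on entries that are expanded or "pretty-printed". Your best bet is to do this: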

val jsonRDD = sparkContext.wholeTextFiles(fileName).map(_._2) 

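According to the documentation, wholeTextFiles returns an RDD[(String, String)] where the first item of each Tuple2 is the file name and the second is the file's content. Only the second is what you care about, so you access the content with ._2.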
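To make the `._2` step concrete, here is a minimal sketch in plain Scala with no Spark involved. The tuple stands in for a single element of the RDD[(String, String)]; the path and content are made up for illustration.

```scala
// Stand-in for one (fileName, content) pair as produced by wholeTextFiles.
// Both values are hypothetical and exist only for this example.
val entry: (String, String) = ("hdfs://example/logs/part-0000.json", """{"v":"1.1"}""")

val path = entry._1     // first item of the Tuple2: the file name
val content = entry._2  // second item of the Tuple2: the file's content

// map(_._2) applies the same accessor to every pair and keeps only the content.
val contents = Seq(entry).map(_._2)
```

On a real RDD the `map(_._2)` call reads the same way; it simply drops the file names.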

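Then you can convert the RDD to a DataFrame and apply to_json to the content, as described in the documentation. Note that to_json serializes a struct column into a JSON string, so for parsing a string column the inverse function from_json, which takes an explicit schema, may be what is needed instead.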

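import sqlContext.implicits._                  // needed for .toDF and the 'json column syntax
import org.apache.spark.sql.functions.to_json  // available from Spark 2.1

val jsonDF = sparkContext 
    .wholeTextFiles(fileName) 
    .map(_._2) 
    .toDF("json") 
    .select(to_json('json)) 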
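Since each line in the question carries a timestamp prefix followed by two JSON objects, another option is to cut every line into its top-level objects before handing them to the JSON reader. The following is only a sketch under that assumption: `extractJsonObjects` is a hypothetical helper, not part of Spark, and it tracks brace depth while ignoring braces inside string literals.

```scala
// Hypothetical helper (not a Spark API): returns every top-level {...} object
// found in a line, skipping any text before, between, or after the objects.
def extractJsonObjects(line: String): List[String] = {
  val found = scala.collection.mutable.ListBuffer.empty[String]
  var depth = 0         // current brace nesting level
  var start = -1        // index where the current top-level object began
  var inString = false  // true while inside a JSON string literal
  var escaped = false   // true if the previous character was a backslash in a string
  for (i <- line.indices) {
    val c = line.charAt(i)
    if (inString) {
      // Inside a string: only look for the closing quote, honoring escapes.
      if (escaped) escaped = false
      else if (c == '\\') escaped = true
      else if (c == '"') inString = false
    } else if (depth > 0 && c == '"') {
      inString = true
    } else if (c == '{') {
      if (depth == 0) start = i
      depth += 1
    } else if (c == '}' && depth > 0) {
      depth -= 1
      if (depth == 0) found += line.substring(start, i + 1)
    }
  }
  found.toList
}
```

With the `sc` and `sqlContext` from the question, this could presumably be applied as `sqlContext.read.json(sc.textFile(filename).flatMap(extractJsonObjects))`, which would give one row per JSON object rather than one per line. Because the two objects on a line have different fields, the inferred schema would be expected to contain the fields of both, with nulls where a field is absent.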
+1

May I know what it does? –

+0

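I suggest you try new things out in the console or in your actual code to get a feel for them as you learn, or at least read the Scaladoc, but I have updated the answer. – Vidya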