2016-11-29 208 views
5

我有一個具有多個對象,如JSON文件:轉換JSON到大熊貓數據幀

{"reviewerID": "bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd", "asin": "a6ed402934e3c1138111dce09256538afb04c566edf37c16b9ba099d23afb764", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}, "reviewText": "This remote, for whatever reason, was chosen by Time Warner to replace their previous silver remote, the Time Warner Synergy V RC-U62CP-1.12S. The actual function of this CLIKR-5 is OK, but the ergonomic design sets back remotes by 20 years. The buttons are all the same, there's no separation of the number buttons, the volume and channel buttons are the same shape as the other buttons on the remote, and it all adds up to a crappy user experience. Why would TWC accept this as a replacement? I'm skipping this and paying double for a refurbished Synergy V.", "summary": "Ergonomic nightmare", "unixReviewTime": 1397433600} 

{"reviewerID": "3689286c8658f54a2ff7aa68ce589c81f6cae4c4d9de76fa0f66d5c114f79837", "asin": "8939d791e9dd035aa58da024ace69b20d651cea4adf6159d984872b44f663301", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}, "reviewText": "This is a great truck GPS. I've tried others and nothing seems to come close to the Rand McNally TND-700.Excellent screen size and resolution. The audio is loud enough to be heard over road noise and the purr of my Kenworth/Cat engine. I've used it for the last 8,000 miles or so and it has only glitched once. Just restarted it and it picked up on my route right where it should have.Clean up the minor issues and this unit rates a solid 5.Rand McNally 528881469 7-inch Intelliroute TND 700 Truck GPS", "summary": "Great Unit!", "unixReviewTime": 1280016000} 

我嘗試使用下面的代碼將其轉換爲一個熊貓數據幀:

train_df = pd.DataFrame() 
count = 0; 
for l in open('train.json'): 
    try: 
     count +=1 
     if(count==20001): 
      break 
     obj1 = json.loads(l) 
     df1=pd.DataFrame(obj1, index=[0]) 
     train_df = train_df.append(df1, ignore_index=True) 
    except ValueError: 
     line = line.replace('\\','') 
     obj = json.loads(line) 
     df1=pd.DataFrame(obj, index=[0]) 
     train_df = train_df.append(df1, ignore_index=True) 

然而,它給了我'NaN'嵌套值,即'有用的'屬性。我想要的輸出,使兩個嵌套屬性的鍵是一個單獨的列。

編輯:

P.S:我使用try /除外,因爲我有「\」字符在某些對象,給了我一個JSON解碼錯誤。

任何人都可以幫忙嗎?有沒有其他方法可以使用?

謝謝。

+0

你試過'pandas.read_json'? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html – DeepSpace

+0

@DeepSpace是的,我有。它給我錯誤說ValueError:'尾隨數據' –

+0

尾隨數據意味着您的文件中有不是json對象的一部分的額外數據。看看你的文件,並確保它是所有有效的json。 – RichSmith

回答

4

這對大量JSON對象的合理速度執行字典的名單上使用json_normalize

from pandas.io.json import json_normalize 

my_list = [] 
with open('train.json') as f: 
    for line in f: 
     line = line.replace('\\','') 
     my_list.append(json.loads(line)) 

# avoid transposing if you want to keep keys as columns of the dataframe 
result_df = json_normalize(my_list).T 

enter image description here

0

嘗試:

pd.concat([pd.Series(json.loads(line)) for line in open('train.json')], axis=1) 

enter image description here

+0

這似乎工作。有沒有辦法讓我可以爲前100個對象做上述解決方案,並將它們存儲在單獨的數據框中?該文件非常大,我無法運行上述解決方案來運行整個文件。另外,有沒有一種方法可以使用try /除此之外?因爲我在一些對象中有一個'\',它給了我一個JsonDecodeError –