How do I read a large JSON file with pandas?

My code is: data_review = pd.read_json('review.json'). My review data looks like this:

{ 
    // string, 22 character unique review id 
    "review_id": "zdSx_SD6obEhz9VrW9uAWA", 

    // string, 22 character unique user id, maps to the user in user.json 
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g", 

    // string, 22 character business id, maps to business in business.json 
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg", 

    // integer, star rating 
    "stars": 4, 

    // string, date formatted YYYY-MM-DD 
    "date": "2016-03-09", 

    // string, the review itself 
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.", 

    // integer, number of useful votes received 
    "useful": 0, 

    // integer, number of funny votes received 
    "funny": 0, 

    // integer, number of cool votes received 
    "cool": 0 
}

But I get the following error:

333    fh, handles = _get_handle(filepath_or_buffer, 'r', 
    334          encoding=encoding) 
--> 335    json = fh.read() 
    336    fh.close() 
    337   else: 

OSError: [Errno 22] Invalid argument

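My JSON file does not contain any comments, and it is 3.8 GB! I just downloaded the file from here to practice: link

When I use the following code, the same error is thrown: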

import json 
with open('review.json') as json_file: 
    data = json.load(json_file)


2017-10-17 ileadall42

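There is a problem with your path/file argument. Make sure the file exists in the folder you are running Python from. Maybe add more details about how and from where you are calling this script. – sascha

You can't put comments in a JSON file: https://stackoverflow.com/questions/244777/can-comments-be-used-in-json Can you try running the code with a clean .json file? –

@LukasAnsteeg I'm pretty sure it never got as far as parsing the JSON, because an error occurred before that. – sascha

Perhaps the file you are reading contains multiple JSON objects, rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. Called this way, both methods read a file that holds exactly one JSON document.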

From what I have seen of the Yelp dataset, your file must contain something like:

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0} 
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0} 
....  
.... 

and so on.

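So it is important to realize that this file is not a single JSON document, but multiple JSON objects in one file, one per line.

To read this data into a pandas DataFrame, the following solution should work: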
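A minimal sketch of that difference, using two made-up records held in an in-memory string rather than the real file (the field values are invented for illustration). Parsing the whole text as one document fails, while parsing it line by line succeeds:

```python
import json

# Two hypothetical records, one JSON object per line (values are invented).
text = (
    '{"review_id": "a1", "stars": 5}\n'
    '{"review_id": "b2", "stars": 3}\n'
)

# Parsing the whole text as a single document fails,
# because a second object follows the first one.
try:
    json.loads(text)
    parsed_whole = True
except json.JSONDecodeError:
    parsed_whole = False

# Parsing each non-empty line on its own works.
records = [json.loads(line) for line in text.splitlines() if line.strip()]
```

Here parsed_whole ends up False and records holds two dictionaries.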

import json

import pandas as pd

with open('review.json') as json_file:
    # Each line of the file is a separate JSON object, so parse line by line.
    # This may take at least 8-10 minutes of processing for 4-5 million rows.
    data = [json.loads(line) for line in json_file]

data_review = pd.DataFrame(data)

Since the data is quite large, I expect your machine will take a considerable amount of time to load it into a DataFrame.
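If memory or load time is a concern, one alternative (not part of the original answer) is to let pandas read the file in chunks. The sketch below assumes one JSON object per line and uses the lines and chunksize parameters of pd.read_json. A small in-memory sample with invented values stands in for review.json, and the chunk size and column selection are arbitrary choices for illustration:

```python
import io

import pandas as pd

# Stand-in for 'review.json': a tiny in-memory sample, one JSON object per line.
sample = io.StringIO(
    '{"review_id": "a1", "stars": 5, "text": "good"}\n'
    '{"review_id": "b2", "stars": 3, "text": "ok"}\n'
    '{"review_id": "c3", "stars": 4, "text": "fine"}\n'
)

# lines=True treats each line as one JSON object; chunksize makes read_json
# return an iterator of DataFrames instead of loading everything at once.
reader = pd.read_json(sample, lines=True, chunksize=2)

# Keep only the columns of interest from each chunk, then combine the pieces.
chunks = [chunk[["review_id", "stars"]] for chunk in reader]
data_review = pd.concat(chunks, ignore_index=True)
```

For the real file, the path 'review.json' would replace sample, with a much larger chunksize. Reading in chunks also avoids pulling the entire 3.8 GB file through a single read() call, which is the step where the traceback in the question failed.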


2017-11-22 00:15:45
