2017-10-17 106 views
3

How do I read a large JSON file with pandas?

My code is: data_review=pd.read_json('review.json')

My review data looks like the following:

{ 
    // string, 22 character unique review id 
    "review_id": "zdSx_SD6obEhz9VrW9uAWA", 

    // string, 22 character unique user id, maps to the user in user.json 
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g", 

    // string, 22 character business id, maps to business in business.json 
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg", 

    // integer, star rating 
    "stars": 4, 

    // string, date formatted YYYY-MM-DD 
    "date": "2016-03-09", 

    // string, the review itself 
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.", 

    // integer, number of useful votes received 
    "useful": 0, 

    // integer, number of funny votes received 
    "funny": 0, 

    // integer, number of cool votes received 
    "cool": 0 
} 

But I got the following error:

333    fh, handles = _get_handle(filepath_or_buffer, 'r', 
    334          encoding=encoding) 
--> 335    json = fh.read() 
    336    fh.close() 
    337   else: 

OSError: [Errno 22] Invalid argument 

My JSON file does not contain any comments, and it is 3.8 GB! I just downloaded the file from here to practice: link

When I use the following code, it throws the same error:

import json 
with open('review.json') as json_file: 
    data = json.load(json_file) 
+1

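Something is wrong with your path/file argument. Make sure the file exists in the folder where you are running python. Maybe add more details on how and from where you are calling this script. – sascha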

+0

You can't have comments in a JSON file: https://stackoverflow.com/questions/244777/can-comments-be-used-in-json Can you try running your code with a clean .json file? –

+0

@LukasAnsteeg I'm pretty sure it never gets as far as parsing the JSON, because the error occurs before that. – sascha

Answers

1

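Perhaps the file you are reading contains multiple JSON objects, rather than the single JSON object or array that json.load(json_file) and pd.read_json('review.json') expect. Called this way, both methods are meant to read a file that holds a single JSON document.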

From what I have seen of the Yelp dataset, your file must contain something like:

{"review_id":"xxxxx","user_id":"xxxxx","business_id":"xxxx","stars":5,"date":"xxx-xx-xx","text":"xyxyxyxyxx","useful":0,"funny":0,"cool":0} 
{"review_id":"yyyy","user_id":"yyyyy","business_id":"yyyyy","stars":3,"date":"yyyy-yy-yy","text":"ababababab","useful":0,"funny":0,"cool":0} 
....  
.... 

and so on. 

So it is important to recognize that this is not a single JSON document, but rather multiple JSON objects in one file, one per line.
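The difference can be seen on a tiny example. This is only a sketch with two made-up records (the field values are placeholders, not real Yelp data): parsing the whole text as one JSON document fails, while parsing it line by line works.

```python
import json

# Two made-up records in the one-object-per-line layout described above.
sample = (
    '{"review_id": "xxxxx", "stars": 5}\n'
    '{"review_id": "yyyy", "stars": 3}\n'
)

# Treating the whole text as ONE JSON document fails, because the parser
# finds more data after the end of the first object.
try:
    json.loads(sample)
    whole_text_error = None
except json.JSONDecodeError as exc:
    whole_text_error = exc.msg

# Treating each non-empty line as its own JSON document succeeds.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(whole_text_error)
print(records)
```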

To read this data into a pandas DataFrame, the following solution should work:

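import json

import pandas as pd

with open('review.json') as json_file:
    # Read every line of the file; each line is one JSON object stored as a string.
    data = json_file.readlines()
    # This line below may take at least 8-10 minutes of processing for 4-5 million rows.
    # It converts all strings in the list to actual JSON objects (Python dicts).
    data = list(map(json.loads, data))

# Build the DataFrame from the list of dicts.
data_review = pd.DataFrame(data)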

Given that the data is quite large, I think your machine will take a considerable amount of time to load it into a DataFrame.
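As a possible alternative, reasonably recent pandas versions can read the one-object-per-line layout directly with pd.read_json(..., lines=True), and with chunksize they return an iterator of smaller DataFrames instead of loading everything in one call. The sketch below assumes the file really is in that layout. The helper name read_json_lines_in_chunks, the default chunk size, and the columns parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def read_json_lines_in_chunks(path, chunksize=100_000, columns=None):
    """Read a one-JSON-object-per-line file in pieces and return one DataFrame.

    Hypothetical helper for illustration; `columns` optionally keeps only
    the listed columns from each chunk to reduce memory use.
    """
    # lines=True: every line is parsed as its own JSON object.
    # chunksize: read_json returns an iterator yielding DataFrames of
    # at most `chunksize` rows each, instead of a single DataFrame.
    reader = pd.read_json(path, lines=True, chunksize=chunksize)

    chunks = []
    for chunk in reader:
        if columns is not None:
            chunk = chunk[columns]
        chunks.append(chunk)

    # Stitch the pieces together and renumber the rows from 0.
    return pd.concat(chunks, ignore_index=True)
```

For example, data_review = read_json_lines_in_chunks('review.json', columns=['stars', 'date']) would keep only two columns. The final DataFrame still has to fit in memory, so for a file of this size it may be better to filter or aggregate each chunk inside the loop rather than keep every row.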