2017-07-03 116 views
0

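How to parse a BIG JSON file in Python

I am working with a very large dataset and have run into a problem I cannot find any answer to. I am trying to parse JSON data. Here is what I did with a chunk taken from the whole dataset, and it works: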

import json

s = set()

with open("data.raw", "r") as f:
    for line in f:
        d = json.loads(line)

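The confusing part is that when I apply the same code to my main data (about 200 GB), it shows the following error (it is not an out-of-memory error):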

d = json.loads(line) 
    File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads 
    return _default_decoder.decode(s) 
    File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode 
    obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
    File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode 
    raise JSONDecodeError("Expecting value", s, err.value) from None 
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1) 

In case it helps, type(f) is TextIOWrapper... but it is the same type for the small dataset too.

Here are a few lines of my data to show the format:
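For reference, one input that produces exactly the message in the traceback is a line containing nothing but a newline. The sketch below only illustrates how `json.loads` reports that case; it does not prove that blank lines are what the 200 GB file contains, and the variable names are made up for the example.

```python
import json

# A line that holds only a newline, as a blank line in a text file would.
blank_line = "\n"

try:
    json.loads(blank_line)
except json.JSONDecodeError as err:
    # str(err) is the same text that appears at the end of a traceback.
    message = str(err)
    print(message)
```

The decoder skips the leading whitespace, finds no value after it, and reports the position just past the newline, which it counts as line 2, column 1.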

{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}} 
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}} 
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}} 
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}} 

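It is JSON, because I have already parsed the first 2000 lines and that works perfectly. But when I try the same process on the big file, it reports an error on the first line of the data.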

+0

What changes should be made to that JSON data? – RomanPerekhrest

+0

Is 'data.raw' a single JSON file, or a file with one JSON object on each line? If the former, use [`json.load`](https://docs.python.org/3.5/library/json.html#json.load) – Will

+0

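Your file is not valid JSON. However, it does appear to contain a valid JSON text on each line. My suggestion is to fix whatever produces this "JSON" (it is not actually JSON). Failing that, I suppose you can deserialize it line by line and accumulate the objects in a list or something. –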

Answers

2

Here is some simple code to see which data is invalid JSON and where it is:

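import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            # repr() makes invisible characters such as '\n' visible.
            print('Error on line', i + 1, ':\n', repr(line))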
+0

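Thanks @alex. I used this code and the result is very strange! According to the result, I get an error on every even line! But I used the first 2000 lines of my big file and it did not show any error... this is so confusing... – Mina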

+0

@Mina Can you show us one of the error messages? I especially want to see a line that fails. –

+0

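You won't believe it, but that was the key: my main big file contains extra Enters (blank lines), and that is the reason for the error message! By the way, your suggestion was very helpful for finding the root of the error. Thanks. – Mina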
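Given that the failing lines turned out to be blank, a minimal sketch of a reader that skips them could look like the following. The function name `iter_json_lines` and the sample data are made up for illustration; the function accepts any iterable of strings, so an open file object works as well as a list.

```python
import json

def iter_json_lines(lines):
    """Yield one parsed object per non-blank line of `lines`."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines instead of passing them to json.loads
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report the position in the file, not just inside the line.
            raise ValueError("invalid JSON on line {}: {}".format(number, err)) from err

# Tiny stand-in for the real file: valid records separated by a blank line.
sample = ['{"MessageType": "A"}\n', '\n', '{"MessageType": "B"}\n']
message_types = [record["MessageType"] for record in iter_json_lines(sample)]
print(message_types)
```

Because it is a generator, only one line is held in memory at a time, so the same function can be pointed at the file object returned by `open("data.raw", "r")`.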

1

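A good way to read a big JSON dataset in Python is to use a generator (with yield). 200 GB is too big for your memory if your JSON parser stores the whole file in memory, whereas an iterator processes the data step by step and keeps memory use low.

You can use ijson, an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/

But here your file has a .raw extension, so it may not be a JSON file.

To read such a file: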

import numpy as np 

content = np.fromfile("data.raw", dtype=np.int16, sep="") 

But this solution can crash on big files.

If the .raw file in fact turns out to be a .csv file, then you can create your reader like this:

import csv 

def read_big_file(filename):
    # In Python 3 the csv module expects a text-mode file opened with newline="".
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row

Or like this for a text file:

def read_big_file(filename):
    with open(filename, "r") as _file:
        for line in _file:
            yield line

Use "rb" mode only if your file is binary.

Usage:

for line in read_big_file(filename):
    # <treatment>: process the line here
    # <free memory after a chunk of a given size>
    pass
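One way to make the "free memory after a chunk" step in the loop above concrete is to group the lines into fixed-size batches. This is only a sketch: the helper name `read_in_chunks` and the chunk size are assumptions, and `range(10)` stands in for the lines coming from `read_big_file`.

```python
from itertools import islice

def read_in_chunks(lines, chunk_size):
    """Yield lists of at most `chunk_size` items taken from any iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # the source is exhausted
        yield chunk

# Stand-in for read_big_file(filename): any iterable behaves the same way.
chunk_lengths = [len(chunk) for chunk in read_in_chunks(range(10), 4)]
print(chunk_lengths)
```

Each chunk is an ordinary list, so once the loop moves on to the next one and nothing else refers to the previous list, Python can reclaim its memory.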

I can make my answer more precise if you post the first lines of your file.