Fastest way to read thousands of JSON files in Python

I have a number of JSON files that I need to analyze. I am using iPython (Python 3.5.2 | IPython 5.0.0), reading each file into a dictionary and appending each dictionary to a list.

My main bottleneck is reading in the files. Some files are small and are read quickly, but the larger files are slowing me down.

Here is some example code (sorry, I can't provide the actual data files):

import json
import glob

def read_json_files(path_to_file):
    with open(path_to_file) as p:
        data = json.load(p)
        p.close()
    return data

def giant_list(json_files):
    data_list = []
    for f in json_files:
        data_list.append(read_json_files(f))
    return data_list

support_files = glob.glob('/Users/path/to/support_tickets_*.json')
small_file_test = giant_list(support_files)

event_files = glob.glob('/Users/path/to/google_analytics_data_*.json')
large_file_test = giant_list(event_files)

The support tickets are very small; the largest I have seen is 6KB. So this code runs pretty fast:

In [3]: len(support_files) 
Out[3]: 5278 

In [5]: %timeit giant_list(support_files) 
1 loop, best of 3: 557 ms per loop

But the larger files are definitely slowing me down... these event files can reach ~2.5MB each:

In [7]: len(event_files) # there will be a lot more of these soon :-/ 
Out[7]: 397 

In [8]: %timeit giant_list(event_files) 
1 loop, best of 3: 14.2 s per loop

I have researched how to speed this up and came across this post; however, when using UltraJSON the timing was just slightly worse:

In [3]: %timeit giant_list(traffic_files) 
1 loop, best of 3: 16.3 s per loop

SimpleJSON did not do any better:

In [4]: %timeit giant_list(traffic_files) 
1 loop, best of 3: 16.3 s per loop

Any tips on how to optimize this code and read a large number of JSON files into Python more efficiently would be much appreciated.

Finally, this post is the closest I have found to my question, but it deals with one giant JSON file, not many smaller ones.

Source

2016-10-04 measureallthethings

Your bottleneck is I/O, not parsing speed. There is not much to be done about that other than getting faster disks (are you running on an SSD?). –

And the `json` module in the Python standard library is the exact same project as `simplejson`. –

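@MartijnPieters How did you arrive at that conclusion? Based on some quick tests, `json.load()` reaches about 46MiB/s on a fast CPU. That is not unattainable for disk-based storage, never mind SSDs. And that is ignoring the possibility that his input files are cached in memory... – marcelm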
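One way to settle the I/O-versus-parsing question raised in these comments is to time the two steps separately. The following is only a minimal sketch: `time_read_vs_parse` is a hypothetical helper name, the files are assumed to be UTF-8 encoded, and the decode step is counted as part of parsing. On a warm page cache the read time will look artificially small, so the numbers are indicative at best.

```python
import json
import time


def time_read_vs_parse(paths):
    """Return (read_seconds, parse_seconds, total_bytes) for the given files.

    Hypothetical helper for illustration; assumes UTF-8 encoded JSON files.
    """
    read_seconds = 0.0
    parse_seconds = 0.0
    total_bytes = 0
    for path in paths:
        # Step 1: pull the raw bytes off disk (or out of the OS page cache).
        start = time.perf_counter()
        with open(path, "rb") as f:
            raw = f.read()
        read_seconds += time.perf_counter() - start

        # Step 2: decode and parse in memory; no file access happens here.
        start = time.perf_counter()
        json.loads(raw.decode("utf-8"))
        parse_seconds += time.perf_counter() - start

        total_bytes += len(raw)
    return read_seconds, parse_seconds, total_bytes
```

Calling `time_read_vs_parse(event_files)` and comparing the first two values gives a rough idea of where the 14 seconds go. If parsing dominates, a faster disk is unlikely to help much.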

Use a list comprehension to avoid resizing the list multiple times.

def giant_list(json_files): 
    return [read_json_file(path) for path in json_files]

You are closing the file object twice; do it just once (the file is closed automatically when the `with` block exits).

def read_json_file(path_to_file):
    with open(path_to_file) as p:
        return json.load(p)

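At the end of the day, your problem is I/O bound, but these changes will help. Also, I have to ask: do you really need to have all of these dictionaries in memory at the same time?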

Source

2016-10-04 16:48:40

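Good question. I do not need the thousands of smaller files in memory at the same time. In each case, I extract 5 specific fields and then discard the rest of the dictionary. When it comes to the large event files, I have more problems... this is Google Analytics data and parsing it makes me cry: https://developers.google.com/analytics/devguides/reporting/core/v4/migration#parsing_the_v4_response. Also, I parse it and then convert it to a pandas DataFrame... I will probably save that for another post :-/ – measureallthethings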

Even simpler: I dropped the `giant_list()` function and just do a list comprehension directly: `[read_json_file(path) for path in event_files]` – measureallthethings
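For the small files, where only a handful of fields are kept, a generator can extract those fields one file at a time so that the full dictionaries never accumulate in memory. This is a rough sketch, not the poster's code: `iter_selected_fields` is a hypothetical helper, it assumes each file holds a single JSON object at the top level, and the field names in the usage note are made up for illustration.

```python
import json


def iter_selected_fields(json_files, fields):
    """Yield one small dict per file, keeping only the requested top-level keys.

    Hypothetical helper; assumes the top level of every file is a JSON object.
    """
    for path in json_files:
        with open(path) as f:
            record = json.load(f)
        # Only the selected keys survive; the full parsed dict becomes
        # garbage as soon as the next file is loaded. Missing keys map to None.
        yield {key: record.get(key) for key in fields}
```

Something like `rows = list(iter_selected_fields(support_files, ["id", "status"]))` would then hold only the extracted values. This does not make the reading itself faster, but it keeps peak memory proportional to the extracted fields instead of the full files.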
