2017-02-10

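Python performance tuning: JSON to CSV, big file

A colleague has asked me to convert 6 huge files from the "Yelp Dataset Challenge" from somewhat "flat" regular JSON to CSV (he thinks they look like fun teaching material).

I thought I could knock it out with: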

# With thanks to http://www.diveintopython3.net/files.html and https://www.reddit.com/r/MachineLearning/comments/33eglq/python_help_jsoncsv_pandas/cqkwyu8/ 

import os 
import pandas 

jsondir = 'c:\\example\\bigfiles\\' 
csvdir = 'c:\\example\\bigcsvfiles\\' 
if not os.path.exists(csvdir): os.makedirs(csvdir) 

for file in os.listdir(jsondir): 
    with open(jsondir+file, 'r', encoding='utf-8') as f: data = f.readlines() 
    df = pandas.read_json('[' + ','.join(map(lambda x: x.rstrip(), data)) + ']') 
    df.to_csv(csvdir+os.path.splitext(file)[0]+'.csv',index=0,quoting=1) 

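Unfortunately, my computer's memory isn't up to the task at these file sizes. (Even if I get rid of the loop, it churns out a 50MB file in under a minute, but it struggles to avoid freezing my computer or crashing on 100MB+ files, and the biggest file is 3.25GB.)

Is there something else that is simple but performant enough to run instead?

A loop would be nice, but if it makes a difference to memory I can also run the script 6 times with separate filenames (there are only 6 files).

Here is an example of the contents of a ".json" file. Note that each file actually holds many JSON objects, one per line.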

{"business_id":"xyzzy","name":"Business A","neighborhood":"","address":"XX YY ZZ","city":"Tempe","state":"AZ","postal_code":"85283","latitude":33.32823894longitude":-111.28948,"stars":3,"review_count":3,"is_open":0,"attributes":["BikeParking: True","BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}","DogsAllowed: False","RestaurantsPriceRange2: 2","WheelchairAccessible: True"],"categories":["Tobacco Shops","Nightlife","Vape Shops","Shopping"],"hours":["Monday 11:0-21:0","Tuesday 11:0-21:0","Wednesday 11:0-21:0","Thursday 11:0-21:0","Friday 11:0-22:0","Saturday 10:0-22:0","Sunday 11:0-18:0"],"type":"business"} 
{"business_id":"dsfiuweio2f","name":"Some Place","neighborhood":"","address":"Strip or something","city":"Las Vegas","state":"NV","postal_code":"89106","latitude":36.189134,"longitude":-115.92094,"stars":1.5,"review_count":2,"is_open":1,"attributes":["BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True"],"categories":["Caterers","Grocery","Food","Event Planning & Services","Party & Event Planning","Specialty Food"],"hours":["Monday 0:0-0:0","Tuesday 0:0-0:0","Wednesday 0:0-0:0","Thursday 0:0-0:0","Friday 0:0-0:0","Saturday 0:0-0:0","Sunday 0:0-0:0"],"type":"business"} 
{"business_id":"abccb","name":"La la la","neighborhood":"Blah blah","address":"Yay that","city":"Toronto","state":"ON","postal_code":"M6H 1L5","latitude":43.283984,"longitude":-79.28284,"stars":2,"review_count":6,"is_open":1,"attributes":["Alcohol: none","Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}","BikeParking: True","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}","Caters: True","GoodForKids: True","GoodForMeal: {'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'breakfast': False, 'brunch': False}","HasTV: True","NoiseLevel: quiet","OutdoorSeating: False","RestaurantsAttire: casual","RestaurantsDelivery: True","RestaurantsGoodForGroups: True","RestaurantsPriceRange2: 1","RestaurantsReservations: False","RestaurantsTableService: False","RestaurantsTakeOut: True","WiFi: free"],"categories":["Restaurants","Pizza","Chicken Wings","Italian"],"hours":["Monday 11:0-2:0","Tuesday 11:0-2:0","Wednesday 11:0-2:0","Thursday 11:0-3:0","Friday 11:0-3:0","Saturday 11:0-3:0","Sunday 11:0-2:0"],"type":"business"} 

Nested JSON data can simply be kept as a string literal representing it. I only want to turn the top-level keys into CSV column headers.

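Instead of **reading and parsing the whole file at once**, you could try **reading one JSON dict or one CSV row at a time**, then parsing it and writing it to the CSV. This needs more manual coding, but it works well in a file-streaming style.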
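The streaming idea in the comment above can be sketched with the standard library alone. This is only a minimal sketch under some assumptions: the function name `jsonl_to_csv` is made up for illustration, the first object's keys are taken as the CSV header, and nested lists or dicts are written back out as JSON strings, as the question asks.

```python
import csv
import json


def jsonl_to_csv(json_path, csv_path):
    """Stream a one-object-per-line JSON file to CSV, one row at a time."""
    with open(json_path, 'r', encoding='utf-8') as src, \
            open(csv_path, 'w', encoding='utf-8', newline='') as dst:
        writer = None
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keep nested lists/dicts as their JSON string representation.
            row = {key: json.dumps(value) if isinstance(value, (list, dict)) else value
                   for key, value in record.items()}
            if writer is None:
                # Assumption: the first object's keys define the CSV header.
                # Keys missing from later rows become empty cells; extra keys are dropped.
                writer = csv.DictWriter(dst, fieldnames=list(row),
                                        quoting=csv.QUOTE_ALL, extrasaction='ignore')
                writer.writeheader()
            writer.writerow(row)
```

Only one line of input is held in memory at a time, so memory use should stay roughly flat regardless of file size.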

Answer

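The problem is that your code reads the whole file into memory and then builds a near-copy of it in memory. I suspect it also creates a third copy, but I haven't verified that. The solution, as Neo X suggested, is to read the file line by line and process each line as you go. Here is a replacement for the for loop: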

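from io import StringIO  # newer pandas versions expect a file-like object, not a raw JSON string

for file in os.listdir(jsondir):
    csv_file = csvdir + os.path.splitext(file)[0] + '.csv'
    # newline='' is what the pandas docs recommend for file objects passed to to_csv()
    with open(jsondir+file, 'r', encoding='utf-8') as f, open(csv_file, 'w', encoding='utf-8', newline='') as csv:
        header = True  # write the column names only once, before the first row
        for line in f:
            # Parse one JSON object at a time so only a single row is held in memory
            df = pandas.read_json(StringIO('[' + line.rstrip() + ']'))
            df.to_csv(csv, header=header, index=0, quoting=1)
            header = False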

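I tested this on a Mac with Python 3.5. It should work on Windows, but I haven't tested it there.

Notes:

  1. I adjusted the JSON data; the latitude/longitude on the first line appears to contain an error.

  2. This has only been tested with small files; I'm not sure where to get the 3.5 GB file.

  3. I'm assuming this is a one-off job for your friend. If this were production code, you would need to verify that the exception handling of the 'with' statement is correct. For details, see How can I open multiple files using "with open" in Python?

  4. This should be fairly performant, but I don't know where to get the large files.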
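Since the notes mention not having a large file to test against, one option is to generate a synthetic one. The following is only a sketch: the function name `write_synthetic_jsonl` and the fake fields are invented for illustration and only loosely resemble the sample records in the question, not the real Yelp data.

```python
import json
import random


def write_synthetic_jsonl(path, target_bytes, seed=0):
    """Write fake one-object-per-line JSON records until the file reaches target_bytes."""
    rng = random.Random(seed)
    written = 0
    count = 0
    # newline='\n' disables newline translation so the byte count stays exact.
    with open(path, 'w', encoding='utf-8', newline='\n') as out:
        while written < target_bytes:
            record = {
                'business_id': 'id%08d' % count,
                'name': 'Business %d' % count,
                'stars': rng.choice([1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]),
                'review_count': rng.randint(1, 500),
                'categories': rng.sample(['Food', 'Pizza', 'Shopping', 'Nightlife'], 2),
                'type': 'business',
            }
            line = json.dumps(record) + '\n'
            out.write(line)
            written += len(line.encode('utf-8'))
            count += 1
    return count  # number of records written
```

For example, `write_synthetic_jsonl('fake.json', 100 * 1024 ** 2)` would produce a file of roughly 100MB to run the conversion against.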

Check out [ijson](https://pypi.python.org/pypi/ijson/), which makes streaming JSON files as simple as using Python iterators – sundance

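@kevin: Question: why doesn't your 'to_csv()' call need a 'mode='a'' argument? Is there something about calling 'to_csv()' inside the 'open' that makes it append automatically? Also, your code runs beautifully. Converting a small file takes longer, but my computer no longer freezes and the job still finishes in a reasonable time (it should easily be done by the end of the day), so I can leave it running in the background. Thank you so much. (Finally, I edited your code to include UTF-8 encoding on the output file; I was getting errors with foreign-language input data until I did.)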

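@k .. Glad I could help! Since 'csv' is already open when it is passed to 'to_csv()', the latter does not close the file after writing, so each call continues where the previous one stopped. You can verify this by looking at def save() in the source; it sets 'close = False' at line 1476. https://github.com/pandas-dev/pandas/blob/master/pandas/formats/format.py. Good question! – kevin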
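On the slowdown mentioned in the comments: it probably comes from building a one-row DataFrame for every line. A possible middle ground, sketched below, is to let pandas parse the file in chunks. This assumes a pandas version where `read_json` accepts `lines=True` together with `chunksize`, and that every chunk yields the same columns in the same order. The function name `convert_in_chunks` and the default chunk size are arbitrary choices for illustration.

```python
import csv

import pandas


def convert_in_chunks(json_path, csv_path, chunksize=10000):
    """Convert line-delimited JSON to CSV, parsing chunksize lines at a time."""
    # lines=True treats each line as one JSON object; chunksize makes
    # read_json return an iterator of DataFrames instead of one big frame.
    reader = pandas.read_json(json_path, lines=True, chunksize=chunksize,
                              encoding='utf-8')
    with open(csv_path, 'w', encoding='utf-8', newline='') as out:
        header = True  # write the column names only for the first chunk
        for chunk in reader:
            chunk.to_csv(out, header=header, index=False, quoting=csv.QUOTE_ALL)
            header = False
```

A larger `chunksize` should trade memory for speed, so it may be worth tuning to whatever the machine can hold comfortably.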