Pandas memory consumption when grouping HDF files

I wrote the script below, but I'm running into memory problems: pandas ends up allocating more than 30 GB of RAM, even though the data files together total only about 18 GB.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import matplotlib 
import time 


mean_wo = pd.DataFrame() 
mean_w = pd.DataFrame() 
std_w = pd.DataFrame() 
std_wo = pd.DataFrame() 

start_time=time.time() #taking current time as starting time 

data_files=['2012.h5','2013.h5','2014.h5','2015.h5', '2016.h5', '2008_2011.h5'] 



for data_file in data_files: 
    print(data_file) 
    df = pd.read_hdf(data_file) 
    grouped = df.groupby('day') 
    mean_wo_tmp = grouped['Significance_without_muons'].agg(['mean']) 
    mean_w_tmp = grouped['Significance_with_muons'].agg(['mean']) 
    std_wo_tmp = grouped['Significance_without_muons'].agg(['std']) 
    std_w_tmp = grouped['Significance_with_muons'].agg(['std']) 
    mean_wo = pd.concat([mean_wo, mean_wo_tmp]) 
    mean_w = pd.concat([mean_w, mean_w_tmp]) 
    std_w = pd.concat([std_w,std_w_tmp]) 
    std_wo = pd.concat([std_wo,std_wo_tmp]) 
    mean_wo.info()   # DataFrame.info() prints its summary itself and returns None 
    mean_w.info() 
    del df, grouped, mean_wo_tmp, mean_w_tmp, std_w_tmp, std_wo_tmp 

std_wo=std_wo.reset_index() 
std_w=std_w.reset_index() 
mean_wo=mean_wo.reset_index() 
mean_w=mean_w.reset_index() 

#setting the field day as date 
std_wo['day']= pd.to_datetime(std_wo['day'], format='%Y-%m-%d') 
std_w['day']= pd.to_datetime(std_w['day'], format='%Y-%m-%d') 
mean_w['day']= pd.to_datetime(mean_w['day'], format='%Y-%m-%d') 
mean_wo['day']= pd.to_datetime(mean_wo['day'], format='%Y-%m-%d') 

So, does anyone have an idea of how to reduce the memory consumption?

Cheers,

Answer


I would do something like this.

Solution

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5'] 
cols = ['Significance_without_muons', 'Significance_with_muons'] 

def agg(data_file): 
    return pd.read_hdf(data_file).groupby('day')[cols].agg(['mean', 'std']) 

big_df = pd.concat([agg(fn) for fn in data_files], axis=1, keys=data_files) 

mean_wo_tmp = big_df.xs(('Significance_without_muons', 'mean'), axis=1, level=[1, 2]) 
mean_w_tmp = big_df.xs(('Significance_with_muons', 'mean'), axis=1, level=[1, 2]) 
std_wo_tmp = big_df.xs(('Significance_without_muons', 'std'), axis=1, level=[1, 2]) 
std_w_tmp = big_df.xs(('Significance_with_muons', 'std'), axis=1, level=[1, 2]) 

del big_df 
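
To see why the xs calls work: pd.concat(..., axis=1, keys=data_files) prepends each file name as an outer column level, so big_df.columns is a three-level MultiIndex of (file, column, statistic), and xs with level=[1, 2] slices out one (column, statistic) pair, leaving one column per file. A minimal sketch of the same selection on toy data (names here are illustrative, not from the question):

import numpy as np 
import pandas as pd 

# Two "files", each already aggregated to (column, statistic) columns. 
a = pd.DataFrame(np.ones((2, 2)), index=['A', 'B'], 
                 columns=pd.MultiIndex.from_product([['x'], ['mean', 'std']])) 
b = a * 2 
big = pd.concat([a, b], axis=1, keys=['f1.h5', 'f2.h5']) 

# big.columns now has three levels: (file, column, statistic). 
print(big.xs(('x', 'mean'), axis=1, level=[1, 2])) 
#    f1.h5  f2.h5 
# A    1.0    2.0 
# B    1.0    2.0 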

Setup

data_files=['2012.h5', '2013.h5', '2014.h5', '2015.h5', '2016.h5', '2008_2011.h5'] 
cols = ['Significance_without_muons', 'Significance_with_muons'] 

np.random.seed([3,1415]) 
data_df = pd.DataFrame(np.random.rand(1000, 2), columns=cols) 
data_df['day'] = np.random.choice(list('ABCDEFG'), 1000) 

for fn in data_files: 
    data_df.to_hdf(fn, key='day', append=False) 

Run the solution above, then:

mean_wo_tmp 

[screenshot of the resulting mean_wo_tmp DataFrame]
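
Since the screenshot isn't reproduced here: with the setup above, mean_wo_tmp has the seven group keys 'A' through 'G' as its index and one column of means per input file, which you can verify with something like:

print(mean_wo_tmp.shape)             # (7, 6): 7 days x 6 files 
print(mean_wo_tmp.columns.tolist())  # the six file names from data_files 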


Thanks a lot, piRSquared! I'll try your approach. For now I've added a 'gc.collect()' at the end of the for loop, and with that I managed to keep it under a 25 GB threshold. I'll let you know if your way works better :) Thanks again!
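
For reference, the change described in this comment is presumably along these lines (a sketch only, assuming the loop body otherwise stays as in the question): delete the references to the large objects at the end of each iteration and force a collection before the next file is read.

import gc 

for data_file in data_files: 
    df = pd.read_hdf(data_file)   # each whole file is loaded into RAM here 
    grouped = df.groupby('day') 
    mean_wo = pd.concat([mean_wo, grouped['Significance_without_muons'].agg(['mean'])]) 
    mean_w = pd.concat([mean_w, grouped['Significance_with_muons'].agg(['mean'])]) 
    std_wo = pd.concat([std_wo, grouped['Significance_without_muons'].agg(['std'])]) 
    std_w = pd.concat([std_w, grouped['Significance_with_muons'].agg(['std'])]) 
    del df, grouped               # drop references to the large objects 
    gc.collect()                  # and reclaim their memory immediately 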