2014-10-09

Lookup table comparison: I have two pandas DataFrames and want to speed up an all-to-all operation between them with numpy and/or pandas.

import numpy as np 
import pandas as pd 
from numpy.random import rand 

n_classes = 100 
classes = range(n_classes) 
activity_data = pd.DataFrame(columns=['Class','Activity'], data=list(zip(classes,rand(n_classes)))) 

weight_lookuptable = pd.DataFrame(index=classes, columns=classes, data=rand(n_classes,n_classes)) 
#Important for comprehension: the classes are both the indices and the columns. Every class has a relationship with every other class. 

I then want to perform this operation:

q =[sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']] 

Description: for each class, look up in the lookup table the weights between that class and every class, multiply them by the respective activities, and sum.

Is there a smarter way to do this so it runs faster? It is already quite fast, but I will be doing this millions of times and could really use an order of magnitude or two of speedup.

Maybe there is something clever to be done with activity_data['Class'] and the index. But clearly the biggest opportunity for gains is eliminating the for c in activity_data['Class'] component. I just don't know how to do that.


This is 2x faster, but the result is slightly different: `(activity_data['Activity'] * (activity_data['Class'].map(lambda x: weight_lookuptable[x]))).sum()` – EdChum 2014-10-09 10:23:07

Answer


IIUC, you could use dot, I think:

>>> q = [sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']] 
>>> new_q = activity_data["Activity"].dot(weight_lookuptable) 
>>> np.allclose(q, new_q) 
True 

This is much faster for me:

>>> %timeit q = [sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']] 
10 loops, best of 3: 28.8 ms per loop 
>>> %timeit new_q = activity_data["Activity"].dot(weight_lookuptable) 
1000 loops, best of 3: 218 µs per loop 

You can sometimes squeeze out a little more performance by dropping down to bare numpy (although then you have to be more careful to make sure your indices are aligned):

>>> %timeit new_q = activity_data["Activity"].values.dot(weight_lookuptable.values) 
10000 loops, best of 3: 43.4 µs per loop