Looks like a job for np.unique:
uniq, inv = np.unique(x, return_inverse=True)
result = np.zeros((len(x), len(uniq)), dtype=int)
result[np.arange(len(x)), inv] = 1
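To make the indexing step concrete, here is a small sketch on a made-up input (the array `x` below is only an illustration, not data from the question). It shows what `uniq` and `inv` contain and how the fancy-indexed assignment places a single 1 in each row.

```python
import numpy as np

# Hypothetical sample input; any 1-D array of sortable labels should work.
x = np.array(['b', 'a', 'c', 'a'])

# uniq holds the sorted distinct values; inv maps each element of x to its
# position in uniq, so uniq[inv] reconstructs x.
uniq, inv = np.unique(x, return_inverse=True)

result = np.zeros((len(x), len(uniq)), dtype=int)
# For each row i, set column inv[i] to 1 (one write per sample).
result[np.arange(len(x)), inv] = 1

print(uniq)    # ['a' 'b' 'c']
print(inv)     # [1 0 2 0]
print(result)
```

Each row of `result` has its 1 in the column of the matching entry in `uniq`, so the columns follow sorted order rather than order of first appearance.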
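In response to @Divakar's benchmark: here is a more informative comparison. It confirms a slight speed advantage for dv at small alphabets, a crossover around K=20, and a reversal into a several-fold advantage for pp at K=1000. This is as expected, because pp exploits the sparseness of one-hot encoding. Below, K is the size of the alphabet and N is the length of the sample.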
import numpy as np
from timeit import timeit

def pp(x):
    # Index-based one-hot: one write per sample.
    uniq, inv = np.unique(x, return_inverse=True)
    result = np.zeros((len(x), len(uniq)), dtype=int)
    result[np.arange(len(x)), inv] = 1
    return result

def dv(x):
    # Broadcast comparison: N x K equality tests.
    return (x[:,None] == np.unique(x)).astype(int)

for K in (4, 10, 20, 40, 100, 200, 1000):
    tpp, tdv = [], []
    print('@ K =', K)
    for N in (1000, 10000, 100000):
        # N samples drawn from an alphabet of K random floats
        data = np.random.choice(np.random.random(K), N, replace=True)
        tdv.append(timeit('f(a)', number=100, globals={'f': dv, 'a': data}))
        tpp.append(timeit('f(a)', number=100, globals={'f': pp, 'a': data}))
    print('dv:', '{:.6f}, {:.6f}, {:.6f}'.format(*tdv), 'secs for 100 trials @ N = 1000, 10000, 100000')
    print('pp:', '{:.6f}, {:.6f}, {:.6f}'.format(*tpp), 'secs for 100 trials @ N = 1000, 10000, 100000')
Prints:
@ K = 4
dv: 0.003458, 0.038176, 0.421894 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.004856, 0.052298, 0.603758 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 10
dv: 0.005136, 0.056491, 0.663157 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.005955, 0.054069, 0.719152 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 20
dv: 0.007201, 0.084867, 0.988886 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.007638, 0.084580, 0.891122 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 40
dv: 0.010748, 0.130974, 1.498022 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.009321, 0.103912, 1.080271 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 100
dv: 0.025357, 0.292930, 2.946326 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.011916, 0.147117, 1.641588 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 200
dv: 0.033651, 0.560753, 6.042001 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.022971, 0.221142, 3.580255 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 1000
dv: 0.156715, 2.655647, 37.112166 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.055516, 0.920938, 10.358050 secs for 100 trials @ N = 1000, 10000, 100000
Using uint8 and allowing @Divakar's approach to use cheaper view casting:
@ K = 4
dv: 0.003092, 0.038149, 0.386140 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.004392, 0.043327, 0.554253 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 10
dv: 0.004604, 0.054215, 0.501708 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.004930, 0.051555, 0.607239 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 20
dv: 0.006421, 0.067397, 0.665465 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.006616, 0.054055, 0.703260 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 40
dv: 0.008857, 0.087155, 0.862316 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.006945, 0.060408, 0.733966 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 100
dv: 0.015660, 0.142464, 1.426929 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.008063, 0.070860, 0.908615 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 200
dv: 0.025631, 0.235712, 2.401750 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.008805, 0.101772, 1.111652 secs for 100 trials @ N = 1000, 10000, 100000
@ K = 1000
dv: 0.069953, 1.024585, 11.313402 secs for 100 trials @ N = 1000, 10000, 100000
pp: 0.011558, 0.182684, 2.201837 secs for 100 trials @ N = 1000, 10000, 100000
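The code for the uint8 variants is not shown above. As a rough sketch of what they might look like (the names `pp_u8` and `dv_u8` are hypothetical, and the exact code that was timed may differ), the index-based version only needs a different dtype, while the comparison-based version can reinterpret its boolean result as uint8 through a view instead of converting it with `astype`:

```python
import numpy as np

def pp_u8(x):
    # Same index-based approach, but with a one-byte output dtype.
    uniq, inv = np.unique(x, return_inverse=True)
    result = np.zeros((len(x), len(uniq)), dtype=np.uint8)
    result[np.arange(len(x)), inv] = 1
    return result

def dv_u8(x):
    # A boolean array uses one byte per element, so it can be
    # reinterpreted as uint8 with a view rather than a converting copy.
    return (x[:, None] == np.unique(x)).view(np.uint8)

# Hypothetical sample input for a quick consistency check.
x = np.array([0.5, 0.1, 0.5, 0.9])
a, b = pp_u8(x), dv_u8(x)
print(a.dtype, b.dtype)
print((a == b).all())
```

The view avoids the extra pass over the N x K array that `astype(int)` requires, which would be consistent with dv improving more than pp in the second set of timings.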
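Have you profiled the code to see where the bottlenecks are? By using the vanilla 'list' and 'set' classes, I feel you may be negating some of the benefits numpy can offer. There may be numpy equivalents that do what you want and make the operation faster. –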
@PaulRooney Not sure about the profiler; if I'm reading it right, is it the range that takes the longest? I've already [...] in the question – UberStuper