There is a faster option than a chain of if conditions or a lambda filter. It uses logical (boolean) indexing:
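def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    # Bin edges, e.g. [200, 300, 400] for start=200, end=400, step=100
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    # One boolean column per bin: True where the value lies strictly inside it
    cond = np.zeros((x.size, bin_number), dtype=bool)
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i+1])
    # Select the values of each bin with its mask and convert back to lists
    return [list(x[cond[:, i]]) for i in range(bin_number)]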
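To make the indexing step concrete, here is a minimal sketch of boolean (logical) indexing on a handful of values. The sample numbers are made up for illustration and are not from the benchmark.

```python
import numpy as np

# Illustrative sample values (an assumption, not taken from the benchmark).
x = np.array([210.5, 305.0, 399.9, 250.0, 300.0])

# Comparing an array with a scalar yields a boolean mask of the same shape.
mask = np.logical_and(200 < x, x < 300)

# Indexing with the mask keeps only the elements where the mask is True.
selected = x[mask]
print(mask.tolist())      # [True, False, False, True, False]
print(selected.tolist())  # [210.5, 250.0]
```

Note that 300.0 is not selected, because both comparisons are strict.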
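I have put all the solutions so far, plus my own version, into functions and run them all once under the line profiler (rkern/line_profiler). The last line shows that all three outputs are identical (which changed my version slightly, because I have to convert the input to a numpy array at the start and convert the result back to lists at the end).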
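My version and the lambda version have one more advantage: you can bin the data into other bins just by changing the arguments, whereas in the first solution you would have to rewrite the if-statements.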
import numpy as np

def forloop(x):
    data_200_300 = []
    data_300_400 = []
    for i in x:
        if 200 < i < 300:
            data_200_300.append(i)
        elif 300 < i < 400:
            data_300_400.append(i)
    return [data_200_300, data_300_400]

def lambdaversion(data, bin_start, bin_end, bin_step):
    filtered_data = []
    for i in range(bin_start, bin_end, bin_step):
        # list(...) so this also works on Python 3, where filter returns an iterator
        filtered_data.append(list(filter(lambda x: i < x < i+bin_step, data)))
    return filtered_data

def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    cond = np.zeros((x.size, bin_number), dtype=bool)
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i+1])
    return [list(x[cond[:, i]]) for i in range(bin_number)]
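
#@profile  # uncomment when running under kernprof (line_profiler)
def run_all():
    n = 100000
    # random_integers is deprecated in newer NumPy releases;
    # np.random.randint(200, 401, n) draws from the same inclusive range.
    x = np.random.random_integers(200, 400, n) + np.random.ranf(n)
    bin_start = 200
    bin_end = 400
    bin_step = 100
    a = forloop(x)
    b = lambdaversion(x, bin_start, bin_end, bin_step)
    c = indexingversion(x, bin_start, bin_end, bin_step)
    print('All the same? - ' + str(a == b == c))

if __name__ == '__main__':
    run_all()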
Profiler output:
All the same? - True
Wrote profile results to bla.py.lprof
Timer unit: 1e-06 s
Total time: 0.580098 s
File: bla.py
Function: run_all at line 32
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def run_all():
    34         1            1      1.0      0.0      n = 100000
    35         1         3311   3311.0      0.6      x = np.random.random_integers(200, 400, n) + np.random.ranf(n)
    36         1            2      2.0      0.0      bin_start = 200
    37         1            1      1.0      0.0      bin_end = 400
    38         1            0      0.0      0.0      bin_step = 100
    39         1       263073 263073.0     45.3      a = forloop(x)
    40         1       301819 301819.0     52.0      b = lambdaversion(x, bin_start, bin_end, bin_step)
    41         1         7514   7514.0      1.3      c = indexingversion(x, bin_start, bin_end, bin_step)
    42         1         4377   4377.0      0.8      print('All the same? - ' + str(a == b == c))
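As you can see (in the Time or % Time column), the numpy indexing is faster by a factor of roughly 35 to 40, at least for 100,000 numbers. For very small inputs it will be slower, though (on my machine the break-even point is at around 40 values).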
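If you want to look for the break-even point on your own machine, a rough sketch along these lines could be used. The input sizes, the repeat count, and the use of `timeit` with `np.random.default_rng` are my assumptions here, not part of the original measurement, and the absolute numbers will vary between machines.

```python
import timeit

import numpy as np


def forloop(x):
    data_200_300 = []
    data_300_400 = []
    for i in x:
        if 200 < i < 300:
            data_200_300.append(i)
        elif 300 < i < 400:
            data_300_400.append(i)
    return [data_200_300, data_300_400]


def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    cond = np.zeros((x.size, bin_number), dtype=bool)
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i+1])
    return [list(x[cond[:, i]]) for i in range(bin_number)]


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
timings = {}
for n in (10, 100, 10000):  # hypothetical sizes around the expected break-even
    x = rng.uniform(200, 400, n)
    # Time 20 calls of each version on the same input.
    t_loop = timeit.timeit(lambda: forloop(x), number=20)
    t_index = timeit.timeit(lambda: indexingversion(x, 200, 400, 100), number=20)
    timings[n] = (t_loop, t_index)
    print(f"n={n:>6}: forloop {t_loop:.5f} s, indexing {t_index:.5f} s")
```

The fixed cost of building the arrays and masks dominates for tiny inputs, which is why the plain loop can win there.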
What if I want to group by a column, e.g. DF = [ID, V1, V2, V3; 1, 12, 32, 23; 2, 65, 45, 22; 3, 55, 34, 76; ...]? If I want to group based on the V3 column, what should I do?