2016-06-08 62 views
1

How can I group data based on value ranges?

I have data like this:

[312.281, 
370.401, 
254.245, 
272.256, 
312.325, 
286.243, 
271.231, ...] 

and I want to group the values by range, like this:

for i in data:
    if i in range(200, 300):
        data_200_300.append(i)
    elif i in range(300, 400):
        data_300_400.append(i)

It doesn't work correctly. What code should I use instead?

Answers

0

There is a faster option than the chain of if conditions or the lambda filter. It uses logical (boolean) indexing:

import numpy as np

def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    # Bin edges, e.g. [200, 300, 400] for start=200, end=400, step=100
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    # cond[:, i] is True where a value lies strictly inside bin i
    cond = np.zeros((x.size, bin_number), dtype=bool)
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i+1])
    return [list(x[cond[:, i]]) for i in range(bin_number)]

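I wrapped all the solutions posted so far, plus my own, in functions and ran each of them once under the line profiler (rkern/line_profiler). The last line of the script checks that all three outputs are identical. (This required a small change to my version, because it has to convert the input to a numpy array at the start and convert back to lists before returning.)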

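My version and the lambda version have another advantage: you can group into different bins just by changing the arguments, whereas with the first solution you would have to rewrite the if statements.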

import numpy as np

def forloop(x):
    data_200_300 = []
    data_300_400 = []
    for i in x:
        if 200 < i < 300:
            data_200_300.append(i)
        elif 300 < i < 400:
            data_300_400.append(i)
    return [data_200_300, data_300_400]


def lambdaversion(data, bin_start, bin_end, bin_step):
    filtered_data = []
    for i in range(bin_start, bin_end, bin_step):
        filtered_data.append(list(filter(lambda x: i < x < i+bin_step, data)))  # list(): filter() is lazy on Python 3
    return filtered_data


def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    cond = np.zeros((x.size, bin_number), dtype=bool)  # one boolean column per bin
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i+1])
    return [list(x[cond[:, i]]) for i in range(bin_number)]


#@profile  # uncomment when running with kernprof (line_profiler)
def run_all():
    n = 100000
    x = np.random.random_integers(200, 400, n) + np.random.ranf(n)  # random_integers is deprecated since NumPy 1.11; randint(200, 401, n) is the equivalent
    bin_start = 200
    bin_end = 400
    bin_step = 100
    a = forloop(x)
    b = lambdaversion(x, bin_start, bin_end, bin_step)
    c = indexingversion(x, bin_start, bin_end, bin_step)
    print('All the same? - ' + str(a == b == c))


if __name__ == '__main__':
    run_all()

Profiling output:

All the same? - True 
Wrote profile results to bla.py.lprof 
Timer unit: 1e-06 s 

Total time: 0.580098 s 
File: bla.py 
Function: run_all at line 32 

Line #      Hits         Time   Per Hit   % Time  Line Contents
==============================================================
    32                                            @profile
    33                                            def run_all():
    34         1            1       1.0      0.0      n = 100000
    35         1         3311    3311.0      0.6      x = np.random.random_integers(200, 400, n) + np.random.ranf(n)
    36         1            2       2.0      0.0      bin_start = 200
    37         1            1       1.0      0.0      bin_end = 400
    38         1            0       0.0      0.0      bin_step = 100
    39         1       263073  263073.0     45.3      a = forloop(x)
    40         1       301819  301819.0     52.0      b = lambdaversion(x, bin_start, bin_end, bin_step)
    41         1         7514    7514.0      1.3      c = indexingversion(x, bin_start, bin_end, bin_step)
    42         1         4377    4377.0      0.8      print('All the same? - ' + str(a == b == c))

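As you can see (in the Time and % Time columns), numpy indexing is roughly 35 to 40 times faster than the other two versions (7514 µs versus 263073 µs and 301819 µs), at least for 100,000 numbers. For very small inputs it is slower, though; on my machine it only starts to pay off at around 40 values.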
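NumPy also has a binning helper, `np.digitize`, which assigns every value to a bin in one call. The sketch below is an alternative to the indexing function above, not one of the versions that was profiled, so no timing is claimed for it. The name `digitize_version` and its `bin_edges` argument are made up for illustration. Note that it uses half-open bins `[lo, hi)`, so a value exactly equal to a lower edge is kept, unlike the strict comparisons above.

```python
import numpy as np

def digitize_version(data, bin_edges):
    """Group values into half-open bins [edges[k], edges[k+1]).

    Hypothetical helper for illustration; bin_edges must be increasing.
    """
    x = np.asarray(data, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    # digitize returns k such that edges[k-1] <= value < edges[k];
    # 0 means below the first edge, len(edges) means at or above the last.
    idx = np.digitize(x, edges)
    return [x[idx == k].tolist() for k in range(1, len(edges))]


data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231]
groups = digitize_version(data, [200, 300, 400])
print(groups)
# [[254.245, 272.256, 286.243, 271.231], [312.281, 370.401, 312.325]]
```

Values outside the outermost edges are dropped, which matches what the if/elif chain does for out-of-range values.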

3

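range produces only the integers between two numbers, whereas your data contains floats, so the `in range(...)` test never matches. You can use the comparison operators > and < directly: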

for i in data:
    if 200 < i < 300:
        data_200_300.append(i)
    elif 300 < i < 400:
        data_300_400.append(i)

If you want some of the bounds to be inclusive, you can use <= as well.
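To make the difference concrete, here is a small sketch. The sample list is shortened and the value 300.0 is added by me to show the boundary case; with strict `<` on both sides it would land in neither group, so the example uses an inclusive lower bound.

```python
data = [312.281, 370.401, 254.245, 300.0]

# A float with a fractional part is never a member of an integer range.
print(312.281 in range(300, 400))   # False
print(300 <= 312.281 < 400)         # True

# Inclusive lower bound, exclusive upper bound: every value in
# [200, 400) ends up in exactly one group.
data_200_300 = [v for v in data if 200 <= v < 300]
data_300_400 = [v for v in data if 300 <= v < 400]

print(data_200_300)  # [254.245]
print(data_300_400)  # [312.281, 370.401, 300.0]
```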

+0

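What if I want to group table-like data with columns, such as df = [ID, V1, V2, V3; 1, 12, 32, 23; 2, 65, 45, 22; 3, 55, 34, 76; ...]? If I want to group the rows based on the V3 column, what should I do?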
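One possible approach to the column case, sketched with the standard library only: keep each row intact and file it under the bin that its V3 value falls into. The row layout as dictionaries, the helper name `group_rows`, and the bin width of 50 are assumptions made for illustration; none of them come from the thread. If the data is already a pandas DataFrame, `pd.cut` combined with `groupby` is the usual tool for the same job.

```python
from collections import defaultdict

# The three rows from the comment, stored as dictionaries (an assumption).
rows = [
    {"ID": 1, "V1": 12, "V2": 32, "V3": 23},
    {"ID": 2, "V1": 65, "V2": 45, "V3": 22},
    {"ID": 3, "V1": 55, "V2": 34, "V3": 76},
]

def group_rows(rows, column, bin_step):
    """Group rows by the half-open bin [lo, lo + bin_step) of one column."""
    groups = defaultdict(list)
    for row in rows:
        lo = (row[column] // bin_step) * bin_step  # lower edge of the bin
        groups[(lo, lo + bin_step)].append(row)
    return dict(groups)


by_v3 = group_rows(rows, "V3", 50)
for (lo, hi), members in sorted(by_v3.items()):
    print(lo, hi, [r["ID"] for r in members])
```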

0

@AKS answered the question correctly. You can also try it with a lambda expression:

result = list(filter(lambda x: 200 < x < 300, data))  # list() is needed on Python 3, where filter() returns an iterator

If you have many such ranges, you can process your data with a loop like this (no numpy required):

filtered_data = []
for i in range(200, 400, 100):
    # list() evaluates the filter immediately, while i still has this value
    filtered_data.append(list(filter(lambda x: i < x < i+100, data)))

>>> filtered_data 
[[254.245, 272.256, 286.243, 271.231], [312.281, 370.401, 312.325]]
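The loop above scans the whole list once per range. When there are many ranges, a single pass that looks up the right bin for each value may be preferable. Below is a minimal pure-Python sketch using the standard `bisect` module; the function name `group_by_edges` is hypothetical, the edges must be sorted, and the bins are half-open `[lo, hi)` rather than strictly open as in the loop above.

```python
from bisect import bisect_right

def group_by_edges(data, edges):
    """Group values into half-open bins [edges[k], edges[k+1]).

    Hypothetical helper for illustration; edges must be sorted ascending.
    """
    groups = [[] for _ in range(len(edges) - 1)]
    for value in data:
        # Index of the last edge that is <= value, or -1 if none is.
        k = bisect_right(edges, value) - 1
        if 0 <= k < len(groups):   # skip values outside all bins
            groups[k].append(value)
    return groups


data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231]
grouped = group_by_edges(data, [200, 300, 400])
print(grouped)
# [[254.245, 272.256, 286.243, 271.231], [312.281, 370.401, 312.325]]
```

Because the edges are passed as a list, they do not have to be evenly spaced; for example `[200, 250, 300, 400]` works the same way.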