2016-03-08 72 views
8

有一個C++相比,從列表的列表獲取列表的工會:The fastest way to find union of sets的最快方法 - Python的

而且還有其他幾個蟒蛇相關的問題,但沒有提出建立工會的名單最快的方法:

從答案,我收集了該療法e爲至少2種方式來做到這一點:

>>> from itertools import chain 
>>> x = [[1,2,3], [3,4,5], [1,7,8]] 
>>> list(set().union(*x)) 
[1, 2, 3, 4, 5, 7, 8] 
>>> list(set(chain(*x))) 
[1, 2, 3, 4, 5, 7, 8] 

請注意,我鑄造一套事後列出,因爲我需要在列表的順序是固定作進一步處理。

經過一番比較後,好像list(set(chain(*x)))更穩定,花費較少的時間:

from itertools import chain 
import time 
import random 

# Dry run. 
x = [[random.choice(range(10000)) 
    for i in range(10)] for j in range(10)] 
list(set().union(*x)) 
list(set(chain(*x))) 

y_time = 0 
z_time = 0 

for _ in range(1000): 
    x = [[random.choice(range(10000)) 
     for i in range(10)] for j in range(10)] 
    start = time.time() 
    y = list(set().union(*x)) 
    y_time += time.time() - start 
    #print 'list(set().union(*x)):\t', y_time 
    start = time.time() 
    z = list(set(chain(*x))) 
    z_time += time.time() - start 
    #print 'list(set(chain(*x))):\t', z_time 
    assert sorted(y) == sorted(z) 
    #print 

print y_time/1000. 
print z_time/1000. 

[出]:

1.39586925507e-05 
1.09834671021e-05 

取出鑄造套,以列表的變量:

y_time = 0 
z_time = 0 

for _ in range(1000): 
    x = [[random.choice(range(10000)) 
     for i in range(10)] for j in range(10)] 

    start = time.time() 
    y = set().union(*x) 
    y_time += time.time() - start 

    start = time.time() 
    z = set(chain(*x)) 
    z_time += time.time() - start 

    assert sorted(y) == sorted(z) 

print y_time/1000. 
print z_time/1000. 

[out]:

1.22241973877e-05 
1.02684497833e-05 

下面是完整的輸出,當我嘗試打印中間計時(不含名單鑄造):http://pastebin.com/raw/y3i6dXZ8

爲什麼是它list(set(chain(*x)))花費較少的時間比list(set().union(*x))

是否有另一種方法來實現相同的列表聯合?使用numpypandassframe什麼的?替代方案是否更快?

+0

的內部列表排序? – fl00r

+0

不,內部列表沒有明確排序。假定列表的輸入列表的順序爲未知。 – alvas

回答

7

什麼是最快取決於x本質 - 無論它是一個長長的清單或名單,有許多子列表或幾個子列表,子列表是否是長或短,是否有很多重複或幾個副本。

以下是一些timeit結果比較一些替代方案。有這麼多的可能性,這絕不是一個完整的分析,但也許這會給你一個研究你的用例的框架。

func     | x     | time 
unique_concatenate | many_uniques   | 0.863 
empty_set_union  | many_uniques   | 1.191 
short_set_union_rest | many_uniques   | 1.192 
long_set_union_rest | many_uniques   | 1.194 
set_chain   | many_uniques   | 1.224 

func     | x     | time 
long_set_union_rest | many_duplicates  | 0.958 
short_set_union_rest | many_duplicates  | 0.969 
empty_set_union  | many_duplicates  | 0.971 
set_chain   | many_duplicates  | 1.128 
unique_concatenate | many_duplicates  | 2.411 

func     | x     | time 
empty_set_union  | many_small_lists  | 1.023 
long_set_union_rest | many_small_lists  | 1.028 
set_chain   | many_small_lists  | 1.032 
short_set_union_rest | many_small_lists  | 1.036 
unique_concatenate | many_small_lists  | 1.351 

func     | x     | time 
long_set_union_rest | few_large_lists  | 0.791 
empty_set_union  | few_large_lists  | 0.813 
unique_concatenate | few_large_lists  | 0.814 
set_chain   | few_large_lists  | 0.829 
short_set_union_rest | few_large_lists  | 0.849 

一定要在自己的機器上運行timeit基準測試,因爲結果可能會有所不同。


from __future__ import print_function 
import random 
import timeit 
from itertools import chain 
import numpy as np 

def unique_concatenate(x): 
    return np.unique(np.concatenate(x)) 

def short_set_union_rest(x): 
    # This assumes x[0] is the shortest list in x 
    return list(set(x[0]).union(*x[1:])) 

def long_set_union_rest(x): 
    # This assumes x[-1] is the longest list in x 
    return list(set(x[-1]).union(*x[1:])) 

def empty_set_union(x): 
    return list(set().union(*x)) 

def set_chain(x): 
    return list(set(chain(*x))) 


big_range = list(range(10**7)) 
small_range = list(range(10**5)) 
many_uniques = [[random.choice(big_range) for i in range(j)] 
       for j in range(10, 10000, 10)] 
many_duplicates = [[random.choice(small_range) for i in range(j)] 
       for j in range(10, 10000, 10)] 
many_small_lists = [[random.choice(big_range) for i in range(10)] 
        for j in range(10, 10000, 10)] 
few_large_lists = [[random.choice(big_range) for i in range(1000)] 
        for j in range(10, 100, 10)] 

if __name__=='__main__': 
    for x, n in [('many_uniques', 1), ('many_duplicates', 4), 
       ('many_small_lists', 800), ('few_large_lists', 800)]: 
     timing = dict() 
     for func in [ 
       'unique_concatenate', 'short_set_union_rest', 'long_set_union_rest', 
       'empty_set_union', 'set_chain']: 
      timing[func, x] = timeit.timeit(
       '{}({})'.format(func, x), number=n, 
       setup='from __main__ import {}, {}'.format(func, x)) 
     print('{:20} | {:20} | {}'.format('func', 'x', 'time')) 
     for key, t in sorted(timing.items(), key=lambda item: item[1]): 
      func, x = key 
      print('{:20} | {:20} | {:.3f}'.format(func, x, t)) 
     print(end='\n')