2017-05-13

I need a solution for extracting the unique elements of a non-unique list together with a count of their duplicates. Why is counting with a plain dictionary faster than collections.Counter in some cases?

The solution is meant to be used in an algorithm that creates unique combinations from a non-unique list. The lists the combinations are created from are usually very small (fewer than 50 elements), but my goal is to find the overall fastest code, optimizing wherever possible (even if it saves only a tiny amount of runtime).

Python's collections module provides a class specialized for exactly this purpose, collections.Counter, but there are clearly cases in which using a simple dictionary instead of collections.Counter leads to faster code, as you can check yourself with the code below:

from time import time as t 
from timeit import timeit as tt 

from collections import Counter 
def counter(iterable):
    dctCounter = {}
    for item in iterable:
        if item in dctCounter:
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
    return dctCounter

for n, N in [(1,10), (10,1), (1,50), (50,1), (1,100), (100,1), (1,200), (200, 1), (1, 500), (500, 1), (1, 1000), (1000,1)]:
    lstItems = n*list(range(N))
    for noLoops in [10**p for p in range(5, 6)]:
        s = t()
        for _ in range(noLoops):
            dctCounter = counter(lstItems)
        e = t()
        timeDctFctn = e - s
        s = t()
        for _ in range(noLoops):
            objCounter = Counter(lstItems)
        e = t()
        timeCollCtr = e - s
        timeitCollCtr = tt("objCounter=Counter(lstItems)", "from __main__ import Counter, lstItems", number=noLoops)
        timeitDctFctn = tt("dctCounter=counter(lstItems)", "from __main__ import counter, lstItems", number=noLoops)
        print("collections.Counter(): {:7.5f}, def counter(): {:7.5f} sec. lstSize: {:3}, %uniq: {:3.0f}, ({} timitLoops)".format(timeitCollCtr, timeitDctFctn, n*N, 100.0/n, noLoops))

Here is the output:

python3.6 -u "collections.Counter-vs-dictionaryAsCounter_Cg.py" 
collections.Counter(): 0.36461, def counter(): 0.09592 sec. lstSize: 10, %uniq: 100, (100000 timitLoops) 
collections.Counter(): 0.36444, def counter(): 0.12286 sec. lstSize: 10, %uniq: 10, (100000 timitLoops) 
collections.Counter(): 0.58627, def counter(): 0.43233 sec. lstSize: 50, %uniq: 100, (100000 timitLoops) 
collections.Counter(): 0.52399, def counter(): 0.54106 sec. lstSize: 50, %uniq: 2, (100000 timitLoops) 
collections.Counter(): 0.82332, def counter(): 0.81436 sec. lstSize: 100, %uniq: 100, (100000 timitLoops) 
collections.Counter(): 0.72513, def counter(): 1.06823 sec. lstSize: 100, %uniq: 1, (100000 timitLoops) 
collections.Counter(): 1.27130, def counter(): 1.59476 sec. lstSize: 200, %uniq: 100, (100000 timitLoops) 
collections.Counter(): 1.13817, def counter(): 2.14566 sec. lstSize: 200, %uniq: 0, (100000 timitLoops) 
collections.Counter(): 3.16287, def counter(): 4.26738 sec. lstSize: 500, %uniq: 100, (100000 timitLoops) 
collections.Counter(): 2.64247, def counter(): 5.67448 sec. lstSize: 500, %uniq: 0, (100000 timitLoops) 
collections.Counter(): 4.89153, def counter(): 7.68661 sec. lstSize:1000, %uniq: 100, (100000 timitLoops) 
collections.Counter(): 6.06389, def counter():13.92613 sec. lstSize:1000, %uniq: 0, (100000 timitLoops) 
>Exit code: 0 

PS: It seems collections.Counter() also fails to live up to expectations in another context. See here: https://stackoverflow.com/questions/41594940/why-is-collections-counter-so-slow


Also, 50 elements is really far too small a list to test with. Start with at least a thousand or ten thousand elements. –


@MartijnPieters See the updated question, which takes your remarks into account and makes your comment obsolete. – Claudio

Answer


Counter has one major bottleneck when you count short iterables: it checks if isinstance(iterable, Mapping). This test is rather slow because collections.abc.Mapping is an abstract metaclass, so such an isinstance check is a bit more involved than a normal isinstance check; see for example: "Why is checking isinstance(something, Mapping) so slow?"

So it is not really surprising if other approaches are faster for short iterables. However, for long iterables the check hardly matters, and Counter should be faster (at least on python-3.x (CPython), where the actual counting function _count_elements is written in C).
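The relative cost of that abstract check can be seen directly. A minimal sketch (the absolute numbers are machine-dependent):

```python
from collections.abc import Mapping
from timeit import timeit

lst = list(range(10))

# A check against the abstract Mapping class goes through
# ABCMeta.__instancecheck__, while a check against the concrete
# dict class is a plain type lookup and is typically much cheaper.
t_abstract = timeit(lambda: isinstance(lst, Mapping), number=100_000)
t_concrete = timeit(lambda: isinstance(lst, dict), number=100_000)

print(t_abstract, t_concrete)
```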

A simple way to find the bottleneck is profiling. I'll use line_profiler here:

%load_ext line_profiler 

from collections import Counter 
x = range(50) 
# Profile the function Counter.update when executing the command "Counter(x)" 
%lprun -f Counter.update Counter(x) 

The result:

Line #  Hits   Time Per Hit % Time Line Contents 
============================================================== 
    604   1   8  8.0  3.9   if not args: 
    605              raise TypeError("descriptor 'update' of 'Counter' object " 
    606                  "needs an argument") 
    607   1   13  13.0  6.4   self, *args = args 
    608   1   6  6.0  3.0   if len(args) > 1: 
    609              raise TypeError('expected at most 1 arguments, got %d' % len(args)) 
    610   1   5  5.0  2.5   iterable = args[0] if args else None 
    611   1   3  3.0  1.5   if iterable is not None: 
    612   1   94  94.0  46.3    if isinstance(iterable, Mapping): 
    613               if self: 
    614                self_get = self.get 
    615                for elem, count in iterable.items(): 
    616                 self[elem] = count + self_get(elem, 0) 
    617               else: 
    618                super(Counter, self).update(iterable) # fast path when counter is empty 
    619              else: 
    620   1   69  69.0  34.0     _count_elements(self, iterable) 
    621   1   5  5.0  2.5   if kwds: 
    622              self.update(kwds) 

So the time it takes to initialize a Counter from an iterable has a rather big constant factor (46% of the time is spent in the isinstance check; counting with the dictionary only takes 34%).

For long iterables, however, it doesn't matter (much), because the check is only done once:

%lprun -f Counter.update Counter([1]*100000) 

Line #  Hits   Time Per Hit % Time Line Contents 
============================================================== 
    604   1   12  12.0  0.0   if not args: 
    605              raise TypeError("descriptor 'update' of 'Counter' object " 
    606                  "needs an argument") 
    607   1   12  12.0  0.0   self, *args = args 
    608   1   6  6.0  0.0   if len(args) > 1: 
    609              raise TypeError('expected at most 1 arguments, got %d' % len(args)) 
    610   1   6  6.0  0.0   iterable = args[0] if args else None 
    611   1   3  3.0  0.0   if iterable is not None: 
    612   1   97  97.0  0.3    if isinstance(iterable, Mapping): 
    613               if self: 
    614                self_get = self.get 
    615                for elem, count in iterable.items(): 
    616                 self[elem] = count + self_get(elem, 0) 
    617               else: 
    618                super(Counter, self).update(iterable) # fast path when counter is empty 
    619              else: 
    620   1  28114 28114.0  99.5     _count_elements(self, iterable) 
    621   1   13  13.0  0.0   if kwds: 
    622              self.update(kwds) 
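One rough way to see that amortization in numbers, assuming the per-call overhead stays roughly constant (a sketch; timings are machine-dependent):

```python
from collections import Counter
from timeit import timeit

short = list(range(10))              # 10 elements
long_ = list(range(10)) * 10_000     # 100,000 elements

# The fixed per-call overhead (argument unpacking, the Mapping check)
# dominates for the short list but is negligible for the long one,
# so the per-element cost drops sharply for long inputs.
per_elem_short = timeit(lambda: Counter(short), number=1_000) / (1_000 * len(short))
per_elem_long = timeit(lambda: Counter(long_), number=10) / (10 * len(long_))

print(per_elem_short, per_elem_long)
```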

Just to give you an overview of how these perform depending on the number of elements, I compared an optimized version of the counting as well as the _count_elements function that Counter uses. I excluded the part where you sorted the items and created the list of counts, to avoid other effects - especially because sorted has a different runtime behavior (O(n log(n))) than counting (O(n)):

# Setup 

import random 
from collections import Counter, _count_elements 

def count(iterable):
    """Explicit iteration over the items."""
    dctCounter = {}
    for item in iterable:
        if item in dctCounter:
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
    return dctCounter

def count2(iterable):
    """Iterating over the indices."""
    dctCounter = {}
    lenLstItems = len(iterable)
    for idx in range(lenLstItems):
        item = iterable[idx]
        if item in dctCounter.keys():
            dctCounter[item] += 1
        else:
            dctCounter[item] = 1
    return dctCounter

def c_count(iterable):
    """Internal counting function that's used by Counter."""
    d = {}
    _count_elements(d, iterable)
    return d

# Timing 

timings = {Counter: [], count: [], count2: [], c_count: []} 

for i in range(1, 20):
    print(2**i)
    it = [random.randint(0, 2**i) for _ in range(2**i)]
    for func in (Counter, count, count2, c_count):
        res = %timeit -o func(it)
        timings[func].append(res)

# Plotting 

%matplotlib notebook 

import matplotlib.pyplot as plt 
import numpy as np 

fig = plt.figure(1) 
ax = plt.subplot(111) 

n = 2**np.arange(1, 20)   # must match the 19 sizes used in the timing loop

ax.plot(n, 
     [time.average for time in timings[count]], 
     label='my custom function', c='red') 
ax.plot(n, 
     [time.average for time in timings[count2]], 
     label='your custom function', c='green') 
ax.plot(n, 
     [time.average for time in timings[Counter]], 
     label='Counter', c='blue') 
ax.plot(n, 
     [time.average for time in timings[c_count]], 
     label='_count_elements', c='purple') 
ax.set_xscale('log') 
ax.set_yscale('log') 
ax.set_xlabel('elements') 
ax.set_ylabel('time to count them [seconds]') 
ax.grid(which='both') 
ax.legend() 
plt.tight_layout() 

And the result:

(plot: runtime of the four counting functions versus number of elements, log-log axes)

The individual timings:

2 
30.5 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
1.67 µs ± 3.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 
6.03 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
1.67 µs ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 
4 
30.7 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
2.63 µs ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
7.81 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
1.97 µs ± 5.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 
8 
34.3 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
4.3 µs ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
11.3 µs ± 23.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
3 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
16 
34.2 µs ± 599 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
7.46 µs ± 42 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
17.5 µs ± 83.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
4.24 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
32 
38.4 µs ± 578 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
13.7 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
29.8 µs ± 383 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
7.56 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
64 
43.5 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
24 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
52.8 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
11.6 µs ± 83.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 
128 
53.5 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
47.8 µs ± 507 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
101 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
21.7 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
256 
69.6 µs ± 239 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
92.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
188 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
39.5 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
512 
123 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
200 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
409 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
90.9 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
1024 
230 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
428 µs ± 5.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
855 µs ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
193 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 
2048 
436 µs ± 7.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
868 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
1.76 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
386 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
4096 
830 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
1.8 ms ± 33.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
3.75 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
1.06 ms ± 89.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
8192 
2.3 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
3.8 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
7.8 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
1.69 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 
16384 
4.53 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
8.22 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
15.9 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
3.9 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
32768 
9.6 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
17.2 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
32.5 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
10.4 ms ± 687 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 
65536 
24.8 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
40.1 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
66.8 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
24.5 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
131072 
54.6 ms ± 756 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
84.2 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
142 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
54.1 ms ± 424 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
262144 
120 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
182 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
296 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
117 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) 
524288 
244 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
368 ms ± 2.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
601 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
252 ms ± 6.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
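For completeness, two further pure-Python counting idioms are worth benchmarking in the same harness (a sketch; which one wins depends on the Python version and on the share of duplicates in the list):

```python
from collections import defaultdict

def count_get(iterable):
    """Count via dict.get - one lookup with a default instead of a
    membership test followed by a second lookup."""
    dct = {}
    for item in iterable:
        dct[item] = dct.get(item, 0) + 1
    return dct

def count_defaultdict(iterable):
    """Count via defaultdict(int) - missing keys start at 0 automatically."""
    dct = defaultdict(int)
    for item in iterable:
        dct[item] += 1
    return dct

data = [1, 2, 2, 3, 3, 3]
print(count_get(data))                # {1: 1, 2: 2, 3: 3}
print(dict(count_defaultdict(data)))  # {1: 1, 2: 2, 3: 3}
```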

@Claudio My 'count' implementation is much faster than yours, which may explain why the time differences you measured were even reversed. If you want, I can include your function in my timings as well; let me know whether I should update the answer. (You could also try running the code from the answer to verify the results - IPython isn't a complicated dependency :)) – MSeifert


According to my latest findings, not only the size of the list but also the number of duplicates it contains affects which counting method is faster. Apart from that, I find it hard to trust the 'timeit' results. – Claudio


@Claudio You may want to look at the note on ['repeat'](https://docs.python.org/library/timeit.html#timeit.Timer.repeat): "It's tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python's speed, but by other processes interfering with your timing accuracy." (1/2) – MSeifert
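The quoted advice translates into code roughly like this (a minimal sketch using timeit.repeat and min):

```python
from collections import Counter
from timeit import repeat

lst = list(range(50))

# repeat() runs the whole measurement several times; min() is the
# most robust summary, because the slower runs are usually caused by
# other processes interfering, not by variability in Python itself.
times = repeat("Counter(lst)", globals={"Counter": Counter, "lst": lst},
               repeat=5, number=10_000)
print(min(times))
```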
