
A set built from a dictionary's keys seems to be slightly faster for membership checks: checking for existence in a dict vs. a set in Python.

import random 
import string 
import timeit 

repeat = 3 
numbers = 1000 

def time(statement, _setup=None): 
    print min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers))

random.seed('slartibartfast') 

# Integers 
length = 100000 
d = {} 
for _ in range(length): 
    d[random.randint(0, 10000000)] = 0 
s = set(d) 

setup = """from __main__ import s, d, length 
""" 

time('for i in xrange(length): check = i in d') 
time('for i in xrange(length): check = i in s') 

# Strings 
d = {} 
for _ in range(length): 
    d[''.join(random.choice(string.ascii_uppercase) for __ in range(16))] = 0 
s = set(d) 

test_strings = [] 
for _ in range(length): 
    # random 16-character strings to look up 
    test_strings.append(''.join(
        random.choice(string.ascii_uppercase) for __ in range(16)))

setup = """from __main__ import s, d, length, test_strings 
""" 

time('for i in test_strings: check = i in d') 
time('for i in test_strings: check = i in s') 

It prints:

10.1242966769 
9.73939713014 
10.5156763102 
10.2767765061 

Is this to be expected, or a random artifact?

I am wondering whether it is worth creating a set of the dictionary's keys in performance-intensive code.

EDIT: These measurements really make me wonder about the underlying implementation. I am not trying to save microseconds, I am just curious. And yes, if it turns out that the underlying implementation really does favor sets, I could build a set of those dictionary keys, or not (I am actually patching legacy code).
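
For what it's worth, on Python 2.7 (which the xrange and print statements above imply) a set-like view of the keys can be had without copying them at all; a minimal sketch, not part of the original measurements:

d = {1: 'a', 2: 'b'} 
keys = d.viewkeys()  # set-like view over the dict's own hash table, no copy 

print 1 in keys      # True: the same O(1) hash lookup as "1 in d" 
print keys & {2, 3}  # set([2]): key views support set operations 

d[3] = 'c' 
print 3 in keys      # True: the view reflects later insertions 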


Are you checking an empty set & dict? Because you never actually use "random". Is that intentional? –


If performance matters so much that fractions of a millisecond make a difference, then Python is probably the wrong language to use. That is 0.2 seconds saved per 100,000 iterations, traded against readability. –


@Jean-FrançoisFabre: copy-paste daemon –

Answers


Probably depends on a variety of things. In my runs, dictionary lookups were slightly faster, but not by enough to get excited about:

In [1]: import numpy as np 

In [2]: d = {i: True for i in np.random.random(1000)} 

In [3]: s = {i for i in np.random.random(1000)} 

In [4]: checks = d.keys()[:500] + list(s)[:500] 

In [5]: %timeit [k in d for k in checks] 
10000 loops, best of 3: 83 µs per loop 

In [6]: %timeit [k in s for k in checks] 
10000 loops, best of 3: 88.4 µs per loop 

In [7]: d = {i: True for i in np.random.random(100000)} 

In [8]: s = {i for i in np.random.random(100000)} 

In [9]: checks = d.keys()[:5000] + list(s)[:5000] 

In [10]: %timeit [k in d for k in checks] 
1000 loops, best of 3: 865 µs per loop 

In [11]: %timeit [k in s for k in checks] 
1000 loops, best of 3: 929 µs per loop 
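
Under the hood, both containers are CPython open-addressing hash tables, which is why the lookup times above are nearly identical; a dict slot just carries an extra value pointer. As a rough illustration (a sketch, not part of the answer; exact numbers vary by build):

import sys 
import random 

d = {random.random(): True for _ in range(100000)} 
s = set(d) 

# Both report the size of their internal hash table; the dict is 
# larger only because each slot also stores a value pointer. 
print sys.getsizeof(d) 
print sys.getsizeof(s) 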

Honestly, it depends heavily on hardware, operating system, and the size/constraints of the data. In general, performance will be nearly identical until you get to really large data sizes. Note the few runs here where the dict does slightly better. At larger data-structure sizes, internal implementation details start to dominate the difference, and on my machine the set tends to perform better.

The reality is that in most cases the delta does not matter. If you really want faster lookups, consider using cython or moving to C-level operations, or use a library implementation designed for larger data sizes. Python's base types are not meant for performance once you reach a few million elements.

>>> # With empty dict as setup in question 
>>> time('for i in xrange(length): check = i in d') 
2.83035111427 
>>> time('for i in xrange(length): check = i in s') 
2.87069892883 
>>> d = { random.random(): None for _ in xrange(100000) } 
>>> s = set(d) 
>>> time('for i in xrange(length): check = i in d') 
3.84766697884 
>>> time('for i in xrange(length): check = i in s') 
3.97955989838 
>>> d = { random.randint(0, 1000000000): None for _ in xrange(100000) } 
>>> s = set(d) 
>>> time('for i in xrange(length): check = i in d') 
3.96871709824 
>>> time('for i in xrange(length): check = i in s') 
3.62110710144 
>>> d = { random.randint(0, 1000000000): None for _ in xrange(10000000) } 
>>> s = set(d) 
>>> time('for i in xrange(length): check = i in d') 
10.6934559345 
>>> time('for i in xrange(length): check = i in s') 
5.7491569519
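
Following the suggestion above to use a library built for larger data sizes: numpy can answer many membership queries in one vectorized call instead of a Python-level loop; a minimal sketch under that assumption, not part of the original answer:

import numpy as np 

keys = np.random.randint(0, 1000000000, size=1000000) 
candidates = np.random.randint(0, 1000000000, size=100000) 

# np.in1d returns a boolean mask telling, for each candidate, 
# whether it occurs in keys -- all 100,000 queries in one call. 
mask = np.in1d(candidates, keys) 
print mask.sum()  # how many candidates were found among the keys 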