How to factorize a list of tuples?

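Definition
Factorization: map each unique object to a unique integer. Typically, the integers mapped to range from zero to n - 1, where n is the number of unique objects. Two variations are typical as well. Type 1 is where the numbering follows the order in which the unique objects are first identified. Type 2 is where the unique objects are sorted first and then the same process as in Type 1 is applied.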
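The two variations in the definition can be sketched in plain Python. This is only an illustration: the helper name `factorize` and its `sort` flag are made up for this example and are not the pandas API.

```python
def factorize(values, sort=False):
    """Return (codes, uniques) for a list of hashable values (hypothetical helper)."""
    # dict.fromkeys keeps the first-seen order of the keys (Python 3.7+).
    uniques = list(dict.fromkeys(values))
    if sort:
        uniques.sort()  # Type 2: sort the unique objects before numbering
    index = {value: code for code, value in enumerate(uniques)}
    return [index[v] for v in values], uniques

codes_seen, _ = factorize(["b", "a", "c", "b"])               # Type 1
codes_sorted, _ = factorize(["b", "a", "c", "b"], sort=True)  # Type 2
print(codes_seen)    # [0, 1, 2, 0]
print(codes_sorted)  # [1, 0, 2, 1]
```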

Setup
Consider the list of tuples tups

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

I want to factorize this list into

[0, 1, 2, 3, 4, 1, 2]

I know there are many ways to do this. However, I want to do it as efficiently as possible.

What I have tried

pandas.factorize, and got an error...

pd.factorize(tups)[0] 

--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-84-c84947ac948c> in <module>() 
----> 1 pd.factorize(tups)[0] 

//anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint) 
    553  uniques = vec_klass() 
    554  check_nulls = not is_integer_dtype(original) 
--> 555  labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls) 
    556 
    557  labels = _ensure_platform_int(labels) 

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21804)() 

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Or numpy.unique, and got an incorrect result...

np.unique(tups, return_inverse=1)[1] 

array([0, 1, 6, 7, 2, 3, 8, 4, 5, 9, 6, 7, 2, 3])
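The fourteen labels above are consistent with np.unique working on the individual elements rather than on the tuples. The plain-Python sketch below mimics that behaviour under two assumptions: that np.array(tups) turns every element into a string because the types are mixed, and that np.unique without an axis argument operates on the flattened array.

```python
tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

# Assumption: mixed ints and strings all become strings, and the 7x2 array
# is flattened to 14 elements before the unique values are computed.
flat = [str(x) for tup in tups for x in tup]
uniques = sorted(set(flat))
position = {value: i for i, value in enumerate(uniques)}
inverse = [position[x] for x in flat]
print(inverse)  # [0, 1, 6, 7, 2, 3, 8, 4, 5, 9, 6, 7, 2, 3]
```

One label per element instead of one per tuple is why the result has 14 entries rather than 7.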

I could use either of them on the hashes of the tuples

pd.factorize([hash(t) for t in tups])[0] 

array([0, 1, 2, 3, 4, 1, 2])

Yay! That's what I wanted... so what's the problem?

First Problem
Look at the performance drop caused by this technique

lst = [10, 7, 4, 33, 1005, 7, 4] 

%timeit pd.factorize(lst * 1000)[0] 
1000 loops, best of 3: 506 µs per loop 

%timeit pd.factorize([hash(i) for i in lst * 1000])[0] 
1000 loops, best of 3: 937 µs per loop

Second Problem
Hashes are not guaranteed to be unique!
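A small sketch of how a collision would corrupt the labels. It relies on a CPython detail (hash(-1) and hash(-2) are both -2, so the one-element tuples built from them collide as well); other interpreters may behave differently. The helper `labels_by_key` is hypothetical.

```python
def labels_by_key(items, key):
    """Number items by first appearance of key(item) (hypothetical helper)."""
    seen = {}
    return [seen.setdefault(key(x), len(seen)) for x in items]

tups = [(-1,), (-2,), (-1,)]

by_hash = labels_by_key(tups, hash)           # factorize the hashes
by_value = labels_by_key(tups, lambda t: t)   # factorize the tuples themselves

print(hash((-1,)) == hash((-2,)))  # True on CPython
print(by_hash)                     # on CPython: [0, 0, 0], distinct tuples merged
print(by_value)                    # [0, 1, 0]
```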

Question
What is a super fast way to factorize a list of tuples?

Timing
Both axes are in log space

Code

import numpy as np 
import pandas as pd 
from itertools import count 
from timeit import timeit 

def champ(tups): 
    d = {} 
    c = count() 
    return np.array(
     [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups] 
    ) 

def root(tups): 
    return pd.Series(tups).factorize()[0] 

def iobe(tups): 
    return np.unique(tups, return_inverse=True, axis=0)[1] 

def get_row_view(a): 
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:]))) 
    a = np.ascontiguousarray(a) 
    return a.reshape(a.shape[0], -1).view(void_dt).ravel() 

def diva(tups): 
    return np.unique(get_row_view(np.array(tups)), return_inverse=1)[1] 

def gdib(tups): 
    return pd.factorize([str(t) for t in tups])[0] 

from string import ascii_letters 

def tups_creator_1(size, len_of_str=3, num_ints_to_choose_from=1000, seed=None): 
    c = len_of_str 
    n = num_ints_to_choose_from 
    np.random.seed(seed) 
    d = pd.DataFrame(np.random.choice(list(ascii_letters), (size, c))).sum(1).tolist() 
    i = np.random.randint(n, size=size) 
    return list(zip(d, i)) 

results = pd.DataFrame(
    index=pd.Index([100, 1000, 5000, 10000, 20000, 30000, 40000, 50000], name='Size'), 
    columns=pd.Index('champ root iobe diva gdib'.split(), name='Method') 
) 

for i in results.index: 
    tups = tups_creator_1(i, max(1, int(np.log10(i))), max(10, i // 10)) 
    for j in results.columns: 
        stmt = '{}(tups)'.format(j) 
        setup = 'from __main__ import {}, tups'.format(j) 
        # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
        results.at[i, j] = timeit(stmt, setup, number=100)/100 

results.plot(title='Avg Seconds', logx=True, logy=True)

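Source

2017-05-26 piRSquared

Do you need to maintain that order? Would '[0, 3, 1, 4, 2, 3, 1]' be okay too? – Divakar

@Divakar I don't care at the moment. Pick whichever is more convenient. – piRSquared

I think we need a better benchmark, one that has both strings and numbers and is of course large enough, with the amount of repetition in line with the sample. – Divakar

Initialize your list of tuples as a Series, then call factorize: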

pd.Series(tups).factorize()[0] 

[0 1 2 3 4 1 2]

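Source

2017-05-26 18:40:31 root

This is probably the most elegant answer. My only doubt is the overhead of constructing the 'pd.Series'. – piRSquared

A simple way to do this is to use a dict to hold previous visits: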

>>> d = {} 
>>> [d.setdefault(tup, i) for i, tup in enumerate(tups)] 
[0, 1, 2, 3, 4, 1, 2]

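If you need to keep the numbers sequential, a slight change is needed. Start again from an empty d. The output below has eight labels, so tups appears to have had one more, previously unseen tuple appended when this was run:

>>> from itertools import count 
>>> d = {} 
>>> c = count() 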
>>> [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups] 
[0, 1, 2, 3, 4, 1, 2, 5]

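Or, alternatively written, again starting from a fresh d and c. Note that label 0 is falsy, so a repeat of the first tuple falls through to setdefault and consumes a value from the counter, which leaves a gap in the numbering:

>>> d, c = {}, count() 
>>> [d.get(tup) or d.setdefault(tup, next(c)) for tup in tups] 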
[0, 1, 2, 3, 4, 1, 2, 5]
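To make the difference between these three variants concrete, here is a sketch that wraps each one in a function so that every call starts from a fresh dict and counter. The function names and the `data` list are made up for illustration.

```python
from itertools import count

def labels_enumerate(tups):
    # The label is the position of the first occurrence, so gaps are possible.
    d = {}
    return [d.setdefault(tup, i) for i, tup in enumerate(tups)]

def labels_sequential(tups):
    # The counter only advances when a tuple is seen for the first time.
    d, c = {}, count()
    return [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]

def labels_get_or(tups):
    # Label 0 is falsy, so a repeat of the first tuple still calls next(c).
    d, c = {}, count()
    return [d.get(tup) or d.setdefault(tup, next(c)) for tup in tups]

data = [('a', 'b'), ('a', 'b'), (3, 4)]  # hypothetical input: first tuple repeats
print(labels_enumerate(data))   # [0, 0, 2]
print(labels_sequential(data))  # [0, 0, 1]
print(labels_get_or(data))      # [0, 0, 2]
```

Only the explicit `tup in d` test gives labels without gaps for every input.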

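Source

2017-05-26 18:29:43 AChampion

Updated, in case sequential numbering is necessary! – AChampion

This is a very good answer! Thanks again. See my updated question for timings. – piRSquared

Approach #1

Convert the tuples to rows of a 2D array, view each of those rows as a single scalar using the views concept of NumPy ndarrays, and finally factorize with np.unique(..., return_inverse=True) -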

np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]

get_row_view is taken from here.

Sample run -

In [23]: tups 
Out[23]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)] 

In [24]: np.unique(get_row_view(np.array(tups)), return_inverse=1)[1] 
Out[24]: array([0, 3, 1, 4, 2, 3, 1])

Approach #2

def argsort_unique(idx): 
    # Original idea : https://stackoverflow.com/a/41242285/3293881 
    n = idx.size 
    sidx = np.empty(n,dtype=int) 
    sidx[idx] = np.arange(n) 
    return sidx 

def unique_return_inverse_tuples(tups): 
    a = np.array(tups) 
    sidx = np.lexsort(a.T) 
    b = a[sidx] 
    mask0 = ~((b[1:,0] == b[:-1,0]) & (b[1:,1] == b[:-1,1])) 
    ids = np.concatenate(([0], mask0 )) 
    np.cumsum(ids, out=ids) 
    return ids[argsort_unique(sidx)]

Sample run -

In [69]: tups 
Out[69]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)] 

In [70]: unique_return_inverse_tuples(tups) 
Out[70]: array([0, 3, 1, 2, 4, 3, 1])
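The idea behind Approach #2 (sort, flag where adjacent rows differ, take a running count, then scatter the labels back to the original positions) can be sketched without NumPy. This is an illustration, not the code above: it assumes elements are compared as strings, as np.array(tups) would make them for mixed data, and it sorts by the first element first, whereas np.lexsort uses the last key as the primary one, so the labels differ from Out[70].

```python
def factorize_by_sorting(tups):
    """Sort-based factorization in plain Python (hypothetical helper)."""
    # Assumption: compare elements as strings so ints and strs can be ordered.
    keys = [tuple(str(x) for x in tup) for tup in tups]
    order = sorted(range(len(keys)), key=keys.__getitem__)  # plays the role of argsort
    labels = [0] * len(keys)
    current = 0
    for rank, idx in enumerate(order):
        # Start a new label whenever the key differs from the previous sorted key.
        if rank > 0 and keys[idx] != keys[order[rank - 1]]:
            current += 1
        labels[idx] = current  # write the label back at the original position
    return labels

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
print(factorize_by_sorting(tups))  # [0, 3, 1, 4, 2, 3, 1]
```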

Source

2017-05-26 18:33:42 Divakar

I haven't tested anything yet. However, I believe 'np.unique' sorts when 'return_inverse=1' is used. That makes this *O(n log n)*. Please correct me if I'm wrong. – piRSquared

@piRSquared Well, the array conversion itself looks like the bottleneck with this one. It looks like NumPy is the best choice here for mixed-type data. – Divakar

I don't know about the timings, but a simple approach would be to use numpy.unique along the respective axis.

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)] 
res = np.unique(tups, return_inverse=1, axis=0) 
print(res)

which yields

(array([['1', '2'], 
     ['3', '4'], 
     ['6', 'd'], 
     ['a', 'b'], 
     ['c', '5']], 
     dtype='|S11'), array([0, 3, 1, 4, 2, 3, 1], dtype=int64))

The array is sorted automatically, but that should not be a problem.
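If labels in order of first appearance are wanted after all, the sorted labels can be renumbered in one pass. A minimal sketch with a hypothetical helper name:

```python
def relabel_first_seen(codes):
    """Renumber labels so they follow the order of first appearance."""
    mapping = {}
    # len(mapping) is evaluated before insertion, so new labels are 0, 1, 2, ...
    return [mapping.setdefault(code, len(mapping)) for code in codes]

print(relabel_first_seen([0, 3, 1, 4, 2, 3, 1]))  # [0, 1, 2, 3, 4, 1, 2]
```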

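Source

2017-05-26 18:52:22 ImportanceOfBeingErnest

How did I miss the 'axis' parameter of 'np.unique'! Thanks! The same critique as for @Divakar's answer still applies. I believe this is *O(n log n)*, which won't be as fast as 'pd.factorize'. I'll test it and see. – piRSquared

This only works with numpy 1.13 ... 1.12 has no axis –

@GergesDib, that's why I missed it :-) – piRSquared

I was going to give this answer

pd.factorize([str(x) for x in tups])

However, after running some tests, it did not turn out to be the fastest of them all. Since I have already done the work, I will show the comparison here: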

@AChampion

%timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups] 
1000000 loops, best of 3: 1.66 µs per loop

@Divakar

%timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1] 
# 10000 loops, best of 3: 58.1 µs per loop

@self

%timeit pd.factorize([str(x) for x in tups]) 
# 10000 loops, best of 3: 65.6 µs per loop

@root

%timeit pd.Series(tups).factorize()[0] 
# 1000 loops, best of 3: 199 µs per loop

Edit

For larger data with 100K entries, we have:

tups = [(np.random.randint(0, 10), np.random.randint(0, 10)) for i in range(100000)]

@root

%timeit pd.Series(tups).factorize()[0] 
100 loops, best of 3: 10.9 ms per loop

@AChampion

%timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups] 

# 10 loops, best of 3: 16.9 ms per loop

@Divakar

%timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1] 
# 10 loops, best of 3: 81 ms per loop

@self

%timeit pd.factorize([str(x) for x in tups]) 
10 loops, best of 3: 87.5 ms per loop

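Source

2017-05-26 19:02:00

This is for small data; I'm sure it starts to look different with large data. – piRSquared

I updated it and also ran the tests on larger data. –

See my updated question for timings. – piRSquared

@AChampion's use of setdefault got me wondering whether defaultdict could be used for this problem. So, cribbing freely from AC's answer: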

In [189]: tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)] 

In [190]: import collections 
In [191]: import itertools 
In [192]: cnt = itertools.count() 
In [193]: dd = collections.defaultdict(lambda : next(cnt)) 

In [194]: [dd[t] for t in tups] 
Out[194]: [0, 1, 2, 3, 4, 1, 2]

Timings in other SO questions suggest that defaultdict is a bit slower than using setdefault directly. Still, the brevity of this approach is attractive.

In [196]: dd 
Out[196]: 
defaultdict(<function __main__.<lambda>>, 
      {(1, 2): 0, (3, 4): 2, ('a', 'b'): 1, (6, 'd'): 4, ('c', 5): 3})
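Wrapped in a function, the defaultdict idea might look like the sketch below. The function name is hypothetical, and returning the unique tuples alongside the labels is an addition for illustration that relies on dicts preserving insertion order (Python 3.7+).

```python
from collections import defaultdict
from itertools import count

def factorize_defaultdict(tups):
    """Label tuples by first appearance using a defaultdict (hypothetical helper)."""
    counter = count()
    # A missing key triggers the factory, which hands out the next integer.
    labels = defaultdict(lambda: next(counter))
    codes = [labels[tup] for tup in tups]
    return codes, list(labels)  # keys are in first-seen order

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
codes, uniques = factorize_defaultdict(tups)
print(codes)    # [0, 1, 2, 3, 4, 1, 2]
print(uniques)  # [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd')]
```

Creating the counter inside the function means repeated calls do not share state, unlike the interactive session above.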

Source

2017-05-27 15:44:17 hpaulj

I'll update my timings when I get a chance. – piRSquared
