Efficiently count duplicate values in a numpy column and append the counts

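I have a dataset that represents a directed graph. The first column is the source node, the second column is the target node, and we can ignore the third column (essentially a weight).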

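What I want to do is append each node's out-degree to its rows. For example, if I had only added the out-degree for node 0, I would have: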

0 1 3 5 
0 13 1 5 
0 37 1 5 
0 51 1 5 
0 438481 1 5 
1 0 3 
...

I have some code that does this, but it is very slow because I am using a for loop:

import numpy as np 

def save_degrees(X):
    new_col = np.zeros(X.shape[0], dtype=int)
    X = np.column_stack((X, new_col))
    node_ids, degrees = np.unique(X[:, 0], return_counts=True)
    # This is the slow part.
    for node_id, deg in zip(node_ids, degrees):
        indices = X[:, 0] == node_id
        X[:, -1][indices] = deg
    return X

train_X = np.load('data/train_X.npy') 
train_X = save_degrees(train_X) 
np.save('data/train_X_degrees.npy', train_X)

Is there a more efficient way to build this data structure?

Source

2017-04-05 gwg

Is the first column always sorted? – Divakar

I believe it is sorted, but I can sort it if needed. – gwg

Did any of the posted solutions work for you? – Divakar

You can use numpy.unique.

Suppose your input data is in the array data:

In [245]: data 
Out[245]: 
array([[     0,      1,      3],
       [     0,     13,      1],
       [     0,     37,      1],
       [     0,     51,      1],
       [     0, 438481,      1],
       [     1,      0,      3],
       [     1,      4,    354],
       [     1,     10,   2602],
       [     1,     11,   2689],
       [     1,     12,      1],
       [     1,     18,    345],
       [     1,     19,    311],
       [     1,     23,      1],
       [     1,     24,    366],
       [     2,     10,      1],
       [     2,     13,      3],
       [     2,     99,      5],
       [     3,     25,     13],
       [     3,     99,     15]])

Find the unique values in the first column, together with the "inverse" array and the number of occurrences of each unique value:

In [246]: nodes, inv, counts = np.unique(data[:,0], return_inverse=True, return_counts=True)

Your out-degree column is then counts[inv]:

In [247]: out_degrees = counts[inv] 

In [248]: out_degrees 
Out[248]: array([5, 5, 5, 5, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 2, 2])

This assumes that a (source_node, target_node) pair does not appear more than once in the data array.
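To put the pieces together, here is a minimal sketch that wraps the steps above in a function and appends the result as a new column. The helper name append_out_degree and the small sample array are only illustrative; they are not part of the original data.

```python
import numpy as np

def append_out_degree(data):
    """Return a copy of `data` with the out-degree of column 0 appended."""
    # inv[i] is the position of data[i, 0] in `nodes`;
    # counts[j] is the number of rows whose source node is nodes[j].
    nodes, inv, counts = np.unique(
        data[:, 0], return_inverse=True, return_counts=True
    )
    # Indexing counts with inv gives one count per row, in the original row order.
    return np.column_stack((data, counts[inv]))

# Small illustrative input; rows are deliberately not sorted by source node.
data = np.array([[0, 1, 3],
                 [0, 13, 1],
                 [1, 0, 3],
                 [0, 37, 1],
                 [2, 10, 1],
                 [1, 4, 354]])

result = append_out_degree(data)
print(result)
```

Because counts[inv] is aligned with the input rows, no sorting is needed and the row order is preserved. If duplicate (source_node, target_node) pairs can occur, the value is a row count rather than a true out-degree, so the duplicates would have to be dropped first.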

Source

2017-04-05 20:59:41

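You can try this. When you have many distinct nodes, X[:, 0] == node_id is usually the expensive part, because it scans the whole column once per unique node. Instead, you can sort the data by the first column and then build the new column by repeating each count: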

train_X = train_X[train_X[:, 0].argsort()] 
_, counts = np.unique(train_X[:,0], return_counts=True) 
np.hstack((train_X, np.repeat(counts, counts)[:, None])) 

# array([[     0,      1,      3,      5],
#        [     0,     13,      1,      5],
#        [     0,     37,      1,      5],
#        [     0,     51,      1,      5],
#        [     0, 438481,      1,      5],
#        [     1,      0,      3,      9],
#        [     1,      4,    354,      9],
#        [     1,     10,   2602,      9],
#        [     1,     11,   2689,      9],
#        [     1,     12,      1,      9],
#        [     1,     18,    345,      9],
#        [     1,     19,    311,      9],
#        [     1,     23,      1,      9],
#        [     1,     24,    366,      9]])
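In case it is not obvious why np.repeat(counts, counts) lines up with the sorted rows, here is a small step-by-step sketch on a made-up unsorted array. The stable sort is my own choice here (the snippet above uses the default sort); it only serves to keep rows with the same source node in their original relative order.

```python
import numpy as np

# Made-up unsorted input: source node, target node, weight.
X = np.array([[1, 18, 345],
              [0, 13, 1],
              [2, 99, 5],
              [0, 1, 3],
              [1, 23, 1],
              [0, 37, 1]])

# After sorting by the first column, rows with the same source node are contiguous.
order = X[:, 0].argsort(kind="stable")
X_sorted = X[order]

# np.unique returns the counts in ascending node order, which matches
# the order of the contiguous groups in X_sorted.
_, counts = np.unique(X_sorted[:, 0], return_counts=True)

# Repeating each count "count" times yields exactly one entry per sorted row.
degree_col = np.repeat(counts, counts)

X_with_deg = np.hstack((X_sorted, degree_col[:, None]))
print(X_with_deg)
```

Note that the result is in sorted order, not in the original row order.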

Or you can use pandas groupby:

import pandas as pd 
pd.DataFrame(train_X).pipe(lambda x: x.assign(size = x.groupby([0])[0].transform('size'))).values 

# array([[     0,      1,      3,      5],
#        [     0,     13,      1,      5],
#        [     0,     37,      1,      5],
#        [     0,     51,      1,      5],
#        [     0, 438481,      1,      5],
#        [     1,      0,      3,      9],
#        [     1,      4,    354,      9],
#        [     1,     10,   2602,      9],
#        [     1,     11,   2689,      9],
#        [     1,     12,      1,      9],
#        [     1,     18,    345,      9],
#        [     1,     19,    311,      9],
#        [     1,     23,      1,      9],
#        [     1,     24,    366,      9]])

Source

2017-04-05 20:46:28 Psidom

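np.unique indeed does a good job here, as explained in some of the other answers.

You may want to have a look at numpy_indexed (disclaimer: I am its author). It can do the same thing with the same efficiency, but it also supports many other features that tend to be very useful when working with graphs, or with sparse/jagged data structures in general.

It also has a clean one-line solution dedicated to this problem: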

import numpy_indexed as npi 
X = np.column_stack((X, npi.multiplicity(X[:, 0])))

Source

2017-04-05 21:05:46

Here's a vectorized approach with a focus on performance -

def argsort_unique(idx): 
    # Original idea : http://stackoverflow.com/a/41242285/3293881 
    n = idx.size 
    sidx = np.empty(n,dtype=int) 
    sidx[idx] = np.arange(n) 
    return sidx 

def count_and_append(a): # For sorted arrays 
    a0 = a[:,0] 
    sf0 = np.flatnonzero(a0[1:] != a0[:-1])+1 
    shift_idx = np.concatenate(([0] , sf0, [a0.size])) 
    c = shift_idx[1:] - shift_idx[:-1] 
    out_col = np.repeat(c,c) 
    return np.column_stack((a, out_col)) 

def count_and_append_generic(a): # For generic (not necessarily sorted) arrays 
    sidx = a[:,0].argsort() 
    b = a[sidx] 
    return count_and_append(b)[argsort_unique(sidx)]

Sample run -

In [70]: a # Not sorted case 
Out[70]: 
array([[     1,     18,    345],
       [     1,     23,      1],
       [     0,     13,      1],
       [     0,     37,      1],
       [     2,     99,      5],
       [     0,      1,      3],
       [     2,     13,      3],
       [     1,      4,    354],
       [     1,     24,    366],
       [     0, 438481,      1],
       [     1,     12,      1],
       [     1,     11,   2689],
       [     1,     19,    311],
       [     2,     10,      1],
       [     3,     99,     15],
       [     0,     51,      1],
       [     3,     25,     13],
       [     1,      0,      3],
       [     1,     10,   2602]])

In [71]: np.allclose(count_and_append_generic(a), save_degrees(a)) 
Out[71]: True

If the input array is already sorted by the first column, simply use count_and_append(a).
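For anyone who wants to convince themselves that the generic version restores the original row order, the sketch below restates the functions above with extra comments and cross-checks them against the np.unique approach. The random test array (seed, value range and size) is an arbitrary choice for illustration, not data from the question.

```python
import numpy as np

def argsort_unique(idx):
    # Inverse of a permutation: out[idx[i]] = i, built without a second sort.
    n = idx.size
    out = np.empty(n, dtype=int)
    out[idx] = np.arange(n)
    return out

def count_and_append(a):
    # Requires `a` to be sorted by its first column.
    a0 = a[:, 0]
    # Positions where a new source node starts.
    starts = np.flatnonzero(a0[1:] != a0[:-1]) + 1
    bounds = np.concatenate(([0], starts, [a0.size]))
    # Run lengths between consecutive boundaries are the per-node counts.
    c = bounds[1:] - bounds[:-1]
    return np.column_stack((a, np.repeat(c, c)))

def count_and_append_generic(a):
    # Sort, count on the sorted rows, then undo the sort.
    sidx = a[:, 0].argsort()
    return count_and_append(a[sidx])[argsort_unique(sidx)]

# Arbitrary unsorted test data.
rng = np.random.default_rng(0)
a = rng.integers(0, 50, size=(1000, 3))

# Reference result using np.unique with the inverse index.
_, inv, counts = np.unique(a[:, 0], return_inverse=True, return_counts=True)
expected = np.column_stack((a, counts[inv]))

got = count_and_append_generic(a)
print(np.array_equal(got, expected))
```

The key step is argsort_unique: indexing the sorted result with the inverse permutation puts every row back where it started, so the first three columns of the output match the input exactly.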

Source

2017-04-05 21:46:51 Divakar

This works and is fast too. The accepted answer requires a bit less code, but for anyone reading this answer, it is a complete solution. – gwg
