2015-09-27 158 views
3

我有一個大的稀疏矩陣 - 使用scipy的sparse.csr_matrix。這些值是二進制的。對於每一行,我需要計算同一矩陣中每行的Jaccard距離。什麼是最有效的方式來做到這一點?即使對於10.000 x 10.000矩陣,我的運行時也需要幾分鐘才能完成。在稀疏矩陣上計算Jaccard距離

目前的解決方案:

def jaccard(a, b): 
    intersection = float(len(set(a) & set(b))) 
    union = float(len(set(a) | set(b))) 
    return 1.0 - (intersection/union) 

def regions(csr, p, epsilon): 
    neighbors = [] 
    for index in range(len(csr.indptr)-1): 
     if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon: 
      neighbors.append(index) 
    return neighbors 
csr = scipy.sparse.csr_matrix("file") 
regions(csr, 0.51) #this is called for every row 
+0

您能詳細說明導致運行時的實現細節嗎? – Ryan

+0

啊,對不起。已添加代碼。 –

+1

我會首先使用(推測)優化的jaccard函數,例如http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html – Ryan

回答

7

矢量化是比較容易,如果你使用的矩陣乘法計算交點集內,然後將規則|union(a, b)| == |a| + |b| - |intersection(a, b)|確定工會:

# Not actually necessary for sparse matrices, but it is for 
# dense matrices and ndarrays, if X.dtype is integer. 
from __future__ import division 

def pairwise_jaccard(X): 
    """Computes the Jaccard distance between the rows of `X`. 
    """ 
    X = X.astype(bool).astype(int) 

    intrsct = X.dot(X.T) 
    row_sums = intrsct.diagonal() 
    unions = row_sums[:,None] + row_sums - intrsct 
    dist = 1.0 - intrsct/unions 
    return dist 

注投爲bool然後int,因爲X的dtype必須足夠大以累積最大行總和的兩倍,並且X的條目必須爲零或一。這個代碼的缺點是RAM很重,因爲unionsdists是密集矩陣。

如果你只關心的距離小於一些截止epsilon,可以將代碼調整爲稀疏矩陣:

from scipy.sparse import csr_matrix 

def pairwise_jaccard_sparse(csr, epsilon): 
    """Computes the Jaccard distance between the rows of `csr`, 
    smaller than the cut-off distance `epsilon`. 
    """ 
    assert(0 < epsilon < 1) 
    csr = csr_matrix(csr).astype(bool).astype(int) 

    csr_rownnz = csr.getnnz(axis=1) 
    intrsct = csr.dot(csr.T) 

    nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1)) 
    unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data 
    dists = 1.0 - intrsct.data/unions 

    mask = (dists > 0) & (dists <= epsilon) 
    data = dists[mask] 
    indices = intrsct.indices[mask] 

    rownnz = np.add.reduceat(mask, intrsct.indptr[:-1]) 
    indptr = np.r_[0, np.cumsum(rownnz)] 

    out = csr_matrix((data, indices, indptr), intrsct.shape) 
    return out 

如果這仍然需要以多少RAM你可以嘗試向量化了一個維和Python循環。

+0

對於稀疏方法,因爲你正在鑄造CSR矩陣作爲整數,當你做分部時,你只會得到0或1分數。我想你可以把它作爲float來代替。 – Laggel