矢量化是比較容易,如果你使用的矩陣乘法計算交點集內,然後將規則|union(a, b)| == |a| + |b| - |intersection(a, b)|
確定工會:
# Not actually necessary for sparse matrices, but it is for
# dense matrices and ndarrays, if X.dtype is integer.
from __future__ import division
def pairwise_jaccard(X):
"""Computes the Jaccard distance between the rows of `X`.
"""
X = X.astype(bool).astype(int)
intrsct = X.dot(X.T)
row_sums = intrsct.diagonal()
unions = row_sums[:,None] + row_sums - intrsct
dist = 1.0 - intrsct/unions
return dist
注投爲bool然後int,因爲X
的dtype必須足夠大以累積最大行總和的兩倍,並且X
的條目必須爲零或一。這個代碼的缺點是RAM很重,因爲unions
和dists
是密集矩陣。
如果你只關心的距離小於一些截止epsilon
,可以將代碼調整爲稀疏矩陣:
from scipy.sparse import csr_matrix
def pairwise_jaccard_sparse(csr, epsilon):
"""Computes the Jaccard distance between the rows of `csr`,
smaller than the cut-off distance `epsilon`.
"""
assert(0 < epsilon < 1)
csr = csr_matrix(csr).astype(bool).astype(int)
csr_rownnz = csr.getnnz(axis=1)
intrsct = csr.dot(csr.T)
nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1))
unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data
dists = 1.0 - intrsct.data/unions
mask = (dists > 0) & (dists <= epsilon)
data = dists[mask]
indices = intrsct.indices[mask]
rownnz = np.add.reduceat(mask, intrsct.indptr[:-1])
indptr = np.r_[0, np.cumsum(rownnz)]
out = csr_matrix((data, indices, indptr), intrsct.shape)
return out
如果這仍然需要以多少RAM你可以嘗試向量化了一個維和Python循環。
您能詳細說明導致運行時的實現細節嗎? – Ryan
啊,對不起。已添加代碼。 –
我會首先使用(推測)優化的jaccard函數,例如http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jaccard.html – Ryan