2017-07-07 56 views
2

Efficient way to run a function on the columns of a pandas DataFrame?

I want to run a function over the columns of a pandas DataFrame. The corpus is a pd.DataFrame:

import pandas as pd 
import numpy as np 
from scipy.spatial.distance import cosine 

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]],index=["stark","groß","schwach","klein", "dick"],columns=["d1", "d2", "d3","d4","d5","d6"]) 

I also have a query, which is a pandas Series:

query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"]) 

Now I want to run the cosine function on the query and each column of the corpus:

for column in corpus:
    # scipy's cosine() returns a distance, so 1 - distance is the similarity
    print("Similarity of document", column, "and query:\n", 1 - cosine(query, corpus[column]))

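Is there a better way to run the cosine function over the columns? Perhaps some method that takes the columns and runs the function on each one. I would like to avoid the for loop.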

+0

The cosine function is simply imported from scipy.spatial.distance: scipy.spatial.distance.cosine(u, v), where u and v are arrays. (cosine computes the distance between two 1-D arrays.) – BenVes

+0

Thank you, you are right. I edited my question. :) – BenVes

Answers

2

You can use the 'cosine' metric of scipy.spatial.distance.cdist for a vectorized solution, like so -

from scipy.spatial.distance import cdist 

out = 1-cdist(query.values[None], corpus.values.T, 'cosine') 

Sample run -

In [192]: corpus 
Out[192]: 
     d1 d2 d3 d4 d5 d6 
stark  3 1 1 1 1 60 
groß  2 2 0 2 0 20 
schwach 0 2 1 1 0 0 
klein  0 0 2 1 0 1 
dick  0 0 0 0 1 0 

In [193]: query 
Out[193]: 
stark  1 
groß  1 
schwach 0 
klein  0 
dick  0 
dtype: int64 

In [194]: from scipy.spatial.distance import cosine 

In [195]: for column in corpus: 
    ...:  print(1-cosine(query, corpus[column])) 
    ...:  
0.980580675691 
0.707106781187 
0.288675134595 
0.801783725737 
0.5 
0.89431540856 

In [196]: 1-cdist(query.values[None], corpus.values.T, 'cosine') 
Out[196]: array([[ 0.98058, 0.70711, 0.28868, 0.80178, 0.5 , 0.89432]]) 
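
The array returned by cdist has shape (1, 6) and no longer carries the document labels. A minimal sketch, assuming the corpus and query from the question, of how the result could be flattened and relabelled (the name similarity is only illustrative):

```python
import pandas as pd
from scipy.spatial.distance import cdist

corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60],
     [2, 2, 0, 2, 0, 20],
     [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]],
    index=["stark", "groß", "schwach", "klein", "dick"],
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=corpus.index)

# cdist expects two 2-D arrays whose rows are the vectors to compare:
#   query.values[None] has shape (1, 5) -> a single query vector
#   corpus.values.T    has shape (6, 5) -> one row per document
dist = cdist(query.values[None], corpus.values.T, "cosine")  # shape (1, 6)

# Flatten to 1-D and reattach the document labels (illustrative name).
similarity = pd.Series(1 - dist.ravel(), index=corpus.columns)
print(similarity)
```

This gives the same labelled Series shape that a column-wise apply would produce, while keeping the vectorized computation.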

Runtime test -

In [225]: corpus = pd.DataFrame(np.random.rand(100,10000)) 

In [226]: query = pd.Series(np.random.rand(100)) 

# @C.Square's apply based soln 
In [227]: %timeit corpus.apply(lambda x:1-cosine(query, x), axis=0) 
1 loop, best of 3: 352 ms per loop 

# Proposed in this post using cdist() 
In [228]: %timeit 1-cdist(query.values[None], corpus.values.T, 'cosine') 
100 loops, best of 3: 3.2 ms per loop 
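
The %timeit magic only works inside IPython. As a rough sketch of the same comparison in a plain script, assuming the same random shapes as above (the seed and the helper names with_apply and with_cdist are arbitrary choices, and timings will differ from machine to machine):

```python
import timeit

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, cosine

rng = np.random.default_rng(0)  # arbitrary fixed seed for repeatability
corpus = pd.DataFrame(rng.random((100, 10000)))
query = pd.Series(rng.random(100))

def with_apply():
    # One Python-level call to cosine() per column.
    return corpus.apply(lambda x: 1 - cosine(query, x), axis=0)

def with_cdist():
    # A single vectorized call for all columns.
    return 1 - cdist(query.values[None], corpus.values.T, "cosine")

# Check that both approaches agree before comparing their speed.
assert np.allclose(with_apply().values, with_cdist().ravel())

for fn in (with_apply, with_cdist):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms")
```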
0

Using apply is a neat, readable, and fast way to do this kind of job:

import pandas as pd 
from scipy.spatial.distance import cosine 

corpus = pd.DataFrame([[3,1,1,1,1,60],[2,2,0,2,0,20], [0,2,1,1,0,0], [0,0,2,1,0,1],[0,0,0,0,1,0]], index=["stark","groß","schwach","klein", "dick"], columns=["d1", "d2", "d3","d4","d5","d6"]) 
query = pd.Series([1,1,0,0,0], index=["stark","groß","schwach","klein", "dick"]) 

corpus.apply(lambda x:1-cosine(query, x), # Apply your function 
      axis=0)      # For each column 

# d1 0.980581 
# d2 0.707107 
# d3 0.288675 
# d4 0.801784 
# d5 0.500000 
# d6 0.894315 
# dtype: float64 
1

You can also use the definition of cosine and implement it yourself
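
For reference, the definition in question is the standard cosine similarity. Writing q for the query vector and c_j for the j-th column of the corpus (symbol names chosen here for illustration, with i running over the rows), it reads:

```latex
\operatorname{sim}(q, c_j)
  = \frac{q \cdot c_j}{\lVert q \rVert_2 \, \lVert c_j \rVert_2}
  = \frac{\sum_i q_i \, c_{ij}}
         {\sqrt{\sum_i q_i^2} \, \sqrt{\sum_i c_{ij}^2}}
```

scipy.spatial.distance.cosine returns the distance 1 - sim(q, c_j), which is why the other answers subtract its result from 1.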

pandas

corpus.T.dot(query)/(corpus ** 2).sum() ** .5/(query ** 2).sum() ** .5 

d1 0.980581 
d2 0.707107 
d3 0.288675 
d4 0.801784 
d5 0.500000 
d6 0.894315 
dtype: float64 

numpy

c = corpus.values 
q = query.values 

r = c.T.dot(q)/(c ** 2).sum(0) ** .5/(q ** 2).sum() ** .5 

pd.Series(r, corpus.columns) 

d1 0.980581 
d2 0.707107 
d3 0.288675 
d4 0.801784 
d5 0.500000 
d6 0.894315 
dtype: float64 

With @Divakar's suggestion
np.einsum

c = corpus.values 
q = query.values 

r = (
     np.einsum('ji,j->i', c, q)/
     np.einsum('ij,ij->j', c, c) ** .5/
     np.einsum('i,i', q, q) ** .5 
    ) 

pd.Series(r, corpus.columns) 

d1 0.980581 
d2 0.707107 
d3 0.288675 
d4 0.801784 
d5 0.500000 
d6 0.894315 
dtype: float64 
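
As a quick sanity check, the hand-written variants can be compared against cdist on the question's data. This is only a sketch, and the names via_cdist, via_pandas, via_numpy and via_einsum are illustrative:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

corpus = pd.DataFrame(
    [[3, 1, 1, 1, 1, 60],
     [2, 2, 0, 2, 0, 20],
     [0, 2, 1, 1, 0, 0],
     [0, 0, 2, 1, 0, 1],
     [0, 0, 0, 0, 1, 0]],
    index=["stark", "groß", "schwach", "klein", "dick"],
    columns=["d1", "d2", "d3", "d4", "d5", "d6"],
)
query = pd.Series([1, 1, 0, 0, 0], index=corpus.index)
c = corpus.values
q = query.values

# Reference result from scipy.
via_cdist = 1 - cdist(q[None], c.T, "cosine").ravel()

# The three hand-written variants from this answer.
via_pandas = (corpus.T.dot(query) / (corpus ** 2).sum() ** .5
              / (query ** 2).sum() ** .5).values
via_numpy = c.T.dot(q) / (c ** 2).sum(0) ** .5 / (q ** 2).sum() ** .5
via_einsum = (
    np.einsum('ji,j->i', c, q) /
    np.einsum('ij,ij->j', c, c) ** .5 /
    np.einsum('i,i', q, q) ** .5
)

for name, r in [("pandas", via_pandas), ("numpy", via_numpy), ("einsum", via_einsum)]:
    print(name, "matches cdist:", np.allclose(r, via_cdist))
```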
+1

I see there is an `einsum` for `(c ** 2).sum(0)` as well, another one! – Divakar
