2017-02-18

SQL computation

Suppose you create a table in a database as follows:

create table data (v int, base int, w_td float); 
insert into data values (99,1,4); 
insert into data values (99,2,3); 
insert into data values (99,3,4); 
insert into data values (1234,2,5); 
insert into data values (1234,3,2);  
insert into data values (1234,4,3); 

To be clear, select * from data should output:

v |base|w_td 
-------------- 
99 |1 |4.0 
99 |2 |3.0 
99 |3 |4.0 
1234|2 |5.0 
1234|3 |2.0 
1234|4 |3.0 

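Note that because the vectors are stored in a database, we only need to store the nonzero entries. In this example there are just two vectors, $v_{99} = (4, 3, 4, 0)$ and $v_{1234} = (0, 5, 2, 3)$, both in $\mathbb{R}^4$.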

The cosine similarity of these vectors should be $\displaystyle \frac{23}{\sqrt{41 \cdot 38}} = 0.5826987807288609$.
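
For reference, the arithmetic behind that value can be worked out by hand. With the coordinates ordered by base 1 through 4, and writing $\lVert \cdot \rVert$ for the Euclidean norm (notation assumed here, not used elsewhere in the thread):

$$
\begin{aligned}
v_{99} \cdot v_{1234} &= 4 \cdot 0 + 3 \cdot 5 + 4 \cdot 2 + 0 \cdot 3 = 23,\\
\lVert v_{99} \rVert^2 &= 4^2 + 3^2 + 4^2 + 0^2 = 41,\\
\lVert v_{1234} \rVert^2 &= 0^2 + 5^2 + 2^2 + 3^2 = 38,\\
\cos(v_{99}, v_{1234}) &= \frac{23}{\sqrt{41 \cdot 38}} \approx 0.5827.
\end{aligned}
$$

Only bases 2 and 3 contribute to the inner product, since those are the only bases where both vectors have a stored (nonzero) entry.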

How can the cosine similarity be computed using almost nothing but SQL?

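I say "almost" because you will need a sqrt function, which is not always provided by basic SQL implementations; for example, it is not available in sqlite3.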

Answer

with norms as (
    select v, 
     sum(w_td * w_td) as w2 
    from data 
    group by v 
) 
select 
    x.v as ego,y.v as v,nx.w2 as x2, ny.w2 as y2, 
    sum(x.w_td * y.w_td) as innerproduct, 
    sum(x.w_td * y.w_td)/sqrt(nx.w2 * ny.w2) as cosinesimilarity 
from data as x 
join data as y 
    on (x.base=y.base) 
join norms as nx 
    on (nx.v=x.v) 
join norms as ny 
    on (ny.v=y.v) 
where x.v < y.v 
group by 1,2,3,4 
order by 6 desc 

This produces:

ego|v |x2 |y2 |innerproduct|cosinesimilarity 
-------------------------------------------------- 
99 |1234|41.0|38.0|23.0  |0.5826987807288609
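
Since the question mentions that sqlite3 may lack sqrt, here is a minimal sketch of one way to try the query there: it uses Python's standard sqlite3 module and registers math.sqrt as a SQL function before running the same query. The helper name cosine_similarities and the in-memory database are assumptions made for illustration, not part of the original answer.

```python
import math
import sqlite3

# Same query as in the answer above, unchanged.
QUERY = """
with norms as (
    select v, sum(w_td * w_td) as w2
    from data
    group by v
)
select
    x.v as ego, y.v as v, nx.w2 as x2, ny.w2 as y2,
    sum(x.w_td * y.w_td) as innerproduct,
    sum(x.w_td * y.w_td) / sqrt(nx.w2 * ny.w2) as cosinesimilarity
from data as x
join data as y on (x.base = y.base)
join norms as nx on (nx.v = x.v)
join norms as ny on (ny.v = y.v)
where x.v < y.v
group by 1, 2, 3, 4
order by 6 desc
"""

# Sample rows from the question: (v, base, w_td).
ROWS = [
    (99, 1, 4), (99, 2, 3), (99, 3, 4),
    (1234, 2, 5), (1234, 3, 2), (1234, 4, 3),
]


def cosine_similarities():
    """Hypothetical helper: load the sample data and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Expose Python's math.sqrt to SQL under the name "sqrt",
        # in case the SQLite build has no built-in sqrt.
        conn.create_function("sqrt", 1, math.sqrt)
        conn.execute("create table data (v int, base int, w_td float)")
        conn.executemany("insert into data values (?, ?, ?)", ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in cosine_similarities():
        print(row)
```

With the sample data this should return a single row for the pair (99, 1234), with a similarity matching the value shown above. Registering the function in the host language keeps the computation itself in SQL, which seems in the spirit of "almost only SQL".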