2016-04-21 53 views
0

我試圖實現一個花哨的SQL查詢,但在嘗試執行連接和計數時遇到了問題。SQL REDDIT - Jaccard Similarity

我有數據很長表:

author | group | id | 

daniel | group1| 118 
adam | group2| 126 
harry | group1| 221 
daniel | group2| 323 
daniel | group2| 122 
daniel | group5| 322 
harry | group1| 222 
harry | group1| 225 

... ...

我希望我的輸出看起來像:

author1 | author2 | intersection | union 

daniel | adam | 2 | 3 
daniel | harry| 2 | 11 
adam | harry| 0 | 10 

這裏交匯定義作爲作者2共同擁有author1 &的組的數量,並且union =組數量author1 +作者 - 交集。

我認爲這樣做的正確方法是

表左JOIN B表上a.group == b.group

,但我無法弄清楚如何做合計數。

感謝enter code here

回答

1

「跳的」,因爲1)仍然沒有看到任何答案2)看到作者的與BigQuery標籤相關的問題

所以,從理論上講,下面的查詢將讓你的任務(使用的BigQuery對於下面的實施例-samples.reddit.full表):

大量查詢傳統的SQL:

SELECT 
    a.author AS author1, 
    b.author AS author2, 
    SUM(a.subr = b.subr) AS count_intersection, 
    EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union 
FROM 
    (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a 
CROSS JOIN 
    (SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b 
WHERE a.author < b.author 
GROUP BY 1, 2 
ORDER BY count_intersection DESC 
LIMIT 100 

大量查詢標準SQL:

WITH subrs AS (
    SELECT author, subr 
    FROM `bigquery-samples.reddit.full` 
    GROUP BY 1, 2 
) 
SELECT 
    a.author AS author1, 
    b.author AS author2, 
    COUNTIF(a.subr = b.subr) AS count_intersection, 
    COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union 
FROM subrs AS a 
JOIN subrs AS b 
ON a.author < b.author 
GROUP BY 1, 2 
ORDER BY count_intersection DESC 
LIMIT 100 

如果你試圖運行它們,你最有可能得到以下錯誤

發生內部錯誤,請求無法完成

的原因是因爲每個在這兩個查詢中,由於連接而產生大約1萬億行(參見下面的統計數據)。 有很多方法可以解決這個問題 - 下面提出的方法是通過調整需求來解決這個問題。 你是否真的需要參與算法光作者,讓我們說只有一個或兩個subreddits? 或者 - 你真的想找到那些在特定子目錄中只有很少評論的人之間的相似性嗎?

下面參見,如何引入額外的限制以上述查詢執行幫助(注:lines是每SUBR每個作者的條目分鐘極限計數和subrs是每個用戶SUBR的數目的最小限制)

enter image description here

下面是版本,實際上產生的結果W/O任何類型的故障:

標準SQL

WITH authors AS (
    SELECT author FROM (
    SELECT author, COUNT(1) AS subrs FROM (
     SELECT author, subr, COUNT(1) AS lines 
     FROM `bigquery-samples.reddit.full` 
     GROUP BY 1, 2 
     HAVING lines > 1 
    ) 
    GROUP BY author 
    HAVING subrs > 3 
) 
), 
subrs AS (
    SELECT author, subr 
    FROM `bigquery-samples.reddit.full` 
    WHERE author IN (SELECT author FROM authors) 
    GROUP BY 1, 2 
) 
SELECT 
    a.author AS author1, 
    b.author AS author2, 
    COUNTIF(a.subr = b.subr) AS count_intersection, 
    COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union 
FROM subrs AS a JOIN subrs AS b 
ON a.author < b.author 
GROUP BY 1, 2 
ORDER BY count_intersection DESC 
LIMIT 100 

以類似的方式,你可以調整傳統的SQL,使其工作

這可能不是最好的方式 - 但至少給出了這樣的任務,一些希望能夠輕鬆地內的BigQuery運行,W/O前往其它解決方法