「跳的」,因爲1)仍然沒有看到任何答案2)看到作者的與BigQuery標籤相關的問題
所以,從理論上講,下面的查詢將讓你的任務(使用的BigQuery對於下面的實施例-samples.reddit.full表):
大量查詢傳統的SQL:
SELECT
a.author AS author1,
b.author AS author2,
SUM(a.subr = b.subr) AS count_intersection,
EXACT_COUNT_DISTINCT(a.subr) + EXACT_COUNT_DISTINCT(b.subr) - SUM(a.subr = b.subr) AS count_union
FROM
(SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS a
CROSS JOIN
(SELECT author, subr FROM [bigquery-samples:reddit.full] GROUP BY 1, 2) AS b
WHERE a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
大量查詢標準SQL:
WITH subrs AS (
SELECT author, subr
FROM `bigquery-samples.reddit.full`
GROUP BY 1, 2
)
SELECT
a.author AS author1,
b.author AS author2,
COUNTIF(a.subr = b.subr) AS count_intersection,
COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a
JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
如果你試圖運行它們,你最有可能得到以下錯誤
發生內部錯誤,請求無法完成
的原因是因爲每個在這兩個查詢中,由於連接而產生大約1萬億行(參見下面的統計數據)。 有很多方法可以解決這個問題 - 下面提出的方法是通過調整需求來解決這個問題。 你是否真的需要參與算法光作者,讓我們說只有一個或兩個subreddits? 或者 - 你真的想找到那些在特定子目錄中只有很少評論的人之間的相似性嗎?
下面參見,如何引入額外的限制以上述查詢執行幫助(注:lines
是每SUBR每個作者的條目分鐘極限計數和subrs
是每個用戶SUBR的數目的最小限制)
下面是版本,實際上產生的結果W/O任何類型的故障:
標準SQL
WITH authors AS (
SELECT author FROM (
SELECT author, COUNT(1) AS subrs FROM (
SELECT author, subr, COUNT(1) AS lines
FROM `bigquery-samples.reddit.full`
GROUP BY 1, 2
HAVING lines > 1
)
GROUP BY author
HAVING subrs > 3
)
),
subrs AS (
SELECT author, subr
FROM `bigquery-samples.reddit.full`
WHERE author IN (SELECT author FROM authors)
GROUP BY 1, 2
)
SELECT
a.author AS author1,
b.author AS author2,
COUNTIF(a.subr = b.subr) AS count_intersection,
COUNT(DISTINCT a.subr) + COUNT(DISTINCT b.subr) - COUNTIF(a.subr = b.subr) AS count_union
FROM subrs AS a JOIN subrs AS b
ON a.author < b.author
GROUP BY 1, 2
ORDER BY count_intersection DESC
LIMIT 100
以類似的方式,你可以調整傳統的SQL,使其工作
這可能不是最好的方式 - 但至少給出了這樣的任務,一些希望能夠輕鬆地內的BigQuery運行,W/O前往其它解決方法