2017-10-10 23 views
0

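Is there a way to query a table from the BigQuery project HTTPArchive by checking how often certain strings occur in a certain file type? How to get yes/no statistics from SQL on how often each string occurs

I am able to write a query for a single check, but how can I run the query for multiple strings at once, without sending the same query every time with just a different string check and processing the ~800 GB of table data each time?

Might getting the results as an array work somehow? I want to publish in-depth monthly statistics to the public for free, so sending these queries separately and ending up with query costs of about $2000/month is not an option for me as a student.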

SELECT matched, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio 
FROM (
    SELECT url, (LOWER(body) CONTAINS 'document.write') AS matched 
    FROM httparchive.har.2017_09_01_chrome_requests_bodies 
    WHERE url LIKE "%.js" 
) 
GROUP BY matched 

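Please note that this is just one example out of many (~50), and the pre-generated stats are not what I am looking for, because they do not contain the required information.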

Answers

1

Below is for BigQuery Standard SQL:

#standardSQL 
WITH strings AS (
    SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str 
), files AS (
    SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext 
) 
SELECT 
    ext, str, COUNT(1) total, 
    COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches, 
    ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str))/COUNT(1), 3) ratio 
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b 
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext) 
CROSS JOIN strings s 
GROUP BY ext, str 
-- ORDER BY ext, str 
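One caveat worth checking before putting real search terms into `strings`: `REGEXP_CONTAINS` treats its second argument as a regular expression, so a term such as 'document.write' from the question would also match text where the dot is any other character. The Python sketch below only illustrates that difference, using Python's `re` module as a stand-in for BigQuery's regex engine; the sample body is made up. In the query itself, a literal test such as `STRPOS(LOWER(body), str) > 0` is one possible substitute.

```python
import re

term = "document.write"
body = "var x = documentXwrite(1);"  # made-up body, not HTTP Archive data

# Unescaped: '.' is a wildcard, so this reports a match that is not literal.
regex_hit = re.search(term, body) is not None

# Literal substring test: no match, because the body has 'X' instead of '.'.
literal_hit = term in body

# Escaping the term makes the regex behave like the literal test.
escaped_hit = re.search(re.escape(term), body) is not None

print(regex_hit, literal_hit, escaped_hit)  # True False False
```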

You can test / play with the above using dummy data, as below:

#standardSQL 
WITH `httparchive.har.2017_09_01_chrome_requests_bodies` AS (
    SELECT '1234.js' AS url, 'abc=1;x=2' AS body UNION ALL 
    SELECT 'qaz.js', 'y=1;xyz=0' UNION ALL 
    SELECT 'edc.go', 's=1;xyz=2;abc=3' UNION ALL 
    SELECT 'edc.go', 's=1;xyz=4;abc=5' UNION ALL 
    SELECT 'rfv.php', 'd=1' UNION ALL 
    SELECT 'tgb.txt', '?abc=xyz' UNION ALL 
    SELECT 'yhn.php', 'like v' UNION ALL 
    SELECT 'ujm.go', 'lkjsad' UNION ALL 
    SELECT 'ujm.go', 'yhj' UNION ALL 
    SELECT 'ujm.go', 'dfgh' UNION ALL 
    SELECT 'ikl.js', 'werwer' 
), strings AS (
    SELECT LOWER(str) str FROM UNNEST(['abc', 'XYZ']) AS str 
), files AS (
    SELECT LOWER(ext) ext FROM UNNEST(['JS', 'go', 'php'])AS ext 
) 
SELECT 
    ext, str, COUNT(1) total, 
    COUNTIF(REGEXP_CONTAINS(LOWER(body), str)) matches, 
    ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(body), str))/COUNT(1), 3) ratio 
FROM `httparchive.har.2017_09_01_chrome_requests_bodies` b 
JOIN files f ON LOWER(url) LIKE CONCAT('%.', ext) 
CROSS JOIN strings s 
GROUP BY ext, str 
ORDER BY ext, str 
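
As a cross-check for the dummy data, the same counting can be reproduced outside BigQuery. The following is a minimal Python sketch, not the answer's method: `summarize` is a hypothetical helper, and it uses plain substring matching instead of `REGEXP_CONTAINS`, which gives the same result for these two terms.

```python
from collections import defaultdict

# Same dummy rows as in the query above: (url, body).
rows = [
    ("1234.js", "abc=1;x=2"),
    ("qaz.js", "y=1;xyz=0"),
    ("edc.go", "s=1;xyz=2;abc=3"),
    ("edc.go", "s=1;xyz=4;abc=5"),
    ("rfv.php", "d=1"),
    ("tgb.txt", "?abc=xyz"),
    ("yhn.php", "like v"),
    ("ujm.go", "lkjsad"),
    ("ujm.go", "yhj"),
    ("ujm.go", "dfgh"),
    ("ikl.js", "werwer"),
]
strings = [s.lower() for s in ["abc", "XYZ"]]
exts = [e.lower() for e in ["JS", "go", "php"]]


def summarize(rows, exts, strings):
    """Return {(ext, str): (total, matches, ratio)} for every ext/str pair."""
    total = defaultdict(int)
    matches = defaultdict(int)
    for url, body in rows:
        for ext in exts:
            # Mirrors: LOWER(url) LIKE CONCAT('%.', ext)
            if not url.lower().endswith("." + ext):
                continue
            # Mirrors: CROSS JOIN strings
            for s in strings:
                total[(ext, s)] += 1
                if s in body.lower():
                    matches[(ext, s)] += 1
    return {
        key: (total[key], matches[key], round(matches[key] / total[key], 3))
        for key in sorted(total)
    }


result = summarize(rows, exts, strings)
for (ext, s), (t, m, r) in result.items():
    print(ext, s, t, m, r)
# go abc 5 2 0.4
# go xyz 5 2 0.4
# js abc 3 1 0.333
# js xyz 3 1 0.333
# php abc 2 0 0.0
# php xyz 2 0 0.0
```

The `.txt` row is dropped because its extension is not in `exts`, and each remaining row is counted once per string, so the table is still scanned only once.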
1

One approach is to bring in a table with the different strings. Here is the idea:

SELECT str, matched, count(*) AS total,
       -- partition by str so that each string's matched/unmatched ratios sum to 1
       RATIO_TO_REPORT(total) OVER(PARTITION BY str) AS ratio
FROM (SELECT crb.url, s.str, (LOWER(crb.body) CONTAINS s.str) AS matched
      FROM httparchive.har.2017_09_01_chrome_requests_bodies crb CROSS JOIN
           (SELECT 'document.write' as str UNION ALL
            SELECT 'xxx' as str
           ) s
      WHERE url LIKE "%.js"
     )
GROUP BY str, matched;

You can then just add more strings to s.
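
With more than one string in `s`, each ratio is only meaningful relative to the rows for that same string. Below is a minimal Python sketch of that grouping; the rows and variable names are illustrative assumptions, not HTTP Archive data.

```python
from collections import Counter

# Made-up (url, body) rows standing in for the requests_bodies table.
rows = [
    ("a.js", "document.write('x')"),
    ("b.js", "console.log(1)"),
    ("c.js", "xxx; document.write(2)"),
    ("d.js", "nothing here"),
]
strings = ["document.write", "xxx"]

# One count per (string, matched) pair, like GROUP BY str, matched.
counts = Counter()
for url, body in rows:
    if not url.endswith(".js"):
        continue
    for s in strings:
        counts[(s, s in body.lower())] += 1

# Total rows per string, used as the denominator for that string only.
per_string_total = Counter()
for (s, matched), n in counts.items():
    per_string_total[s] += n

ratios = {key: n / per_string_total[key[0]] for key, n in counts.items()}

for (s, matched), ratio in sorted(ratios.items()):
    print(s, matched, counts[(s, matched)], ratio)
# document.write False 2 0.5
# document.write True 2 0.5
# xxx False 3 0.75
# xxx True 1 0.25
```

Dividing by the total over all output rows instead (8 here) would halve every ratio, which is one way the numbers can look wrong once a second string is added.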

+0

Thanks, but I get the error: Query Failed. Error: Field 'str' not found. – DevDavid

+0

@NhanNguyen . . . Thank you for fixing that. –

+0

I only get it to work with one string; if I add more, I get incorrect counts/ratios (plus, how can I tell which data belongs to which string?): SELECT s.str, s.str2, matched, matched2, count(*) AS total, RATIO_TO_REPORT(total) OVER() AS ratio FROM (SELECT crb.url, s.str, s.str2, (LOWER(crb.body) CONTAINS s.str) AS matched, (LOWER(crb.body) CONTAINS s.str2) AS matched2 FROM httparchive.har.2017_09_01_chrome_requests_bodies crb CROSS JOIN (SELECT 'document.write' as str, 'document.ready' as str2) s WHERE url LIKE "%.js") GROUP BY s.str, s.str2, matched, matched2; – DevDavid
