2017-04-18 55 views

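BigQuery SQL: Identify time ranges of a minimum length that meet a condition

A simplified version of my problem: I have a table with the following fields: id, timestamp, and a numeric variable (speed). I need to identify time periods (start and end timestamps) where the average speed is below a threshold (e.g. 2), and where the period (end timestamp minus start timestamp) is at least a minimum duration (e.g. 5 hours or more). Basically, I need to compute the average over an initial 5-hour window. If the average is below the threshold, I keep the start timestamp, advance end_timestamp by one row, and recompute the average. If the new average is still below the threshold, I advance again, growing the time window. If the new average is above the threshold, I report the previous end_timestamp as the end_timestamp of this window, start a new start_timestamp, and compute a new average over another 5 hours. The final product is a table containing a set of start_timestamps and end_timestamps (plus the computed duration) where the average speed is below 2 and the time between start and end is at least 5 hours.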
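For readers who find the procedure above easier to follow as code, here is a minimal sketch in plain JavaScript (not BigQuery SQL). The function name `findSlowPeriods`, the `{ts, speed}` row shape, and the output field names are hypothetical. The description does not say what happens when the initial 5-hour window is already above the threshold, so the sketch assumes the start simply moves forward by one row.

```javascript
// Hypothetical helper, not from the original post. Rows are {ts, speed}
// objects for a single id, sorted by ts (Unix seconds).
function findSlowPeriods(rows, minSeconds, threshold) {
  const periods = [];
  let start = 0;
  while (start < rows.length) {
    // Build the initial window: extend until it spans at least minSeconds.
    let end = start;
    let sum = rows[start].speed;
    while (end + 1 < rows.length && rows[end].ts - rows[start].ts < minSeconds) {
      end++;
      sum += rows[end].speed;
    }
    if (rows[end].ts - rows[start].ts < minSeconds) break; // not enough data left
    if (sum / (end - start + 1) >= threshold) {
      start++; // assumption: slide the start forward by one row and retry
      continue;
    }
    // Grow the window one row at a time while the average stays below the threshold.
    while (end + 1 < rows.length &&
           (sum + rows[end + 1].speed) / (end - start + 2) < threshold) {
      end++;
      sum += rows[end].speed;
    }
    periods.push({
      start_ts: rows[start].ts,
      end_ts: rows[end].ts,
      average_speed: sum / (end - start + 1),
      duration_hrs: (rows[end].ts - rows[start].ts) / 3600,
    });
    start = end + 1; // the next window starts after the reported one
  }
  return periods;
}

// Made-up hourly data: speed 1 everywhere except one spike of 10 at hour 7.
const sampleRows = Array.from({ length: 14 }, (_, i) => ({
  ts: i * 3600,
  speed: i === 7 ? 10 : 1,
}));
const slowPeriods = findSlowPeriods(sampleRows, 5 * 3600, 2);
console.log(slowPeriods);
// Two periods: 6 hours starting at ts 0, and 5 hours starting at ts 28800.
```

The sketch rescans from each candidate start, which is fine for illustration but would be slow on large inputs.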

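I am using Google BigQuery. Below is the general structure I have so far, but it does not seem to work the way I intended. First, it only tests and reports the speed threshold for the initial 5-hour window, even as the window grows. Second, it does not seem to grow the window properly: windows are rarely longer than 5 hours, even though, looking at my data, they should in some cases be twice that. I am hoping someone who has tried to develop a similar analysis can shed light on where I am going wrong.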

SELECT 
*, 
LEAD(start_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS 
next_start_timestamp, 
LEAD(end_timestamp) OVER (PARTITION BY id ORDER BY timestamp) AS 
next_end_timestamp 
FROM (
SELECT 
*, 
IF(last_timestamp IS NULL 
    OR timestamp - last_timestamp > 1000000*60*60*5, TRUE, FALSE) AS start_timestamp, #1000000*60*60*5 = 5 hours in microseconds 
IF(next_timestamp IS NULL 
    OR next_timestamp - timestamp > 1000000*60*60*5, TRUE, FALSE) AS end_timestamp #1000000*60*60*5 = 5 hours in microseconds 
FROM (
SELECT 
    *, 
    LAG(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) last_timestamp, 
    LEAD(timestamp,1) OVER (PARTITION BY id ORDER BY timestamp) next_timestamp, 
FROM (
    SELECT 
    *, 
    AVG(speed) OVER (PARTITION BY id ORDER BY timestamp RANGE BETWEEN 5 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS avg_speed_last_period, 
    FROM (
     SELECT 
     id, 
     timestamp, 
     speed 
     FROM 
     [dataset.table1])) 
WHERE 
    avg_speed_last_period < 2 
ORDER BY 
    id, 
    timestamp) 
HAVING 
    start_timestamp 
    OR end_timestamp) 

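Edit: Below is a link to some sample_data. Given this data, and the requirement of an average speed below 2 for at least 5 hours, the first rows of the output table would hopefully be:

ID   start_event              end_event                average_speed  duration_hrs
203  2015-01-08 17:40:06 UTC  2015-01-09 07:09:35 UTC  0.7802         13.491
203  2015-01-10 03:43:56 UTC  2015-01-10 08:48:57 UTC  1.452          5.083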
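
Comments

Sample data and the expected output would really help explain this.

Thanks... added sample data and sample output.

You still leave some "holes" open. Please add a second row to the expected output; at least for me, that would close some of them.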

Answers

1

From your CSV, I assume the following schema:

[image: table schema]

with the following data in it:

[image: preview of the table data]

With this in mind, below is working code for BigQuery Standard SQL. It produces the output you are expecting:

id   start_event              end_event                average_speed  duration_hrs
203  2015-01-08 17:40:00 UTC  2015-01-09 07:09:00 UTC  0.78           13.48
203  2015-01-10 03:43:00 UTC  2015-01-10 08:48:00 UTC  1.45           5.08

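#standardSQL
CREATE TEMPORARY FUNCTION IdentifyTimeRanges(
    items ARRAY<STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>>,
    min_length INT64, threshold FLOAT64, max_speed FLOAT64
)
RETURNS ARRAY<STRUCT<start_event TIMESTAMP, end_event TIMESTAMP, average_speed FLOAT64, duration_hrs FLOAT64>>
LANGUAGE js AS """
    var result = [];
    var initial = 0;                    // index of the first row of the current window
    var candidate = items[initial].ts;  // start time of the current window, in seconds
    var len = 0;                        // number of rows in the current window
    var sum = 0;                        // sum of the speeds in the current window
    for (var i = 0; i < items.length; i++) {
        len++;
        sum += items[i].speed;

        // The window is still shorter than min_length: keep growing it,
        // but restart after any single row faster than max_speed.
        if (items[i].ts - candidate < min_length) {
            if (items[i].speed > max_speed) {
                if (i + 1 >= items.length) break;  // no rows left to start a new window
                initial = i + 1;
                candidate = items[initial].ts;
                len = 0;
                sum = 0;
            }
            continue;
        }

        // Row i pushes the average above the threshold (or is too fast):
        // close the window at row i - 1 and report it if it qualifies.
        if (sum / len > threshold || items[i].speed > max_speed) {
            var avg_speed = (sum - items[i].speed) / (len - 1);
            if (avg_speed <= threshold && items[i - 1].ts - items[initial].ts >= min_length) {
                var o = {};
                o.start_event = items[initial].datetime;
                o.average_speed = Number(avg_speed.toFixed(3));
                o.end_event = items[i - 1].datetime;
                // ts is in seconds, so dividing by 60 twice gives hours
                o.duration_hrs = Number(((items[i - 1].ts - items[initial].ts) / 60 / 60).toFixed(3));
                result.push(o);
            }
            // Start the next window at row i.
            initial = i;
            candidate = items[initial].ts;
            len = 1;
            sum = items[initial].speed;
        }
    }

    return result;
""";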

WITH data AS (
    SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed 
    FROM `yourTable` 
), compact_data AS (
    SELECT id, ARRAY_AGG(STRUCT<ts INT64, speed FLOAT64, datetime TIMESTAMP>(UNIX_SECONDS(datetime), speed, datetime) ORDER BY UNIX_SECONDS(datetime)) AS points 
    FROM data 
    GROUP BY id 
) 
SELECT 
    id, start_event, end_event, average_speed, duration_hrs 
FROM compact_data, UNNEST(IdentifyTimeRanges(points, 5*60*60, 2, 3.1)) AS segment 
ORDER BY id, start_event 
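
The `compact_data` step in the query above collects all rows of each id into a single time-ordered array, which is what the function receives as `items`. As a rough illustration only, that grouping could be imitated locally in plain JavaScript as follows. The name `compactData` and the sample rows are hypothetical and made up for this sketch; it assumes each row already carries a JavaScript `Date` in `datetime`.

```javascript
// Hypothetical local stand-in for the compact_data subquery: group rows by id
// and sort each group by Unix seconds, similar to ARRAY_AGG(... ORDER BY ...).
function compactData(rows) {
  const byId = new Map();
  for (const row of rows) {
    const ts = Math.floor(row.datetime.getTime() / 1000); // whole seconds since the epoch
    if (!byId.has(row.id)) byId.set(row.id, []);
    byId.get(row.id).push({ ts, speed: row.speed, datetime: row.datetime });
  }
  // Order every group by time, oldest first.
  for (const points of byId.values()) points.sort((a, b) => a.ts - b.ts);
  return byId;
}

// Made-up rows, deliberately out of order.
const sampleInput = [
  { id: 203, datetime: new Date("2015-01-08T17:41:00Z"), speed: 0.5 },
  { id: 203, datetime: new Date("2015-01-08T17:40:00Z"), speed: 0.9 },
  { id: 204, datetime: new Date("2015-01-08T17:40:00Z"), speed: 3.0 },
];
console.log(compactData(sampleInput).get(203).map((p) => p.speed));
```

Each array in the resulting map has the same `ts`, `speed`, `datetime` shape as the STRUCT elements passed to the function.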

Please note: this code uses User-Defined Functions, which means you may run into some limits, quotas, and a cost hit, depending on the size of your data.

Also keep in mind: if the data type of the datetime field is not STRING, you only need to slightly adjust the data subquery. The rest should stay as is!

For example, if datetime is of the TIMESTAMP data type, you just need to replace

SELECT id, PARSE_TIMESTAMP('%m/%d/%y %H:%M', datetime) AS datetime, speed 
    FROM `yourTable` 

with

SELECT id, datetime, speed 
    FROM `yourTable` 

Hope you like it :O)