2017-03-16 54 views
1

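Find the time each Id spent at each location

I want to know how much time each Id spent at its starting location.

For example, in the dataset below, the starting Geohash for Id 286 is "abcdef". Geohash "abcdef" appears in 3 rows for Id 286. So the total time spent by Id 286 is (2017-02-13 12:33:02.063 UTC - 2017-02-13 12:24:36 UTC) plus (2017-02-13 12:34:29 UTC - 2017-02-13 12:33:08 UTC).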

       Id   DateTime                     Latitude   Longitude   Geohash
    0  286  2017-02-13 12:24:36 UTC      40.769230  -73.01205   abcdef
    1  286  2017-02-13 12:33:02.063 UTC  40.769230  -73.01202   abcdef
    2  286  2017-02-13 12:33:05.063 UTC  40.769230  -73.01202   cvzvvv
    3  286  2017-02-13 12:33:08 UTC      40.769280  -73.01212   abcdef
    4  286  2017-02-13 12:34:29 UTC      40.769306  -73.01207   hsffds
    5  368  2017-02-13 00:23:07.063 UTC  33.392820  -111.8262   weruio
    6  141  2017-02-13 00:00:41 UTC      33.287117  -111.84150  oqruqq

I am not sure whether pandas DataFrames have a function that can do this.

Any help would be greatly appreciated!

Answers

1

Below is a solution in BigQuery Standard SQL:

#standardSQL
SELECT
    Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
    SELECT
        Id, Geohash, DateTime,
        TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
        FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
    FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash

You can test it with the dummy data from your example:

#standardSQL 
WITH yourTable AS (
    SELECT 286 AS Id, TIMESTAMP '2017-02-13 12:24:36 UTC' AS DateTime, 40.769230 AS Latitude, -73.01205 AS Longitude, 'abcdef' AS Geohash UNION ALL 
    SELECT 286, TIMESTAMP '2017-02-13 12:33:02.063 UTC', 40.769230, -73.01202, 'abcdef' UNION ALL 
    SELECT 286, TIMESTAMP '2017-02-13 12:33:05.063 UTC', 40.769230, -73.01202, 'cvzvvv' UNION ALL 
    SELECT 286, TIMESTAMP '2017-02-13 12:33:08 UTC', 40.769280, -73.01212, 'abcdef' UNION ALL 
    SELECT 286, TIMESTAMP '2017-02-13 12:34:29 UTC', 40.769306, -73.01207, 'hsffds' UNION ALL 
    SELECT 368, TIMESTAMP '2017-02-13 00:23:07.063 UTC', 33.392820, -111.8262, 'weruio' UNION ALL 
    SELECT 141, TIMESTAMP '2017-02-13 00:00:41 UTC', 33.287117, -111.84150, 'oqruqq' 
) 
SELECT
    Id, Geohash, MIN(DateTime) AS StartDateTime, SUM(TimeSpent) AS TimeSpent
FROM (
    SELECT
        Id, Geohash, DateTime,
        TIMESTAMP_DIFF(LEAD(DateTime) OVER(PARTITION BY Id ORDER BY DateTime), DateTime, SECOND) AS TimeSpent,
        FIRST_VALUE(Geohash) OVER(PARTITION BY Id ORDER BY DateTime) AS FirstGeohash
    FROM yourTable
)
WHERE Geohash = FirstGeohash
GROUP BY Id, Geohash

The result is as follows:

    Id   Geohash  StartDateTime            TimeSpent
    286  abcdef   2017-02-13 12:24:36 UTC  590
    368  weruio   2017-02-13 00:23:07 UTC  null
    141  oqruqq   2017-02-13 00:00:41 UTC  null

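Please note: the 590 above is the sum of the time spent for three entries (in seconds), not just the two listed in your question. I assume that was simply a typo on your side.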
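As a rough cross-check of the 590 figure, the sketch below redoes the arithmetic for Id 286 in plain Python. The helper `ts` and the `rows` list are illustrative names that are not part of the query above, and the whole-second truncation is an assumption meant to mirror `TIMESTAMP_DIFF(..., SECOND)`.

```python
from datetime import datetime, timedelta, timezone


def ts(text):
    """Parse 'YYYY-MM-DD HH:MM:SS[.fff]' as a UTC datetime (illustrative helper)."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


# Rows for Id 286 from the question, already in time order: (timestamp, geohash)
rows = [
    (ts("2017-02-13 12:24:36"), "abcdef"),
    (ts("2017-02-13 12:33:02.063"), "abcdef"),
    (ts("2017-02-13 12:33:05.063"), "cvzvvv"),
    (ts("2017-02-13 12:33:08"), "abcdef"),
    (ts("2017-02-13 12:34:29"), "hsffds"),
]

first_geohash = rows[0][1]

# Pair each row with the next one (the LEAD step) and keep only the
# intervals that start at the first geohash.
intervals = [
    (end - start) // timedelta(seconds=1)  # whole seconds, fraction dropped
    for (start, geohash), (end, _) in zip(rows, rows[1:])
    if geohash == first_geohash
]
total = sum(intervals)

print(intervals, total)  # prints [506, 3, 81] 590
```

The three intervals correspond to the three "abcdef" rows: each one runs from that row to the next row for the same Id, whatever Geohash the next row has.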

0

If I understand correctly, you want something like this:

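def timedelta(df):
    # Sort chronologically so the first and last rows bound the time span
    df = df.sort_values(by='DateTime')
    # Last minus first, so the duration is not negative
    return df.iloc[-1]['DateTime'] - df.iloc[0]['DateTime']

# Assumes df['DateTime'] has already been parsed, e.g. with pd.to_datetime
df.groupby(['Id', 'Geohash']).apply(timedelta)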
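Since the question asks about pandas, here is a minimal sketch of how the window-function logic from the BigQuery answer might be mirrored with `groupby`, `shift` and `transform`. The column names `TimeSpent` and `FirstGeohash` are borrowed from that query, and the sample frame is rebuilt by hand from the question's data. It is an assumption of this sketch that fractional seconds should be kept rather than truncated, so the total for Id 286 comes out as 590.063 seconds instead of 590.

```python
import pandas as pd

# Sample data from the question; every timestamp is written with
# milliseconds so that pd.to_datetime sees one consistent format.
df = pd.DataFrame({
    "Id": [286, 286, 286, 286, 286, 368, 141],
    "DateTime": pd.to_datetime([
        "2017-02-13 12:24:36.000",
        "2017-02-13 12:33:02.063",
        "2017-02-13 12:33:05.063",
        "2017-02-13 12:33:08.000",
        "2017-02-13 12:34:29.000",
        "2017-02-13 00:23:07.063",
        "2017-02-13 00:00:41.000",
    ], utc=True),
    "Geohash": ["abcdef", "abcdef", "cvzvvv", "abcdef", "hsffds",
                "weruio", "oqruqq"],
})

df = df.sort_values(["Id", "DateTime"])

# Equivalent of LEAD(DateTime) OVER (PARTITION BY Id ORDER BY DateTime)
next_time = df.groupby("Id")["DateTime"].shift(-1)
df["TimeSpent"] = (next_time - df["DateTime"]).dt.total_seconds()

# Equivalent of FIRST_VALUE(Geohash) OVER (PARTITION BY Id ORDER BY DateTime)
df["FirstGeohash"] = df.groupby("Id")["Geohash"].transform("first")

# Keep only rows at the starting geohash, then total the time per Id.
at_start = df[df["Geohash"] == df["FirstGeohash"]]
result = (
    at_start.groupby(["Id", "Geohash"])["TimeSpent"]
    .sum(min_count=1)  # min_count=1 keeps NaN for Ids with a single row
    .reset_index()
)

print(result)
```

Ids with only one row (368 and 141 here) have no following timestamp, so their `TimeSpent` stays missing, matching the `null` values in the SQL result.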