2016-01-22

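Generating an order rank column with dplyr based on a grouping variable

I'm having some difficulty using dplyr to generate a rank column on a tbl_df object built from the transaction log of a specific consumer. The data I have looks like this: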

                                 consumerid merchant_id      eventtimestamp merchant_visit_rank
                                      (chr)       (int)              (time)               (dbl)
    1  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-15 13:33:00                   0
    2  004a5cc3-3d60-4d14-85b3-706e454aae13          56 2015-01-16 13:58:03                   1
    3  004a5cc3-3d60-4d14-85b3-706e454aae13          56 2015-01-16 13:58:41                   0
    4  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-16 13:59:05                   1
    5  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-16 13:59:55                   1
    6  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-16 14:15:56                   0
    7  004a5cc3-3d60-4d14-85b3-706e454aae13          58 2015-01-21 13:52:18                   1
    8  004a5cc3-3d60-4d14-85b3-706e454aae13          58 2015-01-21 13:52:19                   0
    9  004a5cc3-3d60-4d14-85b3-706e454aae13          54 2015-01-21 13:52:24                   0
    10 004a5cc3-3d60-4d14-85b3-706e454aae13          58 2015-01-21 13:52:29                   0
    ..                                  ...         ...                 ...                 ...

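I want to generate a merchant visit rank that tells me the order in which each merchant was visited during this transaction session. Consecutive rows for the same merchant share a rank, and the rank increases by one each time the merchant changes. In our case, the correct ranking would look like this: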

                                 consumerid merchant_id      eventtimestamp merchant_visit_rank
                                      (chr)       (int)              (time)               (dbl)
    1  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-15 13:33:00                   1
    2  004a5cc3-3d60-4d14-85b3-706e454aae13          56 2015-01-16 13:58:03                   2
    3  004a5cc3-3d60-4d14-85b3-706e454aae13          56 2015-01-16 13:58:41                   2
    4  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-16 13:59:05                   3
    5  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-16 13:59:55                   3
    6  004a5cc3-3d60-4d14-85b3-706e454aae13          52 2015-01-16 14:15:56                   3
    7  004a5cc3-3d60-4d14-85b3-706e454aae13          58 2015-01-21 13:52:18                   4
    8  004a5cc3-3d60-4d14-85b3-706e454aae13          58 2015-01-21 13:52:19                   4
    9  004a5cc3-3d60-4d14-85b3-706e454aae13          54 2015-01-21 13:52:24                   5
    10 004a5cc3-3d60-4d14-85b3-706e454aae13          58 2015-01-21 13:52:29                   6
    ..                                  ...         ...                 ...                 ...

I have tried playing with window functions in dplyr, like this:

  measure_media_interaction %>% 
       #selecting the fields we wish from the dataframe 
       select(consumerid,merchant_id,eventtimestamp) %>% 
       #mutate a placeholder column to be used for the rank 
       mutate(merchant_visit = 0) %>% 
       #sort them by consumer and timestamp 
       arrange(consumerid,eventtimestamp) %>% 
       #change the column so it shows that this merchant was the first this consumer visited 
       #or not 
       mutate(merchant_visit = 
         ifelse(lead(merchant_id)!=merchant_id,merchant_visit,merchant_visit+1)) 

But I'm stuck and don't know how to make it work. Any ideas?

Answer

Here is one solution. We use lag to test whether merchant_id differs from the previous row, and cumsum to increment a counter each time it does.

measure_media_interaction %>% 
    select(consumerid,merchant_id,eventtimestamp) %>% 
    arrange(consumerid,eventtimestamp) %>% 
    mutate(merchant_visit=cumsum(c(1,(merchant_id != lag(merchant_id))[-1])))
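
The pipeline above numbers the whole table in one pass, so with more than one consumer the counter would keep running from one consumer to the next. If the rank is meant to restart for each consumer (an assumption based on the "grouping variable" in the title), a minimal sketch is to add group_by(consumerid) before the mutate. The data frame name visits, the shortened ids "consumer-a" and "consumer-b", and the three rows for the second consumer are made up for illustration; the first ten rows mirror the merchant ids and timestamps from the question.

```r
library(dplyr)

# Toy data: ten rows mirroring the question, plus a hypothetical second
# consumer ("consumer-b") added only to show the per-consumer reset.
visits <- tibble(
  consumerid = c(rep("consumer-a", 10), rep("consumer-b", 3)),
  merchant_id = c(52L, 56L, 56L, 52L, 52L, 52L, 58L, 58L, 54L, 58L,
                  58L, 58L, 52L),
  eventtimestamp = as.POSIXct(c(
    "2015-01-15 13:33:00", "2015-01-16 13:58:03", "2015-01-16 13:58:41",
    "2015-01-16 13:59:05", "2015-01-16 13:59:55", "2015-01-16 14:15:56",
    "2015-01-21 13:52:18", "2015-01-21 13:52:19", "2015-01-21 13:52:24",
    "2015-01-21 13:52:29",
    "2015-01-22 09:00:00", "2015-01-22 09:01:00", "2015-01-22 09:02:00"
  ), tz = "UTC")
)

ranked <- visits %>%
  # order events chronologically within each consumer
  arrange(consumerid, eventtimestamp) %>%
  # assumption: the counter should restart for every consumer
  group_by(consumerid) %>%
  # same lag/cumsum expression as in the answer: start at 1, then add 1
  # whenever merchant_id differs from the previous row
  mutate(merchant_visit_rank =
           cumsum(c(1, (merchant_id != lag(merchant_id))[-1]))) %>%
  ungroup()

print(ranked)
```

For the first consumer this reproduces the ranks 1, 2, 2, 3, 3, 3, 4, 4, 5, 6 from the question, and the second consumer starts again at 1. With dplyr 1.1.0 or later, consecutive_id(merchant_id) inside the grouped mutate should give the same numbering.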