2017-12-18 217 views

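PostgreSQL query to delete records with overlapping times while keeping the earliest?

I am trying to find a way to delete records with overlapping times, but I can't work out a simple and elegant way to keep exactly one of the overlapping records and delete the rest. This question is similar to this one, but with some differences. Our table looks like: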

╔════╤═══════════════════════════════════════╤══════════════════════════════════════╤════════╤═════════╗ 
║ id │ start_time       │ end_time        │ bar │ baz  ║ 
╠════╪═══════════════════════════════════════╪══════════════════════════════════════╪════════╪═════════╣ 
║ 0 │ Mon, 18 Dec 2017 16:08:33 UTC +00:00 │ Mon, 18 Dec 2017 17:08:33 UTC +00:00 │ "ham" │ "eggs" ║ 
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢ 
║ 1 │ Mon, 18 Dec 2017 16:08:32 UTC +00:00 │ Mon, 18 Dec 2017 17:08:32 UTC +00:00 │ "ham" │ "eggs" ║ 
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢ 
║ 2 │ Mon, 18 Dec 2017 16:08:31 UTC +00:00 │ Mon, 18 Dec 2017 17:08:31 UTC +00:00 │ "spam" │ "bacon" ║ 
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢ 
║ 3 │ Mon, 18 Dec 2017 16:08:30 UTC +00:00 │ Mon, 18 Dec 2017 17:08:30 UTC +00:00 │ "ham" │ "eggs" ║ 
╚════╧═══════════════════════════════════════╧══════════════════════════════════════╧════════╧═════════╝ 

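In the example above, all of the records have overlapping times, where overlapping just means that the time range defined by a record's start_time and end_time (inclusive) covers or extends over part of another record's range. For this question, however, we are only interested in records that have overlapping times and also have matching bar and baz columns (rows 0, 1 and 3 above). Having found those records, we want to delete all but the earliest, leaving the table with only records 2 and 3: record 2 has no match on bar and baz, and record 3 has the earliest start and end times.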

This is what I have so far:

delete from foos where id in (
    select 
     foo_one.id 
    from 
     foos foo_one 
    where 
     user_id = 42 
     and exists (
     select 
      1 
     from 
      foos foo_two 
     where 
      tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]') && 
      tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]') 
      and 
      foo_one.bar = foo_two.bar 
      and 
      foo_one.baz = foo_two.baz 
      and 
      user_id = 42 
      and 
      foo_one.id != foo_two.id 
    ) 
); 

Thanks for reading!

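Update: I found a solution that works for me. Essentially, I can apply the window function row_number() over a partition of the table grouped by the bar and baz fields, then add a WHERE clause to the DELETE statement that excludes the first entry in each partition (the one with the smallest id):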

delete from foos where id in (
    select id from (
     select 
      foo_one.id, 
      row_number() over(partition by 
           bar, 
           baz 
          order by id asc) 
     from 
      foos foo_one 
     where 
      user_id = 42 
      and exists (
      select 
       * 
      from 
       foos foo_two 
      where 
       tsrange(foo_two.start_time::timestamp, 
         foo_two.end_time::timestamp, 
         '[]') && 
       tsrange(foo_one.start_time::timestamp, 
         foo_one.end_time::timestamp, 
         '[]') 
       and 
       foo_one.id != foo_two.id 
     ) 
    ) foos where row_number <> 1 
); 
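
Note that ordering by id keeps the row with the smallest id in each bar/baz partition, which for the sample table would be row 0 rather than row 3. If the goal is to keep the row with the earliest start_time, as described at the top, one possible variant is sketched below. It is untested against the real schema. It deletes every row for which an overlapping row with the same bar and baz starts earlier. The aliases doomed and keeper are made up for illustration, and breaking ties on start_time by id is an assumption.

```sql
-- Sketch only: delete each row that has an earlier-starting, overlapping
-- row with the same bar/baz for the same user.
DELETE FROM foos AS doomed
WHERE doomed.user_id = 42
  AND EXISTS (
      SELECT 1
      FROM foos AS keeper
      WHERE keeper.user_id = doomed.user_id
        AND keeper.bar = doomed.bar
        AND keeper.baz = doomed.baz
        AND keeper.id <> doomed.id
        -- inclusive ranges, as in the original query
        AND tsrange(keeper.start_time::timestamp, keeper.end_time::timestamp, '[]')
            && tsrange(doomed.start_time::timestamp, doomed.end_time::timestamp, '[]')
        -- keeper starts earlier; equal start times fall back to the lower id (assumption)
        AND (keeper.start_time, keeper.id) < (doomed.start_time, doomed.id)
  );
```

Assuming all four sample rows belong to user 42, this would remove rows 0 and 1 and leave rows 2 and 3. One caveat: in a chain where A overlaps B and B overlaps C but A does not overlap C, both B and C are deleted, because each has an earlier overlapping neighbour.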
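Please edit your question and add some [sample data](http://plaintexttools.github.io/plain-text-table/) and the expected output based on that data. [Formatted text](http://stackoverflow.com/help/formatting) please, [no screenshots](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557). – klin

I'm curious why this is tagged ruby-on-rails. – jvillian

Because it is for a RoR project, and I didn't want people to trip over Ruby-style string interpolation in the queries above. – dynsne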

1 Answer

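First, a small note: you really should provide some more information. I understand you may not want to show the real columns of your business, but as it stands it is harder to understand what you want.

Still, I'll offer some hints on the problem. I hope they help you and anyone else with a similar problem.

  1. You need to be clear about what defines an overlap. It can mean quite different things to different people.

Look at these events: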

<--a--> 
    <---- b ----> 
     <---- c ----> 
      <-- d --> 
      <---- e ----> 
    <------- f --------> 
        <--- g ---> 

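If you define overlap the way Google's dictionary does, "to extend over so as to cover partly", then "b", "d", "e" and "f" partially overlap event "c". If you define overlap as covering the whole event, then "c" overlaps "d", and "f" overlaps "b", "c" and "d".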
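
In PostgreSQL, these two definitions map onto two different range operators: && (the ranges share at least one point) and @> (the left range contains the right one). The sketch below uses literal times for "c", "d" and "e" taken from the sample rows defined further down in this answer. The inclusive '[]' bounds are an assumption carried over from the question.

```sql
-- c = 07:00-19:00, d = 09:00-17:00, e = 17:00-23:00 (all on 2017-10-09)
SELECT
    -- partial overlap: c and e share the span 17:00-19:00
    tsrange(timestamp '2017-10-09 07:00', timestamp '2017-10-09 19:00', '[]')
        && tsrange(timestamp '2017-10-09 17:00', timestamp '2017-10-09 23:00', '[]')
        AS c_overlaps_e,   -- true
    -- full coverage: c does not contain e, because e ends after c
    tsrange(timestamp '2017-10-09 07:00', timestamp '2017-10-09 19:00', '[]')
        @> tsrange(timestamp '2017-10-09 17:00', timestamp '2017-10-09 23:00', '[]')
        AS c_contains_e,   -- false
    -- full coverage: d lies entirely inside c
    tsrange(timestamp '2017-10-09 07:00', timestamp '2017-10-09 19:00', '[]')
        @> tsrange(timestamp '2017-10-09 09:00', timestamp '2017-10-09 17:00', '[]')
        AS c_contains_d;   -- true
```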

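  2. Deleting groups can be a problem. In the case above, what should we do? Should we delete "b", "c" and "d" and keep "f"? Should we sum their values? Maybe average them? It is a column-by-column decision, and the meaning of each column matters a lot. So I can't help you with "bar" and "baz".

  3. So, trying to guess what you really want, I created a similar table of events with id, user_id, start, end and a name: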

    create table events (
        id integer, 
        user_id integer, 
        start_time timestamp, 
        end_time timestamp, 
        name varchar(100) 
    ); 
    

I added some example values:

insert into events 
    (id, user_id, start_time, end_time, name) values 
    (1, 1000, timestamp('2017-10-09 01:00:00'),timestamp('2017-10-09 04:00:00'), 'a'); 

    insert into events 
    (id, user_id, start_time, end_time, name) values 
    (2, 1000, timestamp('2017-10-09 03:00:00'),timestamp('2017-10-09 15:00:00'), 'b'); 

    insert into events 
    (id, user_id, start_time, end_time, name) values 
    (3, 1000, timestamp('2017-10-09 07:00:00'),timestamp('2017-10-09 19:00:00'), 'c'); 

    insert into events 
    (id, user_id, start_time, end_time, name) values 
    (4, 1000, timestamp('2017-10-09 09:00:00'),timestamp('2017-10-09 17:00:00'), 'd'); 

    insert into events 
    (id, user_id, start_time, end_time, name) values 
    (5, 1000, timestamp('2017-10-09 17:00:00'),timestamp('2017-10-09 23:00:00'), 'e'); 

    insert into events 
    (id, user_id, start_time, end_time, name) values 
    (6, 1000, timestamp('2017-10-09 02:30:00'),timestamp('2017-10-09 22:00:00'), 'f'); 

    insert into events 
    (id, user_id, start_time, end_time, name) values 
    (7, 1000, timestamp('2017-10-09 17:30:00'),timestamp('2017-10-10 02:00:00'), 'g'); 
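
The inserts above use the MySQL function form timestamp('...'), which matches the MySQL fiddle this answer was tested on. For anyone reproducing the data in PostgreSQL, a sketch of the same seven rows using typed literals and a single multi-row INSERT might look like this (same values as above, only the syntax differs):

```sql
-- PostgreSQL-flavoured version of the sample data (sketch)
INSERT INTO events (id, user_id, start_time, end_time, name) VALUES
    (1, 1000, timestamp '2017-10-09 01:00:00', timestamp '2017-10-09 04:00:00', 'a'),
    (2, 1000, timestamp '2017-10-09 03:00:00', timestamp '2017-10-09 15:00:00', 'b'),
    (3, 1000, timestamp '2017-10-09 07:00:00', timestamp '2017-10-09 19:00:00', 'c'),
    (4, 1000, timestamp '2017-10-09 09:00:00', timestamp '2017-10-09 17:00:00', 'd'),
    (5, 1000, timestamp '2017-10-09 17:00:00', timestamp '2017-10-09 23:00:00', 'e'),
    (6, 1000, timestamp '2017-10-09 02:30:00', timestamp '2017-10-09 22:00:00', 'f'),
    (7, 1000, timestamp '2017-10-09 17:30:00', timestamp '2017-10-10 02:00:00', 'g');
```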

Now we can play with some queries.

List every event that fully covers another event:

select 
    # EVENT NAME 
    event_1.name as event_name, 
    # LIST EVENTS THAT THE EVENT OVERLAPS 
    GROUP_CONCAT(event_2.name) as overlaps_names 
from events as event_1 
inner join events as event_2 
on 
    event_1.user_id = event_2.user_id 
and 
    event_1.id != event_2.id 
and 
(
    # START AFTER THE EVENT ONE 
    event_2.start_time >= event_1.start_time and 
    # ENDS BEFORE THE EVENT ONE 
    event_2.end_time <= event_1.end_time 
) 
    group by 
event_1.name 

Result:

+------------+----------------+ 
| event_name | overlaps_names | 
+------------+----------------+ 
| c   | d    | 
| f   | b,d,c   | 
+------------+----------------+ 

To detect partial overlaps, you need something like this:

select 
    # EVENT NAME 
    event_1.name as event_name, 
    # LIST EVENTS THAT THE EVENT OVERLAPS 
    GROUP_CONCAT(event_2.name) as overlaps_names 
from events as event_1 
inner join events as event_2 
on 
    event_1.user_id = event_2.user_id 
and 
    event_1.id != event_2.id 
and 
(
    (
    # EVENT TWO STARTS INSIDE EVENT ONE 
    event_2.start_time >= event_1.start_time and 
    event_2.start_time <= event_1.end_time 
    ) or 
    (
    # EVENT TWO ENDS INSIDE EVENT ONE 
    event_2.end_time >= event_1.start_time and 
    event_2.end_time <= event_1.end_time 
    ) 
) 
    group by 
event_1.name 

Result:

+------------+----------------+ 
| event_name | overlaps_names | 
+------------+----------------+ 
| a   | b,f   | 
| b   | c,d,a   | 
| c   | b,d,e,g  | 
| d   | b,e   | 
| e   | f,g,d,c  | 
| f   | a,g,b,d,c,e | 
| g   | c,e,f   | 
+------------+----------------+ 
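
The two-branch test above only matches when event_2 starts or ends inside event_1, so it misses the case where event_2 completely surrounds event_1. In the result table, "f" (02:30 to 22:00) contains "b" (03:00 to 15:00), yet "f" does not appear in b's list. A common, more compact test says that two closed intervals overlap exactly when each one starts no later than the other ends. It is shown here as a sketch without aggregation, so that it does not depend on GROUP_CONCAT:

```sql
-- Sketch: symmetric overlap test that also catches full containment
SELECT
    event_1.name AS event_name,
    event_2.name AS overlaps_name
FROM events AS event_1
INNER JOIN events AS event_2
    ON  event_1.user_id = event_2.user_id
    AND event_1.id <> event_2.id
    -- event two starts before (or when) event one ends ...
    AND event_2.start_time <= event_1.end_time
    -- ... and ends after (or when) event one starts
    AND event_2.end_time >= event_1.start_time
ORDER BY event_1.name, event_2.name;
```

With inclusive comparisons, events that merely touch (one ends at the instant the other starts) count as overlapping. Switch to < and > if touching should not count.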

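Of course, I am using "group by" to make the output easier to read. It would also be useful if you want to sum or average the overlapping data to update your parent row before deleting. The "group_concat" function may not exist in Postgres, or may have a different name. A "standard SQL" version you can test is: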

select 
    -- EVENT NAME 
    event_1.name as event_name, 
    -- NAME OF AN EVENT THAT THE EVENT OVERLAPS 
    event_2.name as overlaps_name 
from events as event_1 
inner join events as event_2 
on 
    event_1.user_id = event_2.user_id 
and 
    event_1.id != event_2.id 
and 
(
    -- STARTS AFTER THE EVENT ONE 
    event_2.start_time >= event_1.start_time and 
    -- ENDS BEFORE THE EVENT ONE 
    event_2.end_time <= event_1.end_time 
) 

Result:

+------------+---------------+ 
| event_name | overlaps_name | 
+------------+---------------+ 
| f   | b    | 
| f   | c    | 
| c   | d    | 
| f   | d    | 
+------------+---------------+ 
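
On the group_concat question: PostgreSQL's built-in string_agg aggregate plays a similar role. A sketch of the first (full coverage) query rewritten for PostgreSQL could look like the following. The ORDER BY inside the aggregate is an addition here, only to make the list order deterministic.

```sql
-- Sketch: PostgreSQL equivalent of the GROUP_CONCAT query
SELECT
    event_1.name AS event_name,
    -- comma-separated names of the events that event_1 fully covers
    string_agg(event_2.name, ',' ORDER BY event_2.name) AS overlaps_names
FROM events AS event_1
INNER JOIN events AS event_2
    ON  event_1.user_id = event_2.user_id
    AND event_1.id <> event_2.id
    AND event_2.start_time >= event_1.start_time
    AND event_2.end_time <= event_1.end_time
GROUP BY event_1.name;
```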

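If you want to try some math, keep in mind the risk of adding the values of "c" and "d" to "b", and then adding those values to "f" again, which makes the value of "f" wrong.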

// should be 
new f = old f + b + old c + d 
new c = old c + b + d // unnecessary if you are going to delete it 

// very common mistake 
new c = old c + b + d // unnecessary but not wrong yet 
new f = new c + b + d = (old c + b + d) + b + d // wrong!! 

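You can test all of these queries, and create your own, online against the same database at http://sqlfiddle.com/#!9/1d2455/19. Remember, though, that it is MySQL, not PostgreSQL. It is still fine for testing standard SQL.

There is a StackOverflow thread about translating group_concat to Postgres: https://stackoverflow.com/questions/2560946/postgresql-group-concat-equivalent. It looks simple. –

Thanks for the reply! I didn't end up going this route, but it's an interesting approach. – dynsne

If the answer was useful, don't forget to mark it. –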