LAG offset

TL;DR: scroll down to Task 2, the LAG offset problem.

I am working with the following data set:

email,createdby,createdon 
[email protected],jsmith,2016-10-10 
[email protected],nsmythe,2016-09-09 
[email protected],vstark,2016-11-11 
[email protected],ajohnson,2015-02-03 
[email protected],elear,2015-01-01 
...

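and so on. Every email is guaranteed to have at least one duplicate in the data set.

There are two tasks to solve. I have solved one of them, but I am struggling with the other. I describe both here for completeness.

TASK 1 (solved): For every row, per email, return an additional column with the name of the user who created the first record with that email.

Expected result for the sample data set above: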

email,createdby,createdon,original_createdby 
[email protected],jsmith,2016-10-10,nsmythe 
[email protected],nsmythe,2016-09-09,nsmythe 
[email protected],vstark,2016-11-11,nsmythe 
[email protected],ajohnson,2015-02-03,elear 
[email protected],elear,2015-01-01,elear

The code that produces the above:

;WITH q0 -- this is just a security measure in case there are unique emails in the data set 
      AS (SELECT t.email 
       FROM  t 
       GROUP BY t.email 
       HAVING COUNT(*) > 1) , 
     q1 
      AS (SELECT q0.email 
         , createdon 
         , createdby 
         , ROW_NUMBER() OVER (PARTITION BY q0.email ORDER BY createdon) rn 
       FROM  t 
       JOIN  q0 
         ON t.email = q0.email) 
    SELECT q1.email 
      , q1.createdon 
      , q1.createdby 
      , LAG(q1.createdby, q1.rn - 1) OVER (ORDER BY q1.email, q1.createdon) original_createdby 
    FROM q1 
    ORDER BY q1.email 
      , q1.rn

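Brief explanation: I partition the data set by email, number the rows in each partition in order of creation date, and finally return the [createdby] value from the record (rn - 1) rows back. This works exactly as expected.

Now, similar to the above, there is Task 2:

TASK 2: For every row, per email, return the name of the user who created the first duplicate, i.e. the user name from the row where rn = 2.

Expected result: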

email,createdby,createdon,first_dupl_createdby 
[email protected],jsmith,2016-10-10,jsmith 
[email protected],nsmythe,2016-09-09,jsmith 
[email protected],vstark,2016-11-11,jsmith 
[email protected],ajohnson,2015-02-03,ajohnson 
[email protected],elear,2015-01-01,ajohnson

I want to keep performance high, so I tried to use the LEAD and LAG functions:

WITH q0 
      AS (SELECT t.email 
       FROM  t 
       GROUP BY t.email 
       HAVING COUNT(*) > 1) , 
     q1 
      AS (SELECT q0.email 
         , createdon 
         , createdby 
         , ROW_NUMBER() OVER (PARTITION BY q0.email ORDER BY createdon) rn 
       FROM  t 
       JOIN  q0 
         ON t.email = q0.email) 
    SELECT q1.email 
      , q1.createdon 
      , q1.createdby 
      , q1.rn 
      , CASE q1.rn 
       WHEN 1 THEN LEAD(q1.createdby, 1) OVER (ORDER BY q1.email, q1.createdon) 
       ELSE LAG(q1.createdby, q1.rn - 2) OVER (ORDER BY q1.email, q1.createdon) 
      END AS first_dupl_createdby 
    FROM q1 
    ORDER BY q1.email 
      , q1.rn

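Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition, return [createdby] from (rn - 2) records back (i.e. for rn = 2 we stay on the same record, for rn = 3 we go back 1 record, for rn = 4 we go back 2 records, and so on).

A problem arises with the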

ELSE LAG(q1.createdby, q1.rn - 2)

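operation. Apparently, against all logic and despite the preceding branch (WHEN 1 THEN ...), the ELSE branch is also evaluated for rn = 1, which passes a negative offset to the LAG function:

Msg 8730, Level 16, State 2, Line 37 Offset parameter for Lag and Lead functions cannot be a negative value.

When I comment out the ELSE line, the whole thing runs fine, but obviously I don't get any results in the first_dupl_createdby column.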

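QUESTION: Is there any way to rewrite the CASE statement above (in Task 2) so that it always returns the value from the record with rn = 2 in each partition, but (and this is the important bit) without a self-JOIN? I know I could prepare the rn = 2 rows in a separate subquery, but that would mean an extra scan of the whole table and an unnecessary self-JOIN.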
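
For reference, one way to keep the LEAD/LAG approach and still avoid error 8730 is to clamp the offset so that it can never be negative. The likely cause of the error is that window functions are computed for every row before the outer CASE picks a branch, so the offset rn - 2 is also evaluated where rn = 1. The following is a minimal sketch and has not been verified against a live server. It assumes the same table t as above, and the inner CASE used as the offset is an addition that does not appear in the original query.

```sql
-- Sketch only: same CTEs as the Task 2 query, with a clamped LAG offset.
WITH q0 AS (
    SELECT t.email
    FROM t
    GROUP BY t.email
    HAVING COUNT(*) > 1
),
q1 AS (
    SELECT q0.email
         , t.createdon
         , t.createdby
         , ROW_NUMBER() OVER (PARTITION BY q0.email ORDER BY t.createdon) AS rn
    FROM t
    JOIN q0
      ON t.email = q0.email
)
SELECT q1.email
     , q1.createdon
     , q1.createdby
     , q1.rn
     , CASE q1.rn
         WHEN 1 THEN LEAD(q1.createdby, 1) OVER (ORDER BY q1.email, q1.createdon)
         -- Rows with rn = 1 get offset 0 instead of -1; their LAG result
         -- is discarded by the outer CASE, so only the error is avoided.
         ELSE LAG(q1.createdby, CASE WHEN q1.rn >= 2 THEN q1.rn - 2 ELSE 0 END)
                  OVER (ORDER BY q1.email, q1.createdon)
       END AS first_dupl_createdby
FROM q1
ORDER BY q1.email
       , q1.rn;
```

Note that this still relies on the global ORDER BY (email, createdon) agreeing with the per-email row numbering, which may not hold if two records for the same email share the same createdon value.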

2016-11-16 Piotr L

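Edit your question to include the *results* you expect from your sample data.

This may sound silly, but what if you use 'ROW_NUMBER() ... + 2' as 'rn' in 'q1'? Then in your 'case' expression you could use 'CASE q1.rn WHEN 3 THEN ... ELSE LAG(q1.createdby, q1.rn)' – Lamak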

I think you can simply use the MAX window function, since you are trying to get the value from row number 2 in each partition.

SELECT q1.email 
      , q1.createdon 
      , q1.createdby 
      , q1.rn 
      , max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by 
FROM q1 
ORDER BY q1.email, q1.rn

You can use a similar query to get the row number = 1 result for the first scenario.
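
To make the two points above concrete, here is a self-contained sketch that covers both tasks with the same MAX window pattern. The email addresses are hypothetical placeholders (the real ones are not shown in the question), and the inline VALUES list stands in for the table t. The q0 filter is left out on the assumption that every email has a duplicate; for an email with a single record, first_dupl_createdby would simply be NULL.

```sql
-- Hypothetical sample data: a@example.com and b@example.com are placeholders.
WITH t AS (
    SELECT v.email
         , v.createdby
         , CAST(v.createdon AS date) AS createdon
    FROM (VALUES
            ('a@example.com', 'jsmith',   '2016-10-10')
          , ('a@example.com', 'nsmythe',  '2016-09-09')
          , ('a@example.com', 'vstark',   '2016-11-11')
          , ('b@example.com', 'ajohnson', '2015-02-03')
          , ('b@example.com', 'elear',    '2015-01-01')
         ) AS v (email, createdby, createdon)
),
q1 AS (
    SELECT t.email
         , t.createdby
         , t.createdon
         , ROW_NUMBER() OVER (PARTITION BY t.email ORDER BY t.createdon) AS rn
    FROM t
)
SELECT q1.email
     , q1.createdby
     , q1.createdon
       -- Task 1: creator of the earliest record per email (rn = 1).
     , MAX(CASE WHEN q1.rn = 1 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS original_createdby
       -- Task 2: creator of the first duplicate per email (rn = 2).
     , MAX(CASE WHEN q1.rn = 2 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS first_dupl_createdby
FROM q1
ORDER BY q1.email
       , q1.rn;
```

Because exactly one row per partition satisfies each CASE condition, MAX just picks that single non-NULL value and spreads it across the partition, with no LAG offset and no self-join involved.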

2016-11-16 13:25:04

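When you are very focused on a specific language feature (in this case LAG/LEAD), you forget about the simple things. This is the most obvious answer, and I feel ashamed now. Thank you.

You can use row_number() and conditional aggregation to get the information for each email: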

select email, 
     max(case when seqnum = 1 then createdby end) as createdby_first, 
     max(case when seqnum = 2 then createdby end) as createdby_second 
from (select t.*, 
      row_number() over (partition by email order by createdon) as seqnum 
     from t 
    ) t 
group by email;

You can join this information back to the original data to get the result you want. I don't see lag() as a natural fit for this problem.
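
A minimal sketch of that join-back step, assuming the same table t and reusing the aggregate query above as a derived table. The aliases s and f are illustrative only.

```sql
-- Sketch: join the per-email aggregate back to every original row.
SELECT t.email
     , t.createdby
     , t.createdon
     , f.createdby_first
     , f.createdby_second
FROM t
JOIN (
        SELECT s.email
             , MAX(CASE WHEN s.seqnum = 1 THEN s.createdby END) AS createdby_first
             , MAX(CASE WHEN s.seqnum = 2 THEN s.createdby END) AS createdby_second
        FROM (
                SELECT t.*
                     , ROW_NUMBER() OVER (PARTITION BY t.email ORDER BY t.createdon) AS seqnum
                FROM t
             ) AS s
        GROUP BY s.email
     ) AS f
  ON f.email = t.email
ORDER BY t.email
       , t.createdon;
```

This form references t twice, so it is the kind of extra scan the question was hoping to avoid.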

2016-11-16 13:24:49

/shrug

; WITH duplicate_email_addresses AS (
    SELECT email 
    FROM t 
    GROUP 
     BY email 
    HAVING Count(*) > 1 
) 
, records_with_duplicate_email_addresses AS (
    SELECT email 
     , createdon 
     , createdby 
     , Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer 
    FROM t 
    WHERE EXISTS (
      SELECT * 
      FROM duplicate_email_addresses 
      WHERE email = t.email 
     ) 
) 
, second_duplicate_record AS (-- Why do you need any more than this? 
    SELECT email 
     , createdon 
     , createdby 
    FROM records_with_duplicate_email_addresses 
    WHERE sequencer = 2 
) 
SELECT records_with_duplicate_email_addresses.email 
    , records_with_duplicate_email_addresses.createdon 
    , records_with_duplicate_email_addresses.createdby 
    , second_duplicate_record.createdby AS first_duplicate_createdby 
FROM records_with_duplicate_email_addresses 
INNER 
    JOIN second_duplicate_record 
    ON second_duplicate_record.email = records_with_duplicate_email_addresses.email 
;

2016-11-16 13:30:43 gvee

This is exactly what I was trying to avoid (a self-join), but thank you for the comprehensive lesson in SQL formatting and naming.
