2017-06-02 83 views

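Optimizing a PostgreSQL inner join (specifically a self-join) on multiple columns

I am running PostgreSQL 9.6.2 and have a table with 7 columns and about 2,900,000 rows. The table is temporary and is part of a subject deduplication process that assigns a new ID (s_id_new) to identical subjects according to different rule sets. In total I run about 10-12 inner joins, each similar but with a slightly different subset of data, different WHERE conditions, or different join columns.

Right now the query is very inefficient and does not finish (I had to cancel it after 2 hours).

For optimization purposes I created a subset of the data (50,000 rows).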

\d subject_subset;
     Column     |          Type          | Modifiers
----------------+------------------------+-----------
 s_id           | text                   |
 surname_clean  | character varying(20)  |
 name_clean     | character varying(20)  |
 fullname_clean | character varying(100) |
 id1            | character varying(20)  |
 id2            | character varying(20)  |
 id3            | character varying(20)  |
 s_id_new       | character varying(20)  |
Indexes:
    "subject_subset_s_id_new_idx" btree (s_id_new)

The query I want to optimize:

select s_id_new, max(I_s_id) as s_id_deduplicated
from (select a.*, b.s_id_new as I_s_id
      from public.subject_subset a
      inner join public.subject_subset b
              on a.surname_clean = b.surname_clean
             and a.id2 = b.id2
      where a.id1 is null
        and a.id2 is not null
        and a.surname_clean is not null) h
group by s_id_new;



The result of the EXPLAIN ANALYZE: 
https://explain.depesz.com/s/7knH 

"GroupAggregate (cost=5616.65..5620.39 rows=142 width=90) (actual time=32542.127..46938.858 rows=2889 loops=1)" 
" Group Key: a.s_id_new" 
" -> Sort (cost=5616.65..5617.42 rows=310 width=116) (actual time=32542.116..43194.626 rows=18356220 loops=1)" 
"  Sort Key: a.s_id_new" 
"  Sort Method: external merge Disk: 531760kB" 
"  -> Hash Join (cost=1114.72..5603.82 rows=310 width=116) (actual time=13.159..4892.011 rows=18356220 loops=1)" 
"    Hash Cond: (((b.surname_clean)::text = (a.surname_clean)::text) AND ((b.id2)::text = (a.id2)::text))" 
"    -> Seq Scan on subject_subset b (cost=0.00..1111.00 rows=50000 width=174) (actual time=0.011..10.775 rows=50000 loops=1)" 
"    -> Hash (cost=1111.00..1111.00 rows=248 width=174) (actual time=13.137..13.137 rows=15044 loops=1)" 
"     Buckets: 16384 (originally 1024) Batches: 1 (originally 1) Memory Usage: 1151kB" 
"     -> Seq Scan on subject_subset a (cost=0.00..1111.00 rows=248 width=174) (actual time=0.005..9.330 rows=15044 loops=1)" 
"       Filter: ((id1 IS NULL) AND (id2 IS NOT NULL) AND (surname_clean IS NOT NULL))" 
"       Rows Removed by Filter: 34956" 
"Planning time: 0.236 ms" 
"Execution time: 47013.839 ms" 

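As far as I can see, the subquery is causing the problem: the whole join result is sorted, and the sort spills to disk and uses a huge amount of space. I cannot figure out how to optimize it.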
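One way to check where the blow-up comes from is to count how many rows share each join key. In the plan above, 15,044 filtered rows from a turn into 18,356,220 joined rows, which suggests that some (surname_clean, id2) groups are very large. The query below is only a diagnostic sketch; the output column names rows_in_group, filtered_rows and joined_rows are made up for illustration.

```sql
-- Count rows per join key. In a self-join, each group contributes
-- (rows passing the a-side filter) * (all rows in the group) output rows.
SELECT surname_clean,
       id2,
       count(*)                                           AS rows_in_group,
       count(*) FILTER (WHERE id1 IS NULL)                AS filtered_rows,
       count(*) * count(*) FILTER (WHERE id1 IS NULL)     AS joined_rows
FROM public.subject_subset
WHERE id2 IS NOT NULL
  AND surname_clean IS NOT NULL
GROUP BY surname_clean, id2
ORDER BY joined_rows DESC
LIMIT 20;
```

If a handful of groups account for most of joined_rows, the sort is dominated by those keys, and reducing each group to a single row before joining should shrink the input to the sort considerably.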

The only thing that improved performance slightly was assigning new integer IDs with dense_rank, but it is not enough.
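For reference, the dense_rank step mentioned above might look roughly like the following. This is a sketch under assumptions, not the exact code that was used: the column name s_id_new_int is hypothetical, and ranking by s_id_new is assumed so that the integer order matches the text order that max() would use.

```sql
-- Hypothetical integer surrogate for s_id_new (dense_rank returns bigint).
ALTER TABLE subject_subset ADD COLUMN s_id_new_int bigint;

-- Equal s_id_new values receive the same rank, so DISTINCT leaves
-- exactly one (s_id_new, rnk) pair per distinct ID.
UPDATE subject_subset AS t
SET s_id_new_int = r.rnk
FROM (SELECT DISTINCT
             s_id_new,
             dense_rank() OVER (ORDER BY s_id_new) AS rnk
      FROM subject_subset
      WHERE s_id_new IS NOT NULL) AS r
WHERE t.s_id_new = r.s_id_new;
```

Comparing and sorting integers is cheaper than comparing varchar values, which may explain the small gain, but it does not reduce the number of rows produced by the join.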


It would help if you explained in words what this particular query is trying to accomplish. Otherwise we have to guess the task from the query. – 2017-06-02 12:49:20


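The query is meant to deduplicate subjects (companies and natural persons) by assigning them the same ID. Two John Smiths with the same document ID have different IDs (s_id) in the database -> the code assigns them a new ID = the maximum of the s_id values they currently have. Sometimes auxiliary data (address, phone, etc.) is used for deduplication, but the idea stays the same. – Dominix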

Answers


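The big sort is killing you.

I have three suggestions:

  1. Run ANALYZE subject_subset to collect table statistics. Statistics are not collected automatically for temporary tables, and the estimates in your case are quite off.

    Maybe that is already enough to make it better!

  2. Try an index on (id2, surname_clean, s_id_new), which would help with a nested loop join (not sure if that is faster).

    You could try a lateral join like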

    SELECT a.s_id_new, 
         max(b.i_s_id) AS s_id_deduplicated 
    FROM subject_subset a 
        CROSS JOIN LATERAL (SELECT s_id_new AS i_s_id 
             FROM subject_subset 
             WHERE a.surname_clean = surname_clean 
             AND a.id2 = id2 
             ORDER BY s_id_new DESC 
             LIMIT 1 
            ) b 
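    -- same filter on a as in the original query
    WHERE a.id1 IS NULL 
        AND a.id2 IS NOT NULL 
        AND a.surname_clean IS NOT NULL 
    GROUP BY a.s_id_new; 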
    

    The nested loop join will be expensive, but the sort will be faster.

  3. Stick with a hash join, but reduce the number of rows:

    SELECT a.s_id_new, 
         max(b.i_s_id) AS s_id_deduplicated 
    FROM subject_subset a 
        JOIN (SELECT surname_clean, id2, 
           max(s_id_new) AS i_s_id 
         FROM subject_subset 
         GROUP BY surname_clean, id2 
         ) b 
         USING (surname_clean, id2) 
    WHERE a.id1 IS NULL 
        AND a.id2 IS NOT NULL 
        AND a.surname_clean IS NOT NULL 
    GROUP BY a.s_id_new; 
    

    Maybe an index on (surname_clean, id2) can help, not sure.
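The three suggestions can be tried in sequence roughly as follows. This is a sketch rather than a tested recipe: the index names are hypothetical, and whether either index is actually used depends on the data and the plan PostgreSQL picks.

```sql
-- Suggestion 1: refresh planner statistics for the temporary table.
ANALYZE subject_subset;

-- Suggestion 2: index for the lateral / nested loop variant
-- (index name is hypothetical).
CREATE INDEX subject_subset_id2_surname_new_idx
    ON subject_subset (id2, surname_clean, s_id_new);

-- Suggestion 3: optional index on the grouping columns
-- (index name is hypothetical).
CREATE INDEX subject_subset_surname_id2_idx
    ON subject_subset (surname_clean, id2);

-- Re-check the plan of the row-reducing variant. Each (surname_clean, id2)
-- group is collapsed to one row before the join, so the join output can
-- no longer exceed the number of filtered rows in a.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.s_id_new,
       max(b.i_s_id) AS s_id_deduplicated
FROM subject_subset a
JOIN (SELECT surname_clean, id2,
             max(s_id_new) AS i_s_id
      FROM subject_subset
      GROUP BY surname_clean, id2) b
     USING (surname_clean, id2)
WHERE a.id1 IS NULL
  AND a.id2 IS NOT NULL
  AND a.surname_clean IS NOT NULL
GROUP BY a.s_id_new;
```

In the new plan, the row estimates should be much closer to the actual counts than in the original plan (310 estimated vs. 18,356,220 actual), and the Sort node, if one is still present, should no longer report an external merge on disk.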