2015-10-18 546 views
1

I have a table, parameters_products, with about 300k records. Is it possible to optimize this query? How can I optimize a PostgreSQL COUNT ... GROUP BY query?

SELECT parameter_id AS id, 
     COUNT(product_id) AS COUNT 
FROM "parameters_products" 
WHERE product_id IN 
    (SELECT product_id 
    FROM parameters_products 
    WHERE parameter_id IN ('2')) 
GROUP BY parameter_id 

Query output:

2;274669 

EXPLAIN ANALYZE VERBOSE ... output:

HashAggregate (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1) 
    Output: parameters_products.parameter_id, count(parameters_products.product_id) 
    Group Key: parameters_products.parameter_id 
    -> Hash Semi Join (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1) 
     Output: parameters_products.parameter_id, parameters_products.product_id 
     Hash Cond: (parameters_products.product_id = parameters_products_1.product_id) 
     -> Seq Scan on public.parameters_products (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1) 
       Output: parameters_products.parameter_id, parameters_products.product_id 
     -> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1) 
       Output: parameters_products_1.product_id 
       Buckets: 16384 Batches: 4 Memory Usage: 2425kB 
       -> Seq Scan on public.parameters_products parameters_products_1 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1) 
        Output: parameters_products_1.product_id 
        Filter: (parameters_products_1.parameter_id = 2) 
        Rows Removed by Filter: 25059 
Planning time: 0.279 ms 
Execution time: 2231.499 ms 

This is PostgreSQL 9.4.1, with vacuum enabled.

I also just tried this query, but it is too slow as well:

SELECT pp1.parameter_id, 
     count(pp1.product_id) 
FROM parameters_products pp1 
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id 
WHERE pp2.parameter_id IN (2) 
GROUP BY pp1.parameter_id 

-

HashAggregate (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1) 
    Output: pp1.parameter_id, count(pp1.product_id) 
    Group Key: pp1.parameter_id 
    -> Hash Join (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1) 
     Output: pp1.parameter_id, pp1.product_id 
     Hash Cond: (pp1.product_id = pp2.product_id) 
     -> Seq Scan on public.parameters_products pp1 (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1) 
       Output: pp1.parameter_id, pp1.product_id 
     -> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1) 
       Output: pp2.product_id 
       Buckets: 16384 Batches: 4 Memory Usage: 2425kB 
       -> Seq Scan on public.parameters_products pp2 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1) 
        Output: pp2.product_id 
        Filter: (pp2.parameter_id = 2) 
        Rows Removed by Filter: 25059 
Planning time: 0.135 ms 
Execution time: 2361.735 ms 

Indexes:

CREATE INDEX parameters_products_parameter_id_idx 
    ON parameters_products 
    USING btree 
    (parameter_id); 

CREATE INDEX parameters_products_product_id_idx 
    ON parameters_products 
    USING btree 
    (product_id); 

CREATE INDEX parameters_products_product_id_parameter_id_idx 
    ON parameters_products 
    USING btree 
    (product_id, parameter_id); 

EXPLAIN ANALYZE VERBOSE 
SELECT pp1.parameter_id 
FROM parameters_products pp1 
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id 

-

Hash Left Join (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1) 
    Output: pp1.parameter_id 
    Hash Cond: (pp1.product_id = pp2.product_id) 
    -> Seq Scan on public.parameters_products pp1 (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1) 
     Output: pp1.parameter_id, pp1.product_id 
    -> Hash (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1) 
     Output: pp2.product_id 
     Buckets: 16384 Batches: 4 Memory Usage: 2644kB 
     -> Seq Scan on public.parameters_products pp2 (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1) 
       Output: pp2.product_id 
Planning time: 0.472 ms 
Execution time: 2392.582 ms 

SET enable_seqscan = OFF; 

This reduced the execution time, but not significantly.

+1

Replace 'WHERE IN' with 'JOIN'. – lad2025

+1

@lad2025 Execution time: 2361.735 ms – nanolab

+0

BTW: 'WHERE parameter_id IN ('2'))' – wildplasser

Answers

2

The first thing I would try is replacing IN with EXISTS:

SELECT parameter_id AS id, 
     COUNT(product_id) AS COUNT 
FROM parameters_products pp 
WHERE EXISTS (SELECT 1 
       FROM parameters_products pp2 
       WHERE pp2.product_id = pp.product_id AND 
        pp2.parameter_id = 2 
      ) 
GROUP BY parameter_id; 

Also, be sure to have an index on parameters_products(product_id, parameter_id).
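
Whether such an index can help depends on how selective the `parameter_id = 2` filter is. A minimal sketch of a check, using only the table and column names from the question (the aliases `total_rows` and `matching_rows` are made up for illustration):

```sql
-- Compare the number of rows matching the filter with the table size.
-- count() ignores NULLs, so the CASE expression counts only matching rows.
SELECT count(*)                                      AS total_rows,
       count(CASE WHEN parameter_id = 2 THEN 1 END)  AS matching_rows
FROM parameters_products;
```

In the plans posted in the question, the filter keeps 274669 of 299728 rows, so most of the table has to be read either way and a sequential scan is the choice one would expect from the planner.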

Another idea is to use window functions:

select parameter_id, count(*) 
from (select pp.*, 
      sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2 
     from parameters_products pp 
    ) pp 
where cnt2 > 0 
group by parameter_id; 
+0

The index already exists: CREATE INDEX parameters_products_product_id_parameter_id_idx ON parameters_products USING btree (product_id, parameter_id); For the first query "Execution time: 2239.944 ms", for the second "Execution time: 2526.269 ms" – nanolab

+0

You could try an index with the columns in the opposite order: '... parameters_products(parameter_id, product_id)'. – wildplasser

+0

@wildplasser Still the same time – nanolab

1

Try:

SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT 
FROM parameters_products pp1 
JOIN 
    parameters_products pp2 
ON 
    pp2.parameter_id = 2 
AND 
    pp1.product_id = pp2.product_id 
GROUP BY 
    pp1.parameter_id 

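Moving the filter condition from your WHERE clause to the ON clause reduces the total number of rows participating in the JOIN. Hopefully this brings the execution time you reported in the comments to under 1 second.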
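
One caveat worth checking: an inner join returns one output row per matching pair, so it reproduces the result of the original IN subquery only if no product has more than one row with `parameter_id = 2`. A small sketch of that check, assuming nothing beyond the table from the question (the alias `n` is made up for illustration):

```sql
-- Products that have more than one row with parameter_id = 2.
-- If this returns any rows, the JOIN version counts those products' rows
-- several times, while the IN / EXISTS versions count them once.
SELECT product_id, count(*) AS n
FROM parameters_products
WHERE parameter_id = 2
GROUP BY product_id
HAVING count(*) > 1;
```

An empty result suggests the join and the semi-join forms should return the same counts on this data.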

+0

Sorry, but this query is wrong. The JOIN has no effect. The result will be the same as: SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT FROM parameters_products pp1 GROUP BY pp1.parameter_id – nanolab

+0

@nanolab I have updated my answer to use an inner join instead of a left join, so that it accurately reproduces the result of the WHERE clause. My apologies. – AlVaz

+0

Yes, it is correct now. But "Execution time: 2249.975 ms". It seems it cannot be optimized. – nanolab

0

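RhodiumToad in #postgresql on freenode suggested using a window function as shown below. Note that this differs from Gordon Linoff's window function in that it uses bool_or instead of sum(case ...):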

SELECT parameter_id, count(product_id) 
FROM 
    (SELECT *, bool_or(parameter_id = 2) 
    OVER 
    (partition by product_id) AS matching 
    FROM parameters_products) s 
WHERE matching 
GROUP BY parameter_id; 
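
Before comparing timings, it can be worth confirming that a rewrite returns the same rows as the original query. The following is only a sketch of one way to do that: the CTE names `original` and `rewritten` and the alias `n` are made up for illustration, and it assumes `product_id` is never NULL.

```sql
WITH original AS (
    -- The IN-subquery form from the question.
    SELECT parameter_id, count(product_id) AS n
    FROM parameters_products
    WHERE product_id IN (SELECT product_id
                         FROM parameters_products
                         WHERE parameter_id = 2)
    GROUP BY parameter_id
),
rewritten AS (
    -- The bool_or window-function form.
    SELECT parameter_id, count(product_id) AS n
    FROM (SELECT *,
                 bool_or(parameter_id = 2)
                     OVER (PARTITION BY product_id) AS matching
          FROM parameters_products) s
    WHERE matching
    GROUP BY parameter_id
)
-- Rows present in one result but not in the other, in both directions.
(SELECT * FROM original EXCEPT SELECT * FROM rewritten)
UNION ALL
(SELECT * FROM rewritten EXCEPT SELECT * FROM original);
```

An empty result means the two forms agree on this data; any returned row points at a group whose count differs.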

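RhodiumToad also mentioned that the work_mem setting may be too small for any query of this size, whether it uses window functions, joins, or subqueries. He suggested increasing work_mem to avoid spilling to disk.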
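
A minimal sketch of how one might test this for a single query without changing the server configuration. The value '64MB' is an arbitrary assumption, not a recommendation, and `SET LOCAL` only takes effect inside a transaction block:

```sql
BEGIN;

-- Applies only until the end of this transaction.
SET LOCAL work_mem = '64MB';  -- assumed value, adjust to the available memory

-- BUFFERS adds shared/temp block counts to the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT parameter_id AS id,
       count(product_id) AS count
FROM parameters_products
WHERE product_id IN (SELECT product_id
                     FROM parameters_products
                     WHERE parameter_id = 2)
GROUP BY parameter_id;

ROLLBACK;
```

The plans in the question show "Buckets: 16384 Batches: 4" on the Hash node, which indicates the hash table was split into several batches. If the setting is large enough, that line would be expected to report "Batches: 1".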

If any of this helps you, all credit goes to RhodiumToad.

+0

I used SET LOCAL work_mem = '500MB'; but the query is even slower than the others: "Execution time: 3079.735 ms" – nanolab

+0

@nanolab Could you post the EXPLAIN ANALYZE output? – AlVaz

+0

http://pastebin.com/raw.php?i=rySnHQh8 – nanolab