2014-02-23

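PostgreSQL index not used

I have a very odd dataset: most rows in matches have no related rows in the large frames table at all, but the ones that do have on the order of a hundred thousand each. I want to select only the records that have data, but I am running into problems with index usage. I know you usually cannot "force" PostgreSQL to use a particular index, but in this case forcing it does work.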

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) > 0 ORDER BY count(frames.id) DESC; 
id | count 
----+-------- 
31 | 123363 
28 | 121475 
24 | 110155 
21 | 108258 
22 | 106837 
25 | 89182 
26 | 87104 
27 | 86152 
(8 rows) 

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) = 0 ORDER BY count(frames.id) DESC; 
.... 
(568 rows) 

The two solutions I found are:

SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1); 
Time: 11697,645 ms 


or 

SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id" 
Time: 879,325 ms 

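Neither query seems to use the index on match_id in the frames table. That is understandable, since the index is usually not very selective, but unfortunately here it would be very helpful, as: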

SET enable_seqscan = OFF; 
SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1); 
Time: 1,239 ms 
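
The variants timed above differ only in how they test for the existence of frames, not in which matches they return. A minimal sketch of that equivalence follows, using Python's built-in sqlite3 module as a stand-in. That is an assumption made so the example is self-contained; it says nothing about PostgreSQL's planner or timings, and the table contents are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption:
# only the result sets are compared, not plans or timings).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (id INTEGER PRIMARY KEY);
    CREATE TABLE frames (id INTEGER PRIMARY KEY, match_id INTEGER);
    CREATE INDEX index_frames_on_match_id ON frames (match_id);
""")

# Made-up data: matches 1..10, but only matches 3 and 7 have frames.
conn.executemany("INSERT INTO matches (id) VALUES (?)",
                 [(i,) for i in range(1, 11)])
conn.executemany("INSERT INTO frames (match_id) VALUES (?)",
                 [(3,)] * 500 + [(7,)] * 300)

QUERIES = {
    "having": """
        SELECT matches.id FROM matches
        LEFT JOIN frames ON frames.match_id = matches.id
        GROUP BY matches.id HAVING count(frames.id) > 0
    """,
    "exists": """
        SELECT matches.id FROM matches
        WHERE EXISTS (SELECT 1 FROM frames
                      WHERE frames.match_id = matches.id LIMIT 1)
    """,
    "distinct_join": """
        SELECT DISTINCT matches.id FROM matches
        INNER JOIN frames ON frames.match_id = matches.id
    """,
}

def matches_with_frames(name):
    """Run one formulation and return the sorted list of match ids."""
    return sorted(row[0] for row in conn.execute(QUERIES[name]))

results = {name: matches_with_frames(name) for name in QUERIES}
print(results)
```

Since all three formulations select the same matches, the choice between them comes down to which plan the optimizer produces for each.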

Explain plans for the queries:

EXPLAIN for: SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id" 

           QUERY PLAN 
----------------------------------------------------------------------------- 
HashAggregate (cost=59253.47..59256.38 rows=290 width=155) 
    -> Hash Join (cost=6.26..33716.73 rows=785746 width=155) 
     Hash Cond: (frames.match_id = matches.id) 
     -> Seq Scan on frames (cost=0.00..22906.46 rows=785746 width=4) 
     -> Hash (cost=4.45..4.45 rows=145 width=155) 
       -> Seq Scan on matches (cost=0.00..4.45 rows=145 width=155) 
(6 rows) 

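EXPLAIN for: SELECT "matches".* FROM "matches" WHERE (EXISTS (SELECT id FROM frames WHERE frames.match_id = matches.id LIMIT 1))

           QUERY PLAN
-----------------------------------------------------------------------------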
Seq Scan on matches (cost=0.00..41.17 rows=72 width=155) 
    Filter: (SubPlan 1) 
    SubPlan 1 
    -> Limit (cost=0.00..0.25 rows=1 width=4)                              
     -> Seq Scan on frames (cost=0.00..24870.83 rows=98218 width=4)                       
       Filter: (match_id = matches.id)                              
(6 rows)

SET enable_seqscan = OFF;

EXPLAIN SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);

           QUERY PLAN
-----------------------------------------------------------------------------
Seq Scan on matches (cost=10000000000.00..10000000118.37 rows=72 width=155) 
    Filter: (SubPlan 1) 
    SubPlan 1 
    -> Limit (cost=0.00..0.79 rows=1 width=0) 
      -> Index Scan using index_frames_on_match_id on frames (cost=0.00..81762.68 rows=104066 width=0) 
       Index Cond: (match_id = matches.id) 
(6 rows)

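Any suggestions on how to tweak the query so that it uses the index here? Or is there perhaps another way to check for the existence of records that would run closer to the 1 ms I get with the index than to the 11 s I get without it?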

PS. I did run ANALYZE and VACUUM ANALYZE, all the steps usually suggested to improve index usage.

EDIT: Thanks to David Aldridge for pointing out that LIMIT 1 might actually be hindering the query planner. Now I have:

SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id); 
Time: 163,803 ms 

With the plan:

EXPLAIN SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id); 
            QUERY PLAN          
------------------------------------------------------------------------------------ 
Nested Loop (cost=25455.58..25457.90 rows=8 width=155) 
    -> HashAggregate (cost=25455.58..25455.66 rows=8 width=4) 
     -> Seq Scan on frames (cost=0.00..23374.26 rows=832526 width=4) 
    -> Index Scan using matches_pkey on matches (cost=0.00..0.27 rows=1 width=155) 
     Index Cond: (id = frames.match_id) 
(5 rows) 

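That is still about 100x slower than the index-only version (probably because the Seq Scan on frames and the HashAggregate are still executed).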
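
One way to see why this plan is still slower than the index-driven one: the HashAggregate has to read every row of frames to build the set of distinct match_id values, whereas the plan obtained with enable_seqscan = OFF probes the index once per row of matches and stops at the first hit. The toy model below counts frame rows read under each strategy. It is only an illustration with made-up data and hypothetical function names, not a description of PostgreSQL internals.

```python
from collections import defaultdict

# Made-up data: 100 matches, only two of which have frames (many each).
matches = list(range(1, 101))
frames = [(frame_id, 3) for frame_id in range(50_000)] + \
         [(frame_id, 7) for frame_id in range(50_000, 80_000)]

def scan_and_aggregate(matches, frames):
    """Read every frame, collect distinct match_ids, then look up matches."""
    frames_read = 0
    seen = set()
    for _frame_id, match_id in frames:
        frames_read += 1
        seen.add(match_id)
    return [m for m in matches if m in seen], frames_read

def probe_index(matches, index):
    """For each match, probe the index and stop at the first frame found."""
    frames_read = 0
    found = []
    for m in matches:
        bucket = index.get(m, [])
        if bucket:
            frames_read += 1  # the first entry is enough to prove existence
            found.append(m)
    return found, frames_read

# Stand-in for index_frames_on_match_id: match_id -> list of frame ids.
index = defaultdict(list)
for frame_id, match_id in frames:
    index[match_id].append(frame_id)

scan_ids, scan_read = scan_and_aggregate(matches, frames)
probe_ids, probe_read = probe_index(matches, index)
print(scan_ids, scan_read)
print(probe_ids, probe_read)
```

Both strategies return the same matches, but the first reads every frame while the second reads one frame per match that has any. Because matches is tiny and frames is large, that is the gap the timings above reflect.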

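Which version of PostgreSQL are you using?

9.1. I just tested on 9.3, and there the query without LIMIT 1 uses the index properly. It looks like LIMIT 1 was messing everything up, and the query optimizer improved a lot between 9.1 and 9.3.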

Answer

The LIMIT clause is redundant in the EXISTS-based alternative, and it may not be helping the optimizer.

Try:

SELECT "matches".* 
FROM "matches" 
WHERE EXISTS (SELECT 1 
       FROM frames 
       WHERE frames.match_id = matches.id); 
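
EXISTS is satisfied by the first matching row, so adding LIMIT 1 inside it cannot change the result. A small check of that claim, using SQLite through Python's sqlite3 module purely as a self-contained stand-in (an assumption; the planner behaviour discussed here is specific to PostgreSQL), with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (id INTEGER PRIMARY KEY);
    CREATE TABLE frames (id INTEGER PRIMARY KEY, match_id INTEGER);
    INSERT INTO matches (id) VALUES (1), (2), (3), (4);
    INSERT INTO frames (match_id) VALUES (2), (2), (2), (4);
""")

# Same query with and without LIMIT inside the EXISTS subquery.
TEMPLATE = """
    SELECT matches.id FROM matches
    WHERE EXISTS (SELECT 1 FROM frames
                  WHERE frames.match_id = matches.id {limit})
    ORDER BY matches.id
"""

with_limit = [r[0] for r in conn.execute(TEMPLATE.format(limit="LIMIT 1"))]
without_limit = [r[0] for r in conn.execute(TEMPLATE.format(limit=""))]
print(with_limit, without_limit)
```

The two result lists are identical, so dropping LIMIT 1 only affects how the query is planned, not what it returns.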
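
You're right. LIMIT 1 was definitely hindering the query. It still takes about 100 ms (100x slower than with seqscan turned off), but that is much faster than the previous queries.

What execution plan do you get?

Updated the question with the new data.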