2010-06-08 72 views

IN statement performance in PostgreSQL (and in general)

I know this has probably been asked before, but I couldn't find it with SO's search.

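Let's say I have TABLE1 and TABLE2. How should I expect the performance of a query such as this:

SELECT * FROM TABLE1 WHERE id IN SUBQUERY_ON_TABLE2; 

to go down as the number of rows in TABLE1 and TABLE2 grows, given that id is the primary key on TABLE1?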

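Yes, I know using IN is such a n00b mistake, but TABLE2 has a generic relation (a Django generic relation) to multiple other tables, so I can't think of another way to filter the data. At what (approximate) number of rows in TABLE1 and TABLE2 should I expect to notice performance issues because of this? Will performance degrade linearly, exponentially, etc., depending on the number of rows?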

Answer


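When the number of records returned by the subquery is small, and the number of rows returned by the main query is also small, you'll just get fast index lookups on each table. As the percentage of data returned gets higher, each of those will eventually switch from an Index Scan to a Sequential Scan, grabbing the whole table in one gulp rather than piecing it together. It's not a simple decline in performance that is either linear or exponential; there are major discontinuities as the type of plan changes. The row counts where those changes happen depend on the size of the tables, so there are no useful rules of thumb there either. You should build a simulation like the one I'm doing below and see what happens on your own data set to get an idea of what the curve looks like.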

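Here's an example of how this works, using a PostgreSQL 9.0 database loaded with the Dell Store 2 database. Once the subquery returns 1000 rows, the main table (customers) gets a full table scan. Once the subquery is considering 10,000 records, the subquery's table (orders) turns into a full table scan too. Each query was run twice, so you're seeing the cached performance. How performance changes between the cached and uncached states is a whole 'nother topic: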

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
    (SELECT customerid FROM orders WHERE orderid<2); 
Nested Loop (cost=8.27..16.56 rows=1 width=268) (actual time=0.051..0.060 rows=1 loops=1) 
    -> HashAggregate (cost=8.27..8.28 rows=1 width=4) (actual time=0.028..0.030 rows=1 loops=1) 
     -> Index Scan using orders_pkey on orders (cost=0.00..8.27 rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1) 
       Index Cond: (orderid < 2) 
    -> Index Scan using customers_pkey on customers (cost=0.00..8.27 rows=1 width=268) (actual time=0.013..0.016 rows=1 loops=1) 
     Index Cond: (customers.customerid = orders.customerid) 
Total runtime: 0.191 ms 

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
    (SELECT customerid FROM orders WHERE orderid<100); 
Nested Loop (cost=10.25..443.14 rows=100 width=268) (actual time=0.488..2.591 rows=98 loops=1) 
    -> HashAggregate (cost=10.25..11.00 rows=75 width=4) (actual time=0.464..0.661 rows=98 loops=1) 
     -> Index Scan using orders_pkey on orders (cost=0.00..10.00 rows=100 width=4) (actual time=0.019..0.218 rows=99 loops=1) 
       Index Cond: (orderid < 100) 
    -> Index Scan using customers_pkey on customers (cost=0.00..5.75 rows=1 width=268) (actual time=0.009..0.011 rows=1 loops=98) 
     Index Cond: (customers.customerid = orders.customerid) 
Total runtime: 2.868 ms 

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
    (SELECT customerid FROM orders WHERE orderid<1000); 
Hash Semi Join (cost=54.25..800.13 rows=1000 width=268) (actual time=4.574..80.319 rows=978 loops=1) 
    Hash Cond: (customers.customerid = orders.customerid) 
    -> Seq Scan on customers (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..33.665 rows=20000 loops=1) 
    -> Hash (cost=41.75..41.75 rows=1000 width=4) (actual time=4.502..4.502 rows=999 loops=1) 
     Buckets: 1024 Batches: 1 Memory Usage: 24kB 
     -> Index Scan using orders_pkey on orders (cost=0.00..41.75 rows=1000 width=4) (actual time=0.056..2.487 rows=999 loops=1) 
       Index Cond: (orderid < 1000) 
Total runtime: 82.024 ms 

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
    (SELECT customerid FROM orders WHERE orderid<10000); 
Hash Join (cost=443.68..1444.68 rows=8996 width=268) (actual time=79.576..157.159 rows=7895 loops=1) 
    Hash Cond: (customers.customerid = orders.customerid) 
    -> Seq Scan on customers (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..27.085 rows=20000 loops=1) 
    -> Hash (cost=349.97..349.97 rows=7497 width=4) (actual time=79.532..79.532 rows=7895 loops=1) 
     Buckets: 1024 Batches: 1 Memory Usage: 186kB 
     -> HashAggregate (cost=275.00..349.97 rows=7497 width=4) (actual time=45.130..62.227 rows=7895 loops=1) 
       -> Seq Scan on orders (cost=0.00..250.00 rows=10000 width=4) (actual time=0.008..20.979 rows=9999 loops=1) 
        Filter: (orderid < 10000) 
Total runtime: 167.882 ms
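
To repeat this kind of experiment on your own data, it can help to script the sweep so the point where the plan changes is easy to spot. The following is only a minimal sketch in Python, not part of the original test: the names `QUERY`, `summarize_plan`, `sweep`, and the `run_explain` callable are all hypothetical, and the query template simply reuses the table and column names from the Dell Store 2 example above. `run_explain` is assumed to take a SQL string and return the EXPLAIN ANALYZE output as a list of text lines; how you connect to the database is left to you.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical query template; swap in your own tables and filter column.
QUERY = (
    "EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN "
    "(SELECT customerid FROM orders WHERE orderid < {limit})"
)

# PostgreSQL 9.0 prints "Total runtime"; the alternative label is an
# assumption about how newer releases name the same figure.
RUNTIME_RE = re.compile(
    r"(?:Total runtime|Execution time): ([\d.]+) ms", re.IGNORECASE
)


def summarize_plan(plan_lines: List[str]) -> Tuple[str, bool, Optional[float]]:
    """Return (top plan node, whether any Seq Scan appears, runtime in ms)."""
    # The first line looks like "Nested Loop (cost=...)"; keep the node name.
    top_node = plan_lines[0].split("(")[0].strip()
    uses_seq_scan = any("Seq Scan" in line for line in plan_lines)
    runtime = None
    for line in plan_lines:
        match = RUNTIME_RE.search(line)
        if match:
            runtime = float(match.group(1))
    return top_node, uses_seq_scan, runtime


def sweep(
    run_explain: Callable[[str], List[str]], limits: Iterable[int]
) -> List[Tuple[int, str, bool, Optional[float]]]:
    """Run the query once per limit and collect one summary row per run."""
    results = []
    for limit in limits:
        plan_lines = run_explain(QUERY.format(limit=limit))
        results.append((limit,) + summarize_plan(plan_lines))
    return results


if __name__ == "__main__":
    # Two lines copied from the orderid<1000 plan above, used as canned input.
    sample = [
        "Hash Semi Join (cost=54.25..800.13 rows=1000 width=268) "
        "(actual time=4.574..80.319 rows=978 loops=1)",
        "Total runtime: 82.024 ms",
    ]
    print(summarize_plan(sample))
```

With a real connection, `run_explain` could execute the statement through your database driver and return the first column of each result row. Run each limit at least twice if you want cached timings comparable to the ones above, then look for the limit at which the top node or the Seq Scan flag changes.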