I resolutely favor experimental methods for answering performance questions. @Catcall made a nice start, but his experiment is much smaller than many real databases. His 300,000 single-integer rows fit easily in memory, so no IO occurs; in addition, he didn't share the actual numbers.
I built a similar experiment, but sized the sample data to be 7x the memory available on my host (a 7GB data set on a 1GB, single-CPU VM with an NFS-mounted filesystem). There are 30 million rows, each comprising a single indexed bigint and a random-length string between 0 and 400 bytes.
create table t(id bigint primary key, stuff text);
insert into t(id,stuff) select i, repeat('X',(random()*400)::integer)
from generate_series(0,30000000) i;
analyze t;
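As a sanity check on the premise that the data set exceeds RAM, the table's on-disk footprint can be verified with the standard size functions; this query is a sketch I'm adding for illustration, not part of the original experiment:

```sql
-- report the table's total on-disk size (heap + indexes + TOAST);
-- with ~30M rows averaging ~200 bytes of text this should be well above 1GB
select pg_size_pretty(pg_total_relation_size('t'));
```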
The following explain-analyzes the run time of a select-IN for sets of 10, 100, 1,000, 10,000 and 100,000 random integers drawn from the key domain. Each query takes the form below, with $1 substituted by the set count.
explain analyze
select id from t
where id in (
select (random()*30000000)::integer from generate_series(0,$1)
);
Summary timings
- ct, tot ms, ms/row
- 10, 84, 8.4
- 100, 1185, 11.8
- 1,000, 12407, 12.4
- 10,000, 109747, 11.0
- 100,000, 1016842, 10.1
Note that the plan stays the same for every IN-set cardinality: build a HashAggregate of the random integers, then loop over them and do one index lookup per value. Fetch time scales near-linearly with the cardinality of the IN set, in the 8-12 ms/row range. A faster storage system could no doubt improve these times dramatically, but the experiment shows that Pg handles very large sets in an IN clause with aplomb, at least from an execution-speed perspective. Note that if you supply the list via bind parameters or literal interpolation of the sql statement, you incur additional overhead transmitting the query to the server, plus increased parse time, though I suspect both are negligible compared to the IO time of executing the query.
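For comparison, the client-supplied form mentioned in the last sentence would look like the sketch below (the values are placeholders I made up); the planner typically hashes a long literal list the same way, but the entire list must first travel over the wire and be parsed:

```sql
-- literal-interpolated IN list: every value is shipped and parsed as part
-- of the statement text, unlike the server-side generate_series() above
select id from t
where id in (15123001, 29384756, 221847, 10938261 /* ... more literals */);
```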
# fetch 10
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=0.110..84.494 rows=11 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.046..0.054 rows=11 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.036..0.039 rows=11 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=7.672..7.673 rows=1 loops=11)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 84.580 ms
# fetch 100
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=12.405..1184.758 rows=101 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.095..0.210 rows=101 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.046..0.067 rows=101 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=11.723..11.725 rows=1 loops=101)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1184.843 ms
# fetch 1,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=14.403..12406.667 rows=1001 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.609..1.689 rows=1001 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.128..0.332 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.381..12.390 rows=1 loops=1001)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 12407.059 ms
# fetch 10,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=21.884..109743.854 rows=9998 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=5.761..18.090 rows=9998 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=1.004..3.087 rows=10001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.968..10.972 rows=1 loops=9998)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 109747.169 ms
# fetch 100,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=110.244..1016781.944 rows=99816 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=110.169..253.947 rows=99816 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=51.141..77.482 rows=100001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.176..10.181 rows=1 loops=99816)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1016842.772 ms
At @Catcall's request, I ran similar queries using a CTE and a temp table. Both approaches had comparably simple nested-loop index-scan plans and ran in comparable (though slightly slower) times than the inline IN queries.
-- CTE
EXPLAIN analyze
with ids as (select (random()*30000000)::integer as val from generate_series(0,1000))
select id from t where id in (select ids.val from ids);
Nested Loop (cost=40.00..2351.27 rows=15002521 width=8) (actual time=21.203..12878.329 rows=1001 loops=1)
CTE ids
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.085..0.306 rows=1001 loops=1)
-> HashAggregate (cost=22.50..24.50 rows=200 width=4) (actual time=0.771..1.907 rows=1001 loops=1)
-> CTE Scan on ids (cost=0.00..20.00 rows=1000 width=4) (actual time=0.087..0.552 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.859..12.861 rows=1 loops=1001)
Index Cond: (t.id = ids.val)
Total runtime: 12878.812 ms
(8 rows)
-- Temp table
create table temp_ids as select (random()*30000000)::bigint as val from generate_series(0,1000);
explain analyze select id from t where t.id in (select val from temp_ids);
Nested Loop (cost=17.51..11585.41 rows=1001 width=8) (actual time=7.062..15724.571 rows=1001 loops=1)
-> HashAggregate (cost=17.51..27.52 rows=1001 width=8) (actual time=0.268..1.356 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..15.01 rows=1001 width=8) (actual time=0.007..0.080 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=15.703..15.705 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 15725.063 ms
-- another way, using a join against the temp table instead of IN
explain analyze select id from t join temp_ids on (t.id = temp_ids.val);
Nested Loop (cost=0.00..24687.88 rows=2140 width=8) (actual time=22.594..16557.789 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.014..0.872 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.51 rows=1 width=8) (actual time=16.536..16.537 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 16558.331 ms
The temp table queries ran very much faster if run again, but that is because the set of id values was constant, so the target data was fresh in cache and Pg did no real IO on the second execution.
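A fairer repeat run would regenerate the id set each time, so repeated executions hit cold data rather than cached pages. A minimal sketch, reusing the `temp_ids` table defined above:

```sql
-- refill temp_ids with a fresh random set before each timed run,
-- so repeats don't simply re-read pages already in cache
truncate temp_ids;
insert into temp_ids
select (random()*30000000)::bigint from generate_series(0,1000);
explain analyze select id from t where t.id in (select val from temp_ids);
```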
Perhaps, but probably not entirely. The question asks for the maximum size of an IN clause. I'm asking: what is reasonable? Or is that too subjective? I'll look into restating the question. – 2012-03-28 17:12:05
Couldn't you do: where ... in (non_indexed_param1, non_indexed_param2, ...) or ... in (select ... from search_index where ...) instead of using two separate queries? – beny23 2012-03-28 16:40:34
Perhaps I was too loose with my language. By "search index" I meant a search server, in this case Sphinx. I've clarified that in the question. – 2012-03-28 17:08:10