2012-03-28 90 views
9

Possible duplicates:
PostgreSQL - max number of parameters in "IN" clause?
Is it reasonable to put 1000 IDs into a SELECT ... WHERE ... IN (...) query in Postgres?

I'm building a web API that performs RESTful queries against a resource that maps nicely onto a Postgres table. Most of the filter parameters also map nicely onto parameters of the SQL query. A few filter parameters, however, require a call to my search index (in this case, a Sphinx server).

The simplest thing to do is run my search, collect the primary keys from the search results, and stuff those into an IN (...) clause of the SQL query. However, since the search could return a lot of primary keys, I wonder whether this is such a bright idea.
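For what it's worth, when the key list comes from application code, many Postgres drivers let you avoid splicing thousands of literals into the statement: bind the whole list as a single array parameter and compare with `id = ANY(...)`. A minimal, hypothetical sketch (the `resources` table name and the psycopg2-style `%s` placeholder are assumptions, not part of the question):

```python
# Hypothetical sketch: pass the primary keys returned by the search
# server to Postgres as one bound array value, using id = ANY(%s)
# (psycopg2-style placeholder) instead of a literal IN (...) list.
def build_query(search_result_pks):
    """Build a parameterized query plus its parameter tuple."""
    sql = "SELECT * FROM resources WHERE id = ANY(%s)"
    params = ([int(pk) for pk in search_result_pks],)
    return sql, params

sql, params = build_query(["17", "42", "99"])
```

The driver then ships the keys as one bound value, so statement size and parse time stay constant regardless of how many keys the search returns.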

I anticipate that most of the time (say, 90%) my search will return a few hundred results. Perhaps 10% of the time, there will be several thousand results.

Is this a reasonable approach? Is there a better way?


Perhaps, but probably not entirely. That question asks for the maximum size of an IN clause; I'm asking what's reasonable. Or is that too subjective? I'll look into restating the question. – 2012-03-28 17:12:05


Couldn't you do WHERE ... IN (non_indexed_param1, non_indexed_param2, ...) or ... IN (SELECT ... FROM search_index WHERE ...) instead of using two separate queries? – beny23 2012-03-28 16:40:34


Perhaps I was too loose with my language. By "search index" I mean a search server, in this case Sphinx. I've clarified this in the question. – 2012-03-28 17:08:10

Answers

14

I strongly favor an experimental approach to performance questions. @Catcall made a nice start, but his experiment is much smaller than many real databases. His 300,000 rows of single integers fit easily in memory, so no IO occurs; in addition, he didn't share his actual timings.

I composed a similar experiment, but sized the sample data at 7x the memory available on my host (a 7GB dataset on a 1GB single-CPU VM with an NFS-mounted filesystem). There are 30 million rows, each consisting of a single indexed bigint and a random-length string of 0 to 400 bytes.

create table t(id bigint primary key, stuff text); 
insert into t(id,stuff) select i, repeat('X',(random()*400)::integer) 
from generate_series(0,30000000) i; 
analyze t; 

The following explain-analyzes the runtimes of SELECT ... IN queries over sets of 10, 100, 1,000, 10,000, and 100,000 random integers in the key domain. Each query takes the form below, where $1 is replaced by the set count.

explain analyze 
select id from t 
where id in (
    select (random()*30000000)::integer from generate_series(0,$1) 
); 

Summary timings

  • ct, total ms, ms/row
  • 10, 84, 8.4
  • 100, 1185, 11.8
  • 1,000, 12407, 12.4
  • 10,000, 109747, 11.0
  • 100,000, 1016842, 10.1

Note that the plan stays the same for each IN-set cardinality: build a hash aggregate of the random integers, then loop and do an index lookup for each value. Fetch time is near-linear in the cardinality of the IN set, in the 8-12 ms/row range. A faster storage system could doubtless improve these times dramatically, but the experiment suggests that Pg handles very large sets in an IN clause with aplomb, at least from an execution-speed perspective. Note that if you supply the list via bound parameters or literal interpolation into the SQL statement, you will incur additional overhead transmitting the query to the server and extra parse time, though I suspect both are negligible compared with the IO time of executing the query.

# fetch 10 
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=0.110..84.494 rows=11 loops=1) 
    -> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.046..0.054 rows=11 loops=1) 
     -> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.036..0.039 rows=11 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=7.672..7.673 rows=1 loops=11) 
     Index Cond: (t.id = (((random() * 30000000::double precision))::integer)) 
Total runtime: 84.580 ms 


# fetch 100 
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=12.405..1184.758 rows=101 loops=1) 
    -> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.095..0.210 rows=101 loops=1) 
     -> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.046..0.067 rows=101 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=11.723..11.725 rows=1 loops=101) 
     Index Cond: (t.id = (((random() * 30000000::double precision))::integer)) 
Total runtime: 1184.843 ms 

# fetch 1,000 
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=14.403..12406.667 rows=1001 loops=1) 
    -> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.609..1.689 rows=1001 loops=1) 
     -> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.128..0.332 rows=1001 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.381..12.390 rows=1 loops=1001) 
     Index Cond: (t.id = (((random() * 30000000::double precision))::integer)) 
Total runtime: 12407.059 ms 

# fetch 10,000 
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=21.884..109743.854 rows=9998 loops=1) 
    -> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=5.761..18.090 rows=9998 loops=1) 
     -> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=1.004..3.087 rows=10001 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.968..10.972 rows=1 loops=9998) 
     Index Cond: (t.id = (((random() * 30000000::double precision))::integer)) 
Total runtime: 109747.169 ms 

# fetch 100,000 
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=110.244..1016781.944 rows=99816 loops=1) 
    -> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=110.169..253.947 rows=99816 loops=1) 
     -> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=51.141..77.482 rows=100001 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.176..10.181 rows=1 loops=99816) 
     Index Cond: (t.id = (((random() * 30000000::double precision))::integer)) 
Total runtime: 1016842.772 ms 
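The near-linearity claimed above can be checked directly from the summary timings: dividing each total time by its row count should give a roughly constant per-row cost. A small sketch using the numbers reported in the table:

```python
# Per-row cost derived from the summary timings above: if fetch time is
# linear in the IN-set cardinality, total_ms / row_count should stay
# roughly constant across four orders of magnitude.
timings_ms = {10: 84, 100: 1185, 1000: 12407, 10_000: 109_747, 100_000: 1_016_842}
per_row = {ct: ms / ct for ct, ms in timings_ms.items()}
# Every value lands in roughly the 8-13 ms/row band noted above.
```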

At @Catcall's request, I ran similar queries using a CTE and a temp table. Both approaches had comparably simple nested-loop index-scan plans and ran in comparable (though slightly slower) times to the inline IN queries.

-- CTE 
EXPLAIN analyze 
with ids as (select (random()*30000000)::integer as val from generate_series(0,1000)) 
select id from t where id in (select ids.val from ids); 

Nested Loop (cost=40.00..2351.27 rows=15002521 width=8) (actual time=21.203..12878.329 rows=1001 loops=1) 
    CTE ids 
    -> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.085..0.306 rows=1001 loops=1) 
    -> HashAggregate (cost=22.50..24.50 rows=200 width=4) (actual time=0.771..1.907 rows=1001 loops=1) 
     -> CTE Scan on ids (cost=0.00..20.00 rows=1000 width=4) (actual time=0.087..0.552 rows=1001 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.859..12.861 rows=1 loops=1001) 
     Index Cond: (t.id = ids.val) 
Total runtime: 12878.812 ms 
(8 rows) 

-- Temp table 
create table temp_ids as select (random()*30000000)::bigint as val from generate_series(0,1000); 

explain analyze select id from t where t.id in (select val from temp_ids); 

Nested Loop (cost=17.51..11585.41 rows=1001 width=8) (actual time=7.062..15724.571 rows=1001 loops=1) 
    -> HashAggregate (cost=17.51..27.52 rows=1001 width=8) (actual time=0.268..1.356 rows=1001 loops=1) 
     -> Seq Scan on temp_ids (cost=0.00..15.01 rows=1001 width=8) (actual time=0.007..0.080 rows=1001 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=15.703..15.705 rows=1 loops=1001) 
     Index Cond: (t.id = temp_ids.val) 
Total runtime: 15725.063 ms 

-- another way: join against the temp table instead of IN 
explain analyze select id from t join temp_ids on (t.id = temp_ids.val); 

Nested Loop (cost=0.00..24687.88 rows=2140 width=8) (actual time=22.594..16557.789 rows=1001 loops=1) 
    -> Seq Scan on temp_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.014..0.872 rows=1001 loops=1) 
    -> Index Scan using t_pkey on t (cost=0.00..11.51 rows=1 width=8) (actual time=16.536..16.537 rows=1 loops=1001) 
     Index Cond: (t.id = temp_ids.val) 
Total runtime: 16558.331 ms 

The temp table query ran very much faster when re-run, but that's because the set of id values was constant, so the target data was fresh in cache and Pg performed no real IO to execute it the second time.


Did you also test a join against the temp table and a join against a common table expression? – 2012-03-29 13:16:51


@Catcall Added runs with a CTE and a temp table to the answer. – dbenhur 2012-03-29 21:15:41

4

My somewhat naive tests show that using IN (...) was at least an order of magnitude faster than both a join against a temp table and a join against a common table expression. (Frankly, this surprised me.) I tested 3,000 integer values against a table of 300,000 rows.

create table integers (
    n integer primary key 
); 
insert into integers 
select generate_series(0, 300000); 

-- External ruby program generates 3000 random integers in the range of 0 to 299999. 
-- Used Emacs to massage the output into a SQL statement that looks like 

explain analyze 
select integers.n 
from integers where n in (
100109, 
100354 , 
100524 , 
... 
); 
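The external Ruby program isn't shown; a rough equivalent sketch in Python (names and the literal-splicing approach follow the test described above) might look like:

```python
import random

# Rough equivalent of the external program described above (hypothetical
# sketch): pick 3000 distinct random integers in [0, 299999] and splice
# them into the IN (...) literal list as the test does.
ids = random.sample(range(300_000), 3000)
query = "SELECT integers.n FROM integers WHERE n IN ({});".format(
    ", ".join(str(i) for i in ids)
)
```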

Interesting! Thanks for the test. – 2012-03-28 18:36:11

3

In reply to @Catcall's post: I couldn't resist testing it a second time. It's amazing! Rather counter-intuitive. The execution plans are similar (both queries use the implicit index) for SELECT ... IN ... and SELECT ... JOIN ... (execution plan screenshots omitted here).

CREATE TABLE integers (
    n integer PRIMARY KEY 
); 
INSERT INTO integers 
SELECT generate_series(0, 300000); 

CREATE TABLE search ( n integer); 

-- Generate INSERTS and SELECT ... WHERE ... IN (...) 
SELECT 'SELECT integers.n 
FROM integers WHERE n IN (' || list || ');', 
' INSERT INTO search VALUES ' 
|| values ||'; ' FROM (
SELECT string_agg(n::text, ',') AS list, string_agg('('||n::text||')', ',') AS values FROM (
SELECT n FROM integers ORDER BY random() LIMIT 3000) AS elements) AS raw 


INSERT INTO search VALUES (9155),(189177),(18815),(13027),... ; 

EXPLAIN SELECT integers.n 
FROM integers WHERE n IN (9155,189177,18815,13027,...); 

EXPLAIN SELECT integers.n FROM integers JOIN search ON integers.n = search.n; 