I resolutely favor experimental methods for answering performance questions. @Catcall made a nice start, but his experiment is much smaller than many real databases. His 300,000 single-integer rows fit easily in memory, so no IO occurs; in addition, he didn't share the actual numbers.
I built a similar experiment, but sized the sample data to be 7x the memory available on my host (a 7GB data set on a 1GB, single-CPU VM with an NFS-mounted filesystem). There are 30 million rows, each comprising a single indexed bigint and a random-length string between 0 and 400 bytes.
create table t(id bigint primary key, stuff text);
insert into t(id,stuff) select i, repeat('X',(random()*400)::integer)
from generate_series(0,30000000) i;
analyze t;
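As a sanity check on the premise that the data set exceeds RAM, the table's on-disk footprint can be verified with the standard size functions; this query is a sketch I'm adding for illustration, not part of the original experiment:

```sql
-- report the table's total on-disk size (heap + indexes + TOAST);
-- with ~30M rows averaging ~200 bytes of text this should be well above 1GB
select pg_size_pretty(pg_total_relation_size('t'));
```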
The following explain-analyzes the run time of a select-IN for sets of 10, 100, 1,000, 10,000 and 100,000 random integers drawn from the key domain. Each query takes the form below, with $1 substituted by the set count.
explain analyze
select id from t
where id in (
select (random()*30000000)::integer from generate_series(0,$1)
);
Summary timings
- ct, tot ms, ms/row
- 10, 84, 8.4
- 100, 1185, 11.8
- 1,000, 12407, 12.4
- 10,000, 109747, 11.0
- 100,000, 1016842, 10.1
Note that the plan stays the same for every IN-set cardinality: build a HashAggregate of the random integers, then loop over them and do one index lookup per value. Fetch time scales near-linearly with the cardinality of the IN set, in the 8-12 ms/row range. A faster storage system could no doubt improve these times dramatically, but the experiment shows that Pg handles very large sets in an IN clause with aplomb, at least from an execution-speed perspective. Note that if you supply the list via bind parameters or literal interpolation of the sql statement, you incur additional overhead transmitting the query to the server, plus increased parse time, though I suspect both are negligible compared to the IO time of executing the query.
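For comparison, the client-supplied form mentioned in the last sentence would look like the sketch below (the values are placeholders I made up); the planner typically hashes a long literal list the same way, but the entire list must first travel over the wire and be parsed:

```sql
-- literal-interpolated IN list: every value is shipped and parsed as part
-- of the statement text, unlike the server-side generate_series() above
select id from t
where id in (15123001, 29384756, 221847, 10938261 /* ... more literals */);
```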
# fetch 10
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=0.110..84.494 rows=11 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.046..0.054 rows=11 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.036..0.039 rows=11 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=7.672..7.673 rows=1 loops=11)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 84.580 ms
# fetch 100
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=12.405..1184.758 rows=101 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.095..0.210 rows=101 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.046..0.067 rows=101 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=11.723..11.725 rows=1 loops=101)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1184.843 ms
# fetch 1,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=14.403..12406.667 rows=1001 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.609..1.689 rows=1001 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.128..0.332 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.381..12.390 rows=1 loops=1001)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 12407.059 ms
# fetch 10,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=21.884..109743.854 rows=9998 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=5.761..18.090 rows=9998 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=1.004..3.087 rows=10001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.968..10.972 rows=1 loops=9998)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 109747.169 ms
# fetch 100,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=110.244..1016781.944 rows=99816 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=110.169..253.947 rows=99816 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=51.141..77.482 rows=100001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.176..10.181 rows=1 loops=99816)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1016842.772 ms
At @Catcall's request, I ran similar queries using a CTE and a temp table. Both approaches had comparably simple nested-loop index-scan plans and ran in comparable (though slightly slower) times than the inline IN queries.
-- CTE
EXPLAIN analyze
with ids as (select (random()*30000000)::integer as val from generate_series(0,1000))
select id from t where id in (select ids.val from ids);
Nested Loop (cost=40.00..2351.27 rows=15002521 width=8) (actual time=21.203..12878.329 rows=1001 loops=1)
CTE ids
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.085..0.306 rows=1001 loops=1)
-> HashAggregate (cost=22.50..24.50 rows=200 width=4) (actual time=0.771..1.907 rows=1001 loops=1)
-> CTE Scan on ids (cost=0.00..20.00 rows=1000 width=4) (actual time=0.087..0.552 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.859..12.861 rows=1 loops=1001)
Index Cond: (t.id = ids.val)
Total runtime: 12878.812 ms
(8 rows)
-- Temp table
create table temp_ids as select (random()*30000000)::bigint as val from generate_series(0,1000);
explain analyze select id from t where t.id in (select val from temp_ids);
Nested Loop (cost=17.51..11585.41 rows=1001 width=8) (actual time=7.062..15724.571 rows=1001 loops=1)
-> HashAggregate (cost=17.51..27.52 rows=1001 width=8) (actual time=0.268..1.356 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..15.01 rows=1001 width=8) (actual time=0.007..0.080 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=15.703..15.705 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 15725.063 ms
-- another way, using a join against the temp table instead of IN
explain analyze select id from t join temp_ids on (t.id = temp_ids.val);
Nested Loop (cost=0.00..24687.88 rows=2140 width=8) (actual time=22.594..16557.789 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.014..0.872 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.51 rows=1 width=8) (actual time=16.536..16.537 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 16558.331 ms
The temp table queries ran very much faster if run again, but that is because the set of id values was constant, so the target data was fresh in cache and Pg did no real IO on the second execution.
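A fairer repeat run would regenerate the id set each time, so repeated executions hit cold data rather than cached pages. A minimal sketch, reusing the `temp_ids` table defined above:

```sql
-- refill temp_ids with a fresh random set before each timed run,
-- so repeats don't simply re-read pages already in cache
truncate temp_ids;
insert into temp_ids
select (random()*30000000)::bigint from generate_series(0,1000);
explain analyze select id from t where t.id in (select val from temp_ids);
```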
Perhaps, but probably not entirely. The question asks for the maximum size of an IN clause. I'm asking: what is reasonable? Or is that too subjective? I'll look into restating the question. – 2012-03-28 17:12:05
Couldn't you do: where ... in (non_indexed_param1, non_indexed_param2, ...) or ... in (select ... from search_index where ...) instead of using two separate queries? – beny23 2012-03-28 16:40:34
Perhaps I was too loose with my language. By "search index" I meant a search server, in this case Sphinx. I've clarified that in the question. – 2012-03-28 17:08:10