PostgreSQL index not being used

I have a table called item with a few million rows, whose columns look like this:

CREATE TABLE item (
    id bigint NOT NULL, 
    company_id bigint NOT NULL, 
    date_created timestamp with time zone, 
    .... 
)

There is an index on this table for company_id, which is searched on frequently:

CREATE INDEX idx_company_id ON item USING btree (company_id);

The query fetches the last 10 items for a certain customer, i.e.

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

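Currently, one customer accounts for about 75% of the data in this table, and the other 25% is spread across 25 or so other customers. This means that 75% of the rows have a company id of 5, and the remaining rows have company ids between 6 and 25.

The query generally runs very fast for all companies except the main one (id = 5). I can understand why, since the index on company_id can be used for every company other than 5.

I have tried different indexes to make the search more efficient for company 5. The one that seemed to make the most sense is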

CREATE INDEX idx_date_created 
ON item (date_created DESC NULLS LAST);

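If I add this index, queries for the main company (id = 5) improve greatly, but queries for all other companies become terrible.

Here are some EXPLAIN ANALYZE results for company ids 5 and 6, with and without the new index:

Company id 5

Before the new index: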

QUERY PLAN 
Limit (cost=214874.63..214874.65 rows=10 width=639) (actual time=10481.989..10482.017 rows=10 loops=1) 
    -> Sort (cost=214874.63..218560.33 rows=1474282 width=639) (actual time=10481.985..10481.994 rows=10 loops=1) 
     Sort Key: photo_created 
     Sort Method: top-N heapsort Memory: 35kB 
     -> Seq Scan on photo (cost=0.00..183015.92 rows=1474282 width=639) (actual time=0.009..5345.551 rows=1473561 loops=1) 
       Filter: (company_id = 5) 
       Rows Removed by Filter: 402513 
Total runtime: 10482.075 ms

After the new index:

QUERY PLAN 
Limit (cost=0.43..1.98 rows=10 width=639) (actual time=0.087..0.120 rows=10 loops=1) 
    -> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1474282 width=639) (actual time=0.084..0.099 rows=10 loops=1) 
     Filter: (company_id = 5) 
     Rows Removed by Filter: 26 
Total runtime: 0.164 ms

Company id 6

Before the new index:

QUERY PLAN 
Limit (cost=2204.27..2204.30 rows=10 width=639) (actual time=0.044..0.053 rows=3 loops=1) 
    -> Sort (cost=2204.27..2207.55 rows=1310 width=639) (actual time=0.040..0.044 rows=3 loops=1) 
     Sort Key: photo_created 
     Sort Method: quicksort Memory: 28kB 
     -> Index Scan using idx_photo__company_id on photo (cost=0.43..2175.96 rows=1310 width=639) (actual time=0.020..0.026 rows=3 loops=1) 
       Index Cond: (company_id = 6) 
Total runtime: 0.100 ms

After the new index:

QUERY PLAN 
Limit (cost=0.43..1744.00 rows=10 width=639) (actual time=0.039..3938.986 rows=3 loops=1) 
    -> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1310 width=639) (actual time=0.035..3938.975 rows=3 loops=1) 
     Filter: (company_id = 6) 
     Rows Removed by Filter: 1876071 
Total runtime: 3939.028 ms

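I have run a VACUUM FULL and an ANALYZE on the table, so PostgreSQL should have up-to-date statistics. Any ideas how I can get PostgreSQL to choose the right index for the company being queried?

Source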

2017-06-22 Mike

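My guess is that the `LIMIT` is the culprit. But it would be clearer if you provided `EXPLAIN` with `ANALYZE`; that would help us check the table statistics the planner is using. By the way, are you running `VACUUM ANALYZE`?

How many distinct `company_id` values are there? What percentage of the table has `company_id = 5`? – jmelesky

Edited my post to add more details, thanks for the help – Mike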

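This is what is known as the "abort-early plan problem", and it has been a chronic mis-optimization for years. Abort-early plans are amazing when they work, but terrible when they don't; see the linked mailing list thread for a more detailed explanation. Basically, the planner thinks it will find the 10 rows you want for customer 6 without scanning the whole date_created index, and it is wrong.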
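The cost figures in the question's plans make this reasoning visible. As a rough sketch, assuming the planner prices a LIMIT by scaling the child node's run cost by the fraction of rows it expects to need, the Limit costs of the two date_created index plans can be reproduced with plain arithmetic:

```sql
-- All numbers are copied from the EXPLAIN output in the question.
-- Assumption: limit cost = startup + (total - startup) * limit_rows / estimated_rows.

-- company_id = 6: 1310 estimated matches, assumed to be spread evenly through the index.
SELECT round(0.43 + (228408.04 - 0.43) * 10 / 1310, 2) AS limit_cost_company_6;    -- 1744.00

-- company_id = 5: 1474282 estimated matches, so 10 of them should turn up almost at once.
SELECT round(0.43 + (228408.04 - 0.43) * 10 / 1474282, 2) AS limit_cost_company_5; -- 1.98
```

The estimate of 1744 for company 6 undercuts the 2204 estimated for the plan that uses the company_id index, so the planner prefers the date_created index. In reality only 3 rows match, so the scan never reaches its 10 rows and walks the entire index, discarding 1,876,071 rows along the way.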

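Prior to PostgreSQL 10 (which is not out of beta yet), there is no hard-and-fast way to improve this query. What you want to do is nudge the query planner in various ways in the hope of getting the plan you want. The primary methods are anything that makes PostgreSQL more likely to use multi-column indexes, such as:

Lowering random_page_cost (a good idea anyway if you are on SSD).
Lowering cpu_index_tuple_cost.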
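One way to try these settings without touching the server configuration is to change them for a single session and re-run the query under EXPLAIN ANALYZE. This is only a sketch: the values below are illustrative assumptions rather than recommendations, and the table and column names follow the question.

```sql
-- Session-level experiment; nothing here persists after the connection closes.
SET random_page_cost = 1.1;          -- server default is 4.0
SET cpu_index_tuple_cost = 0.0025;   -- server default is 0.005

-- Check whether the plan for a small company changes.
EXPLAIN ANALYZE
SELECT * FROM item WHERE company_id = 6 ORDER BY date_created LIMIT 10;

-- Put the session back on the server defaults.
RESET random_page_cost;
RESET cpu_index_tuple_cost;
```

If a setting helps, it can later be applied more widely, but it affects the costing of every query, so it is worth checking other important queries as well.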

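It is also possible that you can fix the planner's behavior by playing with the table statistics. This includes:

Raising statistics_target for the table and re-running ANALYZE, so that PostgreSQL takes more samples and gets a better picture of the row distribution;
Increasing n_distinct in the statistics to accurately reflect the number of distinct company_id values or distinct date_created values.

However, all of these solutions are approximate, and if query performance degrades in the future as your data changes, this should be the first query you look at.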
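The two statistics adjustments above could look roughly like this. The target of 1000 and the n_distinct value of 26 are assumptions for illustration (the question mentions one large customer plus around 25 others); the real numbers should come from the data.

```sql
-- Sample this column more thoroughly than the default statistics target (100).
ALTER TABLE item ALTER COLUMN company_id SET STATISTICS 1000;

-- Override the estimated number of distinct values; 26 is an assumed figure.
ALTER TABLE item ALTER COLUMN company_id SET (n_distinct = 26);

-- Neither change takes effect until the table is analyzed again.
ANALYZE item;

-- Inspect what the planner now believes about the column.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'item' AND attname = 'company_id';
```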

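In PostgreSQL 10, you will be able to create Cross-Column Stats, which should improve the situation more reliably. Depending on how badly this is hurting you, you could try using the beta.

If none of that works, I suggest the #postgresql IRC channel on Freenode or the pgsql-performance mailing list. People there will ask for your detailed table statistics in order to make suggestions.

Source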
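In PostgreSQL 10, cross-column statistics are created with CREATE STATISTICS. A minimal sketch, assuming the item table from the question and a hypothetical statistics name; whether it actually changes the plan for this LIMIT query is not established here and would need to be checked with EXPLAIN ANALYZE:

```sql
-- Requires PostgreSQL 10 or later. The statistics name is made up for this example.
CREATE STATISTICS item_company_date_stats (ndistinct, dependencies)
    ON company_id, date_created FROM item;

-- Extended statistics are only populated by ANALYZE.
ANALYZE item;
```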

2017-06-22 22:45:52 FuzzyChef

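Thanks for the explanation; it does seem to be a "bug" in PostgreSQL. I can use a partial index to work around this specific case. It is a kludge, but it will buy me some time for now. – Mike

Another point: why did you create the index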
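The comment does not say what the partial index looked like. One plausible shape, offered purely as an assumption, is a date_created index restricted to the dominant company, replacing the unrestricted idx_date_created:

```sql
-- Hypothetical index name; the predicate pins the index to the one large company.
DROP INDEX IF EXISTS idx_date_created;

CREATE INDEX idx_item_company_5_date_created
    ON item (date_created)
    WHERE company_id = 5;
```

Because the planner only considers a partial index when the query's WHERE clause implies the index predicate, a query for company_id = 6 cannot use this index and falls back to the company_id index, while company_id = 5 still gets an ordered scan.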

CREATE INDEX idx_date_created ON item (date_created DESC NULLS LAST);

but run this query:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

Maybe you meant

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created DESC NULLS LAST LIMIT 10;

It is also better to create a composite index:

CREATE INDEX idx_company_id_date_created ON item (company_id, date_created DESC NULLS LAST);

And after that:

                 QUERY PLAN                  
------------------------------------------------------------------------------------------------------------------------------------------------------ 
Limit (cost=0.43..28.11 rows=10 width=16) (actual time=0.120..0.153 rows=10 loops=1) 
    -> Index Only Scan using idx_company_id_date_created on item (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.118..0.145 rows=10 loops=1) 
     Index Cond: (company_id = 5) 
     Heap Fetches: 10 
Planning time: 1.003 ms 
Execution time: 0.209 ms 
(6 rows) 

                 QUERY PLAN                  
------------------------------------------------------------------------------------------------------------------------------------------------------ 
Limit (cost=0.43..28.11 rows=10 width=16) (actual time=0.085..0.115 rows=10 loops=1) 
    -> Index Only Scan using idx_company_id_date_created on item (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.084..0.108 rows=10 loops=1) 
     Index Cond: (company_id = 6) 
     Heap Fetches: 10 
Planning time: 0.136 ms 
Execution time: 0.155 ms 
(6 rows)

It may be slower on your server, but in any case it will be much better than in the examples above.

Source

2017-07-13 06:02:18
