2017-03-03 77 views
0

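I am trying to write a query that returns, for each year, the number of movies released that year whose cast contains no male actors (that is, for every year, count the movies from that year with no males). Performance degrades badly when I use NOT.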

These are the tables:

ACTOR (id, fname, lname, gender) 
MOVIE (id, name, year) 
CASTS (pid, mid, role) -- pid refers to actor id, mid refers to movie id 

These are my indexes (id is the primary key in each of these tables, so those columns are already indexed, or so I assume):

CREATE INDEX gender_index on actor(gender); 
CREATE INDEX movie_name_index on movie(name); 
CREATE INDEX movie_year_index on movie(year); 
CREATE INDEX casts_index on casts(pid, mid, role); 
CREATE INDEX casts_pid_index on casts(pid); 
CREATE INDEX casts_mid_index on casts(mid); 
CREATE INDEX casts_role_index on casts(role); 

This is my query:

SELECT m.year, count(m.id) 
FROM movie as m 
WHERE m.id NOT IN (
    SELECT DISTINCT m.id 
    FROM movie as m, casts as c, actor as a 
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M' 
) 
GROUP BY m.year 
ORDER BY m.year 

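The query takes forever (it never finishes), so how can I make it faster? Would using NOT EXISTS help, even though I assumed the optimizer handles this? Do I need to index something else? Is there another, better query? In case it makes a difference, I am using PostgreSQL.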

Here is the EXPLAIN output:

"GroupAggregate (cost=1512539.61..171886457832.52 rows=61 width=8)" 
" Group Key: m.year" 
" -> Index Scan using movie_year_index on movie m (cost=1512539.61..171886453988.38 rows=768706 width=8)" 
"  Filter: (NOT (SubPlan 1))" 
"  SubPlan 1" 
"   -> Materialize (cost=1512539.18..1732298.66 rows=1537411 width=4)" 
"    -> Unique (cost=1512539.18..1718605.60 rows=1537411 width=4)" 
"      -> Merge Join (cost=1512539.18..1700559.32 rows=7218511 width=4)" 
"       Merge Cond: (m_1.id = c.mid)" 
"       -> Index Only Scan using movie_pkey on movie m_1 (cost=0.43..57863.94 rows=1537411 width=4)" 
"       -> Materialize (cost=1512531.37..1548623.92 rows=7218511 width=4)" 
"         -> Sort (cost=1512531.37..1530577.65 rows=7218511 width=4)" 
"          Sort Key: c.mid" 
"          -> Hash Join (cost=54546.59..492838.95 rows=7218511 width=4)" 
"            Hash Cond: (c.pid = a.id)" 
"            -> Seq Scan on casts c (cost=0.00..186246.43 rows=11445843 width=8)" 
"            -> Hash (cost=35248.91..35248.91 rows=1176214 width=4)" 
"             -> Seq Scan on actor a (cost=0.00..35248.91 rows=1176214 width=4)" 
"               Filter: ((gender)::text = 'M'::text)" 
+5

What does the 'EXPLAIN' plan look like? – jmelesky

+0

Added the 'EXPLAIN' output. – Jack

+0

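1) Instead of a bunch of (non-unique) indexes: define primary keys and foreign keys for the tables, and make the columns NOT NULL where appropriate. 2) 'VACUUM ANALYZE' all the tables involved. – joop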
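A minimal sketch of what this comment suggests. It assumes the primary keys on actor(id) and movie(id) already exist (the plan mentions movie_pkey), that casts has no declared constraints yet, and that the data contains no NULL or orphaned pid/mid values. The constraint names are made up for illustration.

```sql
-- Assumption: casts currently has no NOT NULL or foreign key constraints.
-- The constraint names below are illustrative, not taken from the question.
ALTER TABLE casts
    ALTER COLUMN pid SET NOT NULL,
    ALTER COLUMN mid SET NOT NULL,
    ADD CONSTRAINT casts_pid_fkey FOREIGN KEY (pid) REFERENCES actor (id),
    ADD CONSTRAINT casts_mid_fkey FOREIGN KEY (mid) REFERENCES movie (id);

-- Refresh planner statistics. VACUUM cannot run inside a transaction block,
-- so run these as separate statements.
VACUUM ANALYZE actor;
VACUUM ANALYZE movie;
VACUUM ANALYZE casts;
```

Whether the constraints change the chosen plan depends on the data and the PostgreSQL version; up-to-date statistics are the part most likely to matter here.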

Answers

3

I would try

SELECT m.year, count(m.id) 
FROM movie m 
WHERE NOT EXISTS (
    SELECT NULL 
    FROM casts c, actor a 
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M' 
) 
GROUP BY m.year 
ORDER BY m.year 
+0

Why is 'NOT EXISTS' better than 'NOT IN'? Shouldn't the query optimizer optimize both so that there is no difference between them? – Jack

+0

[See this question/answer](http://stackoverflow.com/questions/24929/difference-between-exists-and-in-in-sql) –

+0

@Jack: check the execution plan and you will see –
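One way to follow that advice is to ask PostgreSQL for the plan of the NOT EXISTS form without running it. This is only a sketch: the plan you get depends on the server version, settings and table statistics. PostgreSQL can generally plan NOT EXISTS as an anti-join, so a node such as "Hash Anti Join" in the output, instead of the "Filter: (NOT (SubPlan 1))" seen in the question's plan, would indicate that the subquery is no longer re-scanned for every movie row.

```sql
-- Plain EXPLAIN only plans the query; it does not execute it.
EXPLAIN
SELECT m.year, count(m.id)
FROM movie m
WHERE NOT EXISTS (
    SELECT 1
    FROM casts c
    JOIN actor a ON a.id = c.pid
    WHERE c.mid = m.id
      AND a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year;
```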

2

First, use proper, explicit JOIN syntax. Second, use a correlated subquery rather than NOT IN:

SELECT m.year, count(m.id) 
FROM movie m 
WHERE NOT EXISTS (SELECT 1
        FROM casts c JOIN 
         actor a 
         ON a.id = c.pid 
        WHERE m.id = c.mid AND a.gender = 'M' 
       ) 
GROUP BY m.year 
ORDER BY m.year; 

However, my inclination is to use conditional aggregation:

SELECT m.year, SUM(CASE WHEN num_m = 0 THEN 1 ELSE 0 END) as cnt 
FROM (SELECT m.id, m.year, 
      SUM(CASE WHEN a.gender = 'M' THEN 1 ELSE 0 END) as num_m 
     FROM movie m JOIN 
      casts c 
      ON m.id = c.mid JOIN 
      actor a 
      ON a.id = c.pid 
     GROUP BY m.id, m.year 
    ) m 
GROUP BY m.year 
ORDER BY m.year; 
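One caveat worth checking: with inner joins, a movie that has no rows in casts drops out before aggregation, whereas the NOT IN and NOT EXISTS versions count it as having no males. The following is a hedged variant, not the answerer's query, that keeps such movies by using LEFT JOIN. It also uses the aggregate FILTER clause, which assumes PostgreSQL 9.4 or later; the alias per_movie is made up for illustration.

```sql
SELECT year,
       count(*) FILTER (WHERE num_m = 0) AS cnt
FROM (
    SELECT m.id, m.year,
           -- For a movie with no cast, a.gender is NULL, so nothing is counted.
           count(*) FILTER (WHERE a.gender = 'M') AS num_m
    FROM movie m
    LEFT JOIN casts c ON c.mid = m.id
    LEFT JOIN actor a ON a.id = c.pid
    GROUP BY m.id, m.year
) per_movie
GROUP BY year
ORDER BY year;
```

Unlike the NOT EXISTS form, this returns a row with cnt = 0 for years in which every movie has at least one male actor.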
+0

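I am confused about why a correlated subquery would work better than 'NOT IN'. Isn't it less efficient to evaluate the inner query for every row processed by the outer query? And shouldn't the query optimizer optimize 'NOT IN' so that it gets the same query plan as 'NOT EXISTS'? – Jack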

+0

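@Jack . . . Your version of the query pretty much precludes the optimizer from using indexes for the 'NOT IN'. But more importantly, 'NOT IN' generally does not do what you want if any of the returned values is NULL, so my habit is to use NOT EXISTS with subqueries. –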
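The NULL behaviour mentioned in this comment can be reproduced with literal values, without any of the question's tables. In the question itself the subquery returns movie.id, a primary key, so it cannot yield NULL there; the sketch only illustrates why the two forms are not interchangeable in general. The aliases t, s, x and y are made up for illustration.

```sql
-- 2 is not in the list, yet the predicate is NULL (unknown) rather than true.
SELECT 2 NOT IN (1, NULL) AS not_in_result;   -- NULL

-- NOT IN: no row survives once the subquery returns a NULL.
SELECT count(*) AS kept_by_not_in
FROM (VALUES (1), (2)) AS t(x)
WHERE x NOT IN (SELECT y FROM (VALUES (1), (NULL)) AS s(y));   -- 0

-- NOT EXISTS: the row x = 2 is kept, as most people would expect.
SELECT count(*) AS kept_by_not_exists
FROM (VALUES (1), (2)) AS t(x)
WHERE NOT EXISTS (
    SELECT 1 FROM (VALUES (1), (NULL)) AS s(y) WHERE s.y = t.x
);   -- 1
```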

0

IN performs much better than NOT IN. Why not modify your query to include records rather than exclude them?

So instead of excluding males in the query...

SELECT m.year, count(m.id) 
FROM movie as m 
WHERE m.id NOT IN (
    SELECT DISTINCT m.id 
    FROM movie as m, casts as c, actor as a 
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M' 
) 
GROUP BY m.year 
ORDER BY m.year 

just select females...

SELECT m.year, count(m.id) 
FROM movie as m 
WHERE m.id IN (
    SELECT DISTINCT m.id 
    FROM movie as m, casts as c, actor as a 
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'F' 
) 
GROUP BY m.year 
ORDER BY m.year
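Before adopting this rewrite, it is worth checking whether "has at least one female" really selects the same movies as "has no males". A small self-contained sketch with made-up rows (the CTEs shadow the real table names only inside this statement) suggests it does not:

```sql
WITH movie(id, year) AS (
    VALUES (1, 2000), (2, 2000), (3, 2000), (4, 2000)
), actor(id, gender) AS (
    VALUES (10, 'M'), (11, 'F')
), casts(pid, mid) AS (
    VALUES (10, 1), (11, 1),   -- movie 1: one male and one female
           (11, 2),            -- movie 2: female only
           (10, 3)             -- movie 3: male only
                               -- movie 4: no cast rows at all
)
SELECT m.id,
       NOT EXISTS (SELECT 1
                   FROM casts c JOIN actor a ON a.id = c.pid
                   WHERE c.mid = m.id AND a.gender = 'M') AS no_males,
       EXISTS (SELECT 1
               FROM casts c JOIN actor a ON a.id = c.pid
               WHERE c.mid = m.id AND a.gender = 'F') AS has_female
FROM movie m
ORDER BY m.id;
```

Movies 2 and 3 get the same answer from both conditions, but movie 1 (mixed cast) has a female without being male-free, and movie 4 (no cast) is male-free without having a female, so the per-year counts can differ.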