2012-04-21 41 views
4

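How to order distinct tuples in a PostgreSQL query

I am trying to write a Postgres query that returns only distinct tuples. In my sample query, I do not want duplicate rows when a cluster_id/feed_id combination occurs more than once. If I run a simple: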

select distinct on (cluster_info.cluster_id, feed_id) 
    cluster_info.cluster_id, num_docs, feed_id, url_time 
    from url_info 
    join cluster_info on (cluster_info.cluster_id = url_info.cluster_id) 
    where feed_id in (select pot_seeder from potentials) 
    and num_docs > 5 and url_time > '2012-04-16'; 

I get exactly that, but I would also like the result sorted by num_docs. So when I run the following:

select distinct on (cluster_info.cluster_id, feed_id) 
    cluster_info.cluster_id, num_docs, feed_id, url_time 
    from url_info join cluster_info 
    on (cluster_info.cluster_id = url_info.cluster_id) 
    where feed_id in (select pot_seeder from potentials) 
    and num_docs > 5 and url_time > '2012-04-16' 
    order by num_docs desc; 

I get the following error:

ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions 
LINE 1: select distinct on (cluster_info.cluster_id, feed_id) cluste... 

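I think I understand why I get the error (I cannot order by a column unless I explicitly describe the distinct group somehow), but how do I do that? Or, if my reading of the error is wrong, is there another way to achieve my original goal?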

Answers

10

The leftmost ORDER BY items must not disagree with the items of the DISTINCT ON clause. I quote the manual about DISTINCT:

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
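
To see the rule in isolation, here is a minimal sketch on made-up data. The CTE name `t` and its values are hypothetical and not taken from the question; only the column names are borrowed from it. The ORDER BY starts with the DISTINCT ON columns, and the additional `num_docs DESC` item decides which row of each group is kept:

```sql
-- Hypothetical sample data; only the column names come from the question.
WITH t (cluster_id, feed_id, num_docs) AS (
    VALUES (1, 10, 6), (1, 10, 9), (2, 10, 7)
)
SELECT DISTINCT ON (cluster_id, feed_id)
       cluster_id, feed_id, num_docs
FROM   t
ORDER  BY cluster_id, feed_id,   -- leftmost items match DISTINCT ON
          num_docs DESC;         -- tie-breaker: keep the highest num_docs per group
```

This should return one row per (cluster_id, feed_id) pair, each carrying the highest num_docs of its group. Ordering by `num_docs DESC` alone, as in the question, is what triggers the error.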

Try:

SELECT * 
FROM (
    SELECT DISTINCT ON (c.cluster_id, feed_id) 
      c.cluster_id, num_docs, feed_id, url_time 
    FROM url_info u 
    JOIN cluster_info c ON (c.cluster_id = u.cluster_id) 
    WHERE feed_id IN (SELECT pot_seeder FROM potentials) 
    AND num_docs > 5 
    AND url_time > '2012-04-16' 
    ORDER BY c.cluster_id, feed_id, num_docs, url_time 
      -- first columns match DISTINCT 
      -- the rest to pick certain values for dupes 
      -- or did you want to pick random values for dupes? 
    ) x 
ORDER BY num_docs DESC; 

Or use GROUP BY:

SELECT c.cluster_id 
    , num_docs 
    , feed_id 
    , url_time 
FROM url_info u 
JOIN cluster_info c ON (c.cluster_id = u.cluster_id) 
WHERE feed_id IN (SELECT pot_seeder FROM potentials) 
AND num_docs > 5 
AND url_time > '2012-04-16' 
GROUP BY c.cluster_id, feed_id 
ORDER BY num_docs DESC; 

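If c.cluster_id and feed_id are the primary key columns of all tables (both, in this case) whose columns you include in the SELECT list, then this just works in PostgreSQL 9.1 or later.

Otherwise you need to GROUP BY the remaining columns, aggregate them, or provide more information.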
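
One way to read the "aggregate" option, sketched on hypothetical data: group by the two key columns and aggregate everything else. The choice of `max()` is an assumption, since the question does not say which duplicate should win, and the CTE `t` with its values is made up for illustration.

```sql
-- Hypothetical sample data; only the column names come from the question.
WITH t (cluster_id, feed_id, num_docs, url_time) AS (
    VALUES (1, 10, 6, timestamp '2012-04-17'),
           (1, 10, 9, timestamp '2012-04-18'),
           (2, 10, 7, timestamp '2012-04-19')
)
SELECT cluster_id
     , feed_id
     , max(num_docs) AS num_docs   -- assumed: largest value per group
     , max(url_time) AS url_time   -- assumed: latest timestamp per group
FROM   t
GROUP  BY cluster_id, feed_id
ORDER  BY max(num_docs) DESC;
```

Unlike DISTINCT ON, each aggregate is computed independently, so the num_docs and url_time in one result row do not necessarily come from the same source row.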

+0

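I guess I need to GROUP BY more columns, because the second query gives me: ERROR: column "c.num_docs" must appear in the GROUP BY clause or be used in an aggregate function – WildBill 2012-04-21 21:25:38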

+0

Your first query gives the following error: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 3: SELECT DISTINCT ON (c.cluster_id, feed_id) – WildBill 2012-04-21 21:26:08

+0

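@WildBill: You may have missed the update to my first query. I fixed an error in my first version. As for the second query: if you provide the missing information (which columns belong to which table, what the primary keys are, and your PostgreSQL version), my answer could be more specific. – 2012-04-21 22:21:57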