Aggregating SQL rows with precedence

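I have a table full of items from different sources. Some sources may share the same location (in my example, different BBC news feeds are different sources, but they all come from the BBC). Each item has a "unique" ID that identifies it among other items from the same location. This means that items relating to the same news story on a site, but published in different feeds, will have the same "unique ID", though that ID is not necessarily globally unique.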

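The problem is that I want to eliminate duplicates at display time, so that (depending on which feeds you are looking at) you get at most one version of each story, even though two or three of your feeds might contain a link to it.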

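I have a sources table with information about each source, including location_id and location_precedence fields (the precedence column is named location_priority in the schema below). I then have an items table containing each item, with its unique_id, source_id and content. Items with the same unique_id and the same source location_id should appear at most once, with the highest source location_precedence winning.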

I would have thought that something like:

SELECT `sources`.`name` AS `source`, 
     `items`.`content`, 
     `items`.`published` 
FROM `items` INNER JOIN `sources` 
    ON `items`.`source_id` = `sources`.`id` AND `sources`.`active` = 1 
GROUP BY `items`.`unique_id`, `sources`.`location_id` 
ORDER BY `sources`.`location_priority` DESC

would do the trick, but it seems to ignore the location priority field. What have I missed?

Example data:

CREATE TABLE IF NOT EXISTS `sources` (
    `id` int(10) unsigned NOT NULL auto_increment, 
    `location_id` int(10) unsigned NOT NULL, 
    `location_priority` int(11) NOT NULL, 
    `active` tinyint(1) unsigned NOT NULL default '1', 
    `name` varchar(150) NOT NULL, 
    `url` text NOT NULL, 
    PRIMARY KEY (`id`), 
    KEY `active` (`active`) 
); 

INSERT INTO `sources` (`id`, `location_id`, `location_priority`, `active`, `name`, `url`) VALUES 
(1, 1, 25, 1, 'BBC News Front Page', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml'), 
(2, 1, 10, 1, 'BBC News England', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml'), 
(3, 1, 15, 1, 'BBC Technology News', 'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/technology/rss.xml'), 
(4, 2, 0, 1, 'Slashdot', 'http://rss.slashdot.org/Slashdot/slashdot'), 
(5, 3, 0, 1, 'The Daily WTF', 'http://syndication.thedailywtf.com/TheDailyWtf'); 

CREATE TABLE IF NOT EXISTS `items` (
    `id` bigint(20) unsigned NOT NULL auto_increment, 
    `source_id` int(10) unsigned NOT NULL, 
    `published` datetime NOT NULL, 
    `content` text NOT NULL, 
    `unique_id` varchar(255) NOT NULL, 
    PRIMARY KEY (`id`), 
    UNIQUE KEY `unique_id` (`unique_id`,`source_id`), 
    KEY `published` (`published`), 
    KEY `source_id` (`source_id`) 
); 

INSERT INTO `items` (`id`, `source_id`, `published`, `content`, `unique_id`) VALUES 
(1, 1, '2009-12-01 16:25:53', 'Story about Subject One',      'abc'), 
(2, 2, '2009-12-01 16:21:31', 'Subject One in story',      'abc'), 
(3, 3, '2009-12-01 16:17:20', 'Techy goodness',        'def'), 
(4, 2, '2009-12-01 16:05:57', 'Further updates on Foo case',     'ghi'), 
(5, 3, '2009-12-01 15:53:39', 'Foo, Bar and Quux in court battle',   'ghi'), 
(6, 2, '2009-12-01 15:52:02', 'Anti-Fubar protests cause disquiet',   'mno'), 
(7, 4, '2009-12-01 15:39:00', 'Microsoft Bleh meets lukewarm reception',  'pqr'), 
(8, 5, '2009-12-01 15:13:45', 'Ever thought about doing it in VB?',   'pqr'), 
(9, 1, '2009-12-01 15:13:15', 'Celebrity has &#039;new friend&#039;',  'pqr'), 
(10, 1, '2009-12-01 15:09:57', 'Microsoft launches Bleh worldwide',   'stu'), 
(11, 2, '2009-12-01 14:57:22', 'Microsoft launches Bleh in UK',    'stu'), 
(12, 3, '2009-12-01 14:57:22', 'Microsoft launches Bleh',      'stu'), 
(13, 3, '2009-12-01 14:42:15', 'Tech round-up',        'vwx'), 
(14, 2, '2009-12-01 14:36:26', 'Estates &#039;old news&#039; say government', 'yza'), 
(15, 1, '2009-12-01 14:15:21', 'Iranian doctor &#039;was poisoned&#039;',  'bcd'), 
(16, 4, '2009-12-01 14:14:02', 'Apple fans overjoyed by iBlah',    'axf');

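Expected contents after the query:

Story about Subject One
Techy goodness
Foo, Bar and Quux in court battle
Anti-Fubar protests cause disquiet
Microsoft Bleh meets lukewarm reception
Ever thought about doing it in VB?
Celebrity has 'new friend'
Microsoft launches Bleh worldwide
Tech round-up
Estates 'old news' say government
Iranian doctor 'was poisoned'
Apple fans overjoyed by iBlah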

I tried a variation of Andomar's solution, with some success:

SELECT  s.`name` AS `source`, 
      i.`content`, 
      i.`published` 
FROM  `items` i 
INNER JOIN `sources` s 
ON   i.`source_id` = s.`id` 
AND   s.`active` = 1 
INNER JOIN (
    SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
    FROM `items` i 
    INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1 
    GROUP BY `location_id`, `unique_id` 
) `filter` 
ON   i.`unique_id` = `filter`.`unique_id` 
AND   s.`location_priority` = `filter`.`prio` 
ORDER BY i.`published` DESC 
LIMIT 50

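With AND s.location_priority = filter.prio, things almost work as I would like. Because an item can come from multiple sources with the same priority, items can be repeated. In that case an extra GROUP BY i.unique_id on the outer query does the job, and I suppose it does not matter which source "wins" if the priorities are equal.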

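I had tried AND i.source_id = filter.source_id instead, which almost works (it removes the need for the extra GROUP BY) but does not give results from the right sources. In the example above, it gives me "Further updates on Foo case" (source "BBC News England") rather than "Foo, Bar and Quux in court battle" (source "BBC Technology News"). Looking at the results of the inner query, I get: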

unique_id: 'ghi' 
source_id: 2 
prio: 15

Note that the source ID is not correct (expected: 3).

Source

2009-12-06 DMI

Can you order by location_priority when it is not included in the GROUP BY columns? – 2009-12-06 13:06:12

@Yonatan Karni: In MySQL, you can. It behaves like an 'any()' aggregate function :) – Andomar 2009-12-06 13:32:55

See also: http://stackoverflow.com/questions/1438978/sql-query-to-get-max-value-based-on-different-max-value-given-multiple-records, http://stackoverflow.com/questions/95866/select-max-in-group, http://stackoverflow.com/questions/1299556/sql-group-by-max, http://stackoverflow.com/questions/1305056/mysql-selecting-all-corresponding-fields-using-max-and-group-by, http://stackoverflow.com/questions/526143/group-by-max, http://stackoverflow.com/questions/1339624/sql-select-unique-rows-from-a-group-of-results, and possibly others. – outis 2009-12-06 14:16:02

ORDER BY only orders the rows; it does not pick among them.

One way to filter out the rows with a lower priority is to use an inner join as a filter:

SELECT  s.name, i.content, i.published 
FROM  items i 
INNER JOIN sources s 
ON   i.source_id = s.id 
AND  s.active = 1 
INNER JOIN (
    SELECT unique_id, max(location_priority) as prio 
    FROM items i 
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1 
    GROUP BY unique_id) filter 
ON   i.unique_id = filter.unique_id 
AND  s.location_priority = filter.prio;

An alternative is a where ... in <subquery> clause, for example:

SELECT  s.name, i.content, i.published 
FROM  items i 
INNER JOIN sources s 
ON   i.source_id = s.id 
AND  s.active = 1 
WHERE  (i.unique_id, s.location_priority) IN (
    SELECT unique_id, max(location_priority) 
    FROM items i 
    INNER JOIN sources s ON s.id = i.source_id AND s.active = 1 
    GROUP BY unique_id 
);

This problem is also known as "selecting records holding group-wise maximum". Quassnoi has written a nice article on it.

Edit: One way to break ties between multiple sources with the same priority is a WHERE clause with a subquery. This example breaks ties on i.id DESC:

SELECT  s.name, i.unique_id, i.content, i.published 
FROM  (
      SELECT unique_id, max(location_priority) as prio 
      FROM items i 
      INNER JOIN sources s ON s.id = i.source_id AND s.active = 1 
      GROUP BY unique_id 
      ) filter 
JOIN  items i 
JOIN  sources s 
ON   s.id = i.source_id 
      AND s.active = 1 
WHERE  i.id = 
      (
      SELECT i.id 
      FROM  items i 
      JOIN  sources s 
      ON  s.id = i.source_id 
        AND s.active = 1 
      WHERE i.unique_id = filter.unique_id 
      AND  s.location_priority = filter.prio 
      ORDER BY i.id DESC 
      LIMIT 1 
      )

Quassnoi also has an article on selecting records holding group-wise maximum (resolving ties) :)
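As a rough illustration of the tie-breaking idea, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for MySQL, so it runs without a server. It is a variant of the query above, not the same query: the priority is folded into the subquery's ORDER BY, and the subquery also matches on location_id, since the question wants de-duplication per location. Source 6 and item 17 are hypothetical rows added only to create a tie, and the aliases i2 and s2 are names chosen for this sketch.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL tables.
# Only the columns the query needs are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (
    id INTEGER PRIMARY KEY,
    location_id INTEGER NOT NULL,
    location_priority INTEGER NOT NULL,
    active INTEGER NOT NULL DEFAULT 1,
    name TEXT NOT NULL
);
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL,
    content TEXT NOT NULL,
    unique_id TEXT NOT NULL
);
INSERT INTO sources (id, location_id, location_priority, active, name) VALUES
    (1, 1, 25, 1, 'BBC News Front Page'),
    (2, 1, 10, 1, 'BBC News England'),
    (4, 2, 0, 1, 'Slashdot'),
    (5, 3, 0, 1, 'The Daily WTF'),
    (6, 1, 25, 1, 'Hypothetical BBC feed');  -- hypothetical: ties with source 1
INSERT INTO items (id, source_id, content, unique_id) VALUES
    (1, 1, 'Story about Subject One', 'abc'),
    (2, 2, 'Subject One in story', 'abc'),
    (7, 4, 'Microsoft Bleh meets lukewarm reception', 'pqr'),
    (8, 5, 'Ever thought about doing it in VB?', 'pqr'),
    (17, 6, 'Subject One, tied copy', 'abc');  -- hypothetical tied item
""")

# For every item, the correlated subquery finds the single winning item id
# among items with the same unique_id at the same location: highest
# priority first, then highest item id to break ties.
rows = conn.execute("""
SELECT i.id, s.name, i.content
FROM items i
JOIN sources s ON s.id = i.source_id AND s.active = 1
WHERE i.id = (
    SELECT i2.id
    FROM items i2
    JOIN sources s2 ON s2.id = i2.source_id AND s2.active = 1
    WHERE i2.unique_id = i.unique_id
      AND s2.location_id = s.location_id
    ORDER BY s2.location_priority DESC, i2.id DESC
    LIMIT 1
)
ORDER BY i.id
""").fetchall()

for row in rows:
    print(row)
```

With this data, items 1 and 17 tie on priority 25 within location 1, and item 17 wins because it has the higher id. Items 7 and 8 share a unique_id but belong to different locations, so both are kept.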

Source

2009-12-06 13:17:21 Andomar

Thanks! The article (and knowing how to describe the problem) was very useful. – DMI 2009-12-06 13:42:07

See also: http://dev.mysql.com/doc/refman/5.1/en/example-maximum-column-group-row.html – outis 2009-12-06 13:52:23

Argh. So I tried this solution, but it does not seem to work. I have updated the main post with details. – DMI 2009-12-06 22:30:16

Do a self-join with a derived table like

select max(location_priority) from table where ...

Source

2009-12-06 12:58:47

What have I missed?

ORDER BY happens after GROUP BY has already reduced each group to a single row. Paul gives a resolution.

As for the problem with the query:

SELECT `unique_id`, `source_id`, MAX(`location_priority`) AS `prio` 
FROM `items` i 
INNER JOIN `sources` s ON s.`id` = i.`source_id` AND s.`active` = 1 
GROUP BY `location_id`, `unique_id`

source_id is neither aggregated nor grouped. As such, the value you get is indeterminate.
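One way around the indeterminate source_id, sketched here under assumptions, is to leave source_id out of the grouped subquery entirely and recover the winning row by joining back on location_id, unique_id and the maximum priority. The sketch uses Python's standard-library sqlite3 module as a stand-in for MySQL, with a subset of the sample rows; the aliases f, i2 and s2 are names chosen for this example. If two sources at the same location shared the top priority, both rows would still come back, so a tie-break would be needed on top of this.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL tables.
# Only the columns the query needs are kept; rows are a subset of the sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (
    id INTEGER PRIMARY KEY,
    location_id INTEGER NOT NULL,
    location_priority INTEGER NOT NULL,
    active INTEGER NOT NULL DEFAULT 1,
    name TEXT NOT NULL
);
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL,
    content TEXT NOT NULL,
    unique_id TEXT NOT NULL
);
INSERT INTO sources (id, location_id, location_priority, active, name) VALUES
    (1, 1, 25, 1, 'BBC News Front Page'),
    (2, 1, 10, 1, 'BBC News England'),
    (3, 1, 15, 1, 'BBC Technology News'),
    (4, 2, 0, 1, 'Slashdot'),
    (5, 3, 0, 1, 'The Daily WTF');
INSERT INTO items (id, source_id, content, unique_id) VALUES
    (1, 1, 'Story about Subject One', 'abc'),
    (2, 2, 'Subject One in story', 'abc'),
    (4, 2, 'Further updates on Foo case', 'ghi'),
    (5, 3, 'Foo, Bar and Quux in court battle', 'ghi'),
    (7, 4, 'Microsoft Bleh meets lukewarm reception', 'pqr'),
    (8, 5, 'Ever thought about doing it in VB?', 'pqr');
""")

# The derived table f selects only grouped or aggregated columns, so every
# value in it is well defined. The outer join then matches on all three of
# them, which identifies the source that actually holds the top priority.
rows = conn.execute("""
SELECT i.id, s.name, i.content
FROM items i
JOIN sources s ON s.id = i.source_id AND s.active = 1
JOIN (
    SELECT s2.location_id, i2.unique_id, MAX(s2.location_priority) AS prio
    FROM items i2
    JOIN sources s2 ON s2.id = i2.source_id AND s2.active = 1
    GROUP BY s2.location_id, i2.unique_id
) f ON f.unique_id = i.unique_id
   AND f.location_id = s.location_id
   AND f.prio = s.location_priority
ORDER BY i.id
""").fetchall()

for row in rows:
    print(row)
```

For unique_id 'ghi' this returns item 5 from "BBC Technology News" (source 3), the row the question expected. Matching on location_id also keeps the two 'pqr' items from Slashdot and The Daily WTF from each matching the other location's filter row, which is what caused the repeats when joining on unique_id and priority alone.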

Source

2009-12-06 13:12:29 outis

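This does not work: you cannot use non-aggregated columns in a HAVING clause. Even if you could, this would hide all stories that have an inactive source with a high priority. – Andomar 2009-12-06 13:29:08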

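@Andomar: In MySQL, you can. The join ensures that inactive sources are never considered for the highest priority. The real problem is that HAVING evidently filters after GROUP BY has already reduced the number of rows. – outis 2009-12-06 13:44:19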

@outis: I think you can use them in SELECT, but in 'HAVING' they give an 'unknown column' error – Andomar 2009-12-06 13:47:35
