Finding duplicates that aren't exact duplicates in a MySQL table

I have set up a table as follows:

id 
origin 
destination 
carrier_id

So a typical row might be:

100: London Manchester 366

Now, every route is bi-directional, so there shouldn't be a row like this:

233: Manchester London 366

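because it is essentially the same route (for my purposes, anyway).

Unfortunately, I have ended up with a few duplicates. I have over 50,000 routes made up of around 2,000 points of origin (or destination, however you want to look at it). So I think looping through each origin to find duplicates would be insane.

So I don't even know where to start working out a query to identify them. Any ideas?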


2009-07-02 gargantuan

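I think you just need a self-join. The query below identifies all the "duplicate" records by joining each row to its reversed counterpart.

Here's an example.

Say SELECT * FROM FLIGHTS yields: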

id  origin     destination  carrierid
1   toronto    quebec       1
2   quebec     toronto      2
3   edmonton   calgary      3
4   calgary    edmonton     4
5   hull       vancouver    5
6   vancouver  edmonton     6
7   edmonton   toronto      7
9   edmonton   quebec       8
10  toronto    edmonton     9
11  quebec     edmonton     10
12  calgary    lethbridge   11

So there are a bunch of duplicates (4 of the routes are duplicates of some other route).

select * 
from flights t1 inner join flights t2 on t1.origin = t2.destination 
     AND t2.origin = t1.destination

yields only the duplicates:

id  origin    destination  carrierid  id  origin    destination  carrierid
1   toronto   quebec       1          2   quebec    toronto      2
2   quebec    toronto      2          1   toronto   quebec       1
3   edmonton  calgary      3          4   calgary   edmonton     4
4   calgary   edmonton     4          3   edmonton  calgary      3
7   edmonton  toronto      7          10  toronto   edmonton     9
9   edmonton  quebec       8          11  quebec    edmonton     10
10  toronto   edmonton     9          7   edmonton  toronto      7
11  quebec    edmonton     10         9   edmonton  quebec       8
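Note that every pair shows up twice in that result, once from each side. A minimal sketch of the same self-join that reports each pair only once, by adding `t1.id < t2.id`, is below. It runs the query against an in-memory SQLite database from Python purely so it can be tried without a MySQL server; the function name `reversed_pairs` is made up for illustration, and the join itself should carry over to MySQL unchanged.

```python
import sqlite3

# Sample rows copied from the example table in this answer.
ROWS = [
    (1, "toronto", "quebec", 1),
    (2, "quebec", "toronto", 2),
    (3, "edmonton", "calgary", 3),
    (4, "calgary", "edmonton", 4),
    (5, "hull", "vancouver", 5),
    (6, "vancouver", "edmonton", 6),
    (7, "edmonton", "toronto", 7),
    (9, "edmonton", "quebec", 8),
    (10, "toronto", "edmonton", 9),
    (11, "quebec", "edmonton", 10),
    (12, "calgary", "lethbridge", 11),
]


def reversed_pairs(rows):
    """Return (lower_id, higher_id) for every route that also exists reversed."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL table
    conn.execute(
        "CREATE TABLE flights (id INTEGER PRIMARY KEY, origin TEXT, "
        "destination TEXT, carrierid INTEGER)"
    )
    conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)
    pairs = conn.execute(
        """
        SELECT t1.id, t2.id
        FROM flights t1
        INNER JOIN flights t2
            ON t1.origin = t2.destination
           AND t2.origin = t1.destination
        WHERE t1.id < t2.id      -- report each pair once, lower id first
        ORDER BY t1.id
        """
    ).fetchall()
    conn.close()
    return pairs


print(reversed_pairs(ROWS))
```

For the sample rows this lists the pairs (1, 2), (3, 4), (7, 10) and (9, 11).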

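At that point you could delete the ones that occurred first in each pair (the row with the lower id), so that one copy of every route survives. In MySQL this is easiest as a multi-table DELETE, since MySQL won't let a subquery select from the table being deleted from:

delete t1 
from flights t1 inner join flights t2 on t1.origin = t2.destination 
      AND t2.origin = t1.destination 
where t1.id < t2.id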

Good luck!


2009-07-03 00:01:19 Tyler

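I think you probably also need "AND t1.carrier = t2.carrier" in your join statement. I suspect it would only count as a duplicate journey if both directions are served by the same carrier. Hopefully the OP will clarify the matching rules. – Convict 2009-07-03 01:03:40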
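To make the carrier point concrete, here is a rough sketch of the clean-up with the carrier check as an option. It is not the answerer's exact statement: it uses SQLite from Python so it can be run as-is, the table name `routes` and the function `remove_reversed_duplicates` are assumptions (the question never names its table), and rows 300 and 301 are invented to show a reversed pair flown by two different carriers.

```python
import sqlite3

# Rows 100 and 233 come from the question; 300 and 301 are hypothetical.
SAMPLE_ROWS = [
    (100, "London", "Manchester", 366),
    (233, "Manchester", "London", 366),
    (300, "London", "Leeds", 366),
    (301, "Leeds", "London", 400),
]


def remove_reversed_duplicates(rows, match_carrier=True):
    """Delete the earlier row of each reversed pair; return the surviving ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE routes (id INTEGER PRIMARY KEY, origin TEXT, "
        "destination TEXT, carrier_id INTEGER)"
    )
    conn.executemany("INSERT INTO routes VALUES (?, ?, ?, ?)", rows)
    # Optional extra join condition: only treat same-carrier pairs as duplicates.
    carrier_clause = "AND t1.carrier_id = t2.carrier_id" if match_carrier else ""
    conn.execute(
        f"""
        DELETE FROM routes
        WHERE id IN (
            SELECT t1.id
            FROM routes t1
            INNER JOIN routes t2
                ON t1.origin = t2.destination
               AND t2.origin = t1.destination
               {carrier_clause}
            WHERE t1.id < t2.id   -- only the earlier row of each pair
        )
        """
    )
    survivors = [row[0] for row in conn.execute("SELECT id FROM routes ORDER BY id")]
    conn.close()
    return survivors


print(remove_reversed_duplicates(SAMPLE_ROWS))                       # carrier must match
print(remove_reversed_duplicates(SAMPLE_ROWS, match_carrier=False))  # carrier ignored
```

With the carrier check on, only row 100 is removed; with it off, row 300 goes as well. Which behaviour is right depends on the matching rules the OP has in mind.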

Bummer! Off the top of my head (and in pseudo-SQL):

select `key`, count(*) as n from (
    select id, concat(origin, '_', destination, '_', carrier_id) as `key` from .... 
    union 
    select id, concat(destination, '_', origin, '_', carrier_id) as `key` from .... 
) as k 
group by `key` 
having count(*) > 1;

For the records above, the inner union gives you these id/key rows, so each key shows up under both ids:

100, London_Manchester_366 
100, Manchester_London_366 
233, Manchester_London_366 
233, London_Manchester_366

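This is really, really hackish, and it doesn't tell you exactly what you're looking for; it only narrows it down. Maybe it will give you a starting point? Maybe it will give someone else some ideas they can use to help you.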
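A possible variant of the same key idea, sketched under assumptions: instead of emitting both orientations and taking a union, put the two cities in a fixed (alphabetical) order so each row produces exactly one key, then group on it. The sketch uses SQLite from Python so it runs anywhere; SQLite's two-argument `min()`/`max()` play the role that `LEAST()`/`GREATEST()` would in MySQL. The table name `routes` and the function `duplicate_groups` are hypothetical.

```python
import sqlite3

SAMPLE_ROWS = [
    (100, "London", "Manchester", 366),
    (121, "London", "CityA", 240),
    (144, "Manchester", "CityA", 300),
    (150, "CityA", "CityB", 90),
    (233, "Manchester", "London", 366),
]


def duplicate_groups(rows):
    """Map (city_a, city_b, carrier_id) -> sorted ids for keys seen more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE routes (id INTEGER PRIMARY KEY, origin TEXT, "
        "destination TEXT, carrier_id INTEGER)"
    )
    conn.executemany("INSERT INTO routes VALUES (?, ?, ?, ?)", rows)
    cursor = conn.execute(
        """
        SELECT min(origin, destination) AS city_a,   -- alphabetically first city
               max(origin, destination) AS city_b,   -- alphabetically last city
               carrier_id,
               group_concat(id) AS ids
        FROM routes
        GROUP BY min(origin, destination), max(origin, destination), carrier_id
        HAVING count(*) > 1
        """
    )
    groups = {
        (city_a, city_b, carrier): sorted(int(i) for i in ids.split(","))
        for city_a, city_b, carrier, ids in cursor
    }
    conn.close()
    return groups


print(duplicate_groups(SAMPLE_ROWS))
```

Grouping on the normalised pair also catches exact same-direction repeats, which a reversed-pair join alone would miss.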


2009-07-02 23:33:55

If you don't mind a little shell scripting, and if you can get a dump of the input in the form you've shown here... here's my sample input:

100: London Manchester 366 
121: London CityA 240 
144: Manchester CityA 300 
150: CityA CityB 90 
233: Manchester London 366

You might be able to do something like this:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | sort 
CityA CityB 150: 
CityA London 121: 
CityA Manchester 144: 
London Manchester 100: 
London Manchester 233:

So you at least have the pairs grouped together. Not sure what the best move from there would be.

OK, here's a beast of a command line:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | (sort; echo "") | awk '{ if (fst == $1 && snd == $2) { printf "%s%s", num, $3 } else { print fst, snd; fst = $1; snd = $2; num = $3} }' | grep "^[0-9]" 
150:151:150:255:CityA CityB 
100:233:London Manchester

where m.txt has these new contents:

100: London Manchester 366 
121: London CityA 240 
144: Manchester CityA 300 
150: CityA CityB 90 
151: CityB CityA 90 
233: Manchester London 366 
255: CityA CityB 90

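Perl would probably have been a better choice than awk, but here goes: first we put the two city names in sorted order and move the ID to the end of the line, which is what I did in the first section. Then we sort those lines to group the pairs together, and we have to tack an extra blank line onto the end so that the awk script flushes the last pair. Then we loop over each line. If we see a new pair of cities, we print the cities we saw previously and store the new cities and the new ID. If we see the same cities as last time, we print the stored ID and the ID of this line. Finally, we grep for only the lines that begin with a number, which discards the non-duplicated pairs.

If a pair occurs more than twice, you'll get a duplicated ID in the output, but that's not a big deal.

Clear as mud?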
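For comparison, the grouping that the pipeline performs can be sketched in a few lines of Python. This is an illustration rather than a drop-in replacement: it assumes the same dump format as m.txt ("id: origin destination carrier"), it shares the awk version's limitation that city names cannot contain spaces, and `group_route_ids` is a made-up name.

```python
from collections import defaultdict

SAMPLE = """\
100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90
"""


def group_route_ids(lines):
    """Group route ids by unordered city pair; keep pairs seen more than once."""
    ids_by_pair = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        route_id = fields[0].rstrip(":")
        # Sorting the two names makes A->B and B->A land on the same key.
        pair = tuple(sorted((fields[1], fields[2])))
        ids_by_pair[pair].append(route_id)
    return {pair: ids for pair, ids in ids_by_pair.items() if len(ids) > 1}


for pair, ids in sorted(group_route_ids(SAMPLE.splitlines()).items()):
    print(" ".join(pair), ":", ", ".join(ids))
```

Because the ids are collected in a list per pair, a pair that occurs three times is reported once with all three ids, rather than with a repeated id.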


2009-07-02 23:44:44
