2017-03-18 44 views
0

Splitting a table in Hive

I have the following FNAMES table (it contains about 58K records):

+------+-------------+
| ID   | NICKNAMES   |
+------+-------------+
| 1    | Avile       |
| 2    | Dudi        |
| 3    | Moshiko     |
| 4    | Avi         |
| 5    | DAVE        |
....

I want to split the table into groups of records that share the same first letter, like this:

+------+-------------+
| ID   | NICKNAMES   |
+------+-------------+
| 1    | Avile       |
| 4    | Avi         |

| 2    | Dudi        |
| 5    | DAVE        |

| 3    | Moshiko     |
....

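For each split I want to find the records with the smallest Jaro–Winkler distance. In other words, each nickname that starts with 'a' should be matched to its most similar record within that group. What do I have to change in the code below?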

select FNAMES.* , MIN(Jaro–Winkler(FNAMES.NICKNAMES, FNAMES.NICKNAMES)) 
from FNAMES 
LEFT OUTER JOIN FNAMES 
ON(true) 
    WHERE Jaro–Winkler (FNAMES.NICKNAMES, FNAMES.NICKNAMES) <= 4 
GROUP BY FNAMES.NICKNAMES 

Answers

1

Something like this:

select  t.nickname1
       ,t.nickname2

from   (select  f1.nicknames as nickname1
               ,f2.nicknames as nickname2
               -- rank the candidate matches of each nickname, most similar first;
               -- jaro_winkler is assumed to be a user-defined function
               ,rank() over
                (
                    partition by f1.nicknames
                    order by     jaro_winkler(f1.nicknames,f2.nicknames) desc
                ) as rnk

        from    fnames f1

                left join fnames f2

                on   substr(f1.nicknames,1,1) =
                     substr(f2.nicknames,1,1)

        where   f1.nicknames < f2.nicknames
       ) t

where  rnk = 1
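
The query assumes that a Jaro–Winkler function is available, for example as a user-defined function. As a rough illustration of what such a function computes, here is a minimal Python sketch of the usual Jaro–Winkler similarity (prefix scale 0.1, prefix capped at 4 characters). The names `jaro` and `jaro_winkler` are illustrative only and are not part of Hive. The sketch returns a similarity, where higher means more alike, which matches the descending sort in the ranking; a distance would be 1 minus this value.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions: half the out-of-order matches

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * scale * (1.0 - sim)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(jaro_winkler("Avi", "Avile") > jaro_winkler("Avi", "Dudi"))  # True
```

The comparison is case-sensitive, just like the `substr` join condition in the query.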
+0

Thanks, Dudu! Could you, specifically for f1.nicknames Avi

+1

**(1)** Suppose you have a table with a single column "x" and two rows, "A" and "B". 't1.x <> t2.x' would return both 'A --- B' and 'B --- A'. 't1.x
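
The difference between the two join conditions in this comment can be checked with a small sketch. It uses Python's built-in `sqlite3` module in place of Hive, purely for convenience; the table `t`, its column `x`, and the helper `pairs` are assumptions that mirror the two-row example.

```python
import sqlite3

# In-memory table with a single column "x" and the two rows "A" and "B".
conn = sqlite3.connect(":memory:")
conn.execute("create table t (x text)")
conn.executemany("insert into t values (?)", [("A",), ("B",)])


def pairs(condition):
    """Self-join t on the given condition and return the resulting pairs."""
    sql = (
        "select t1.x, t2.x from t t1 join t t2 "
        f"on {condition} order by t1.x, t2.x"
    )
    return conn.execute(sql).fetchall()


not_equal = pairs("t1.x <> t2.x")
less_than = pairs("t1.x < t2.x")

print(not_equal)  # [('A', 'B'), ('B', 'A')]
print(less_than)  # [('A', 'B')]
```

With `<>` every unordered pair shows up twice, once in each direction; with `<` it shows up once.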