2016-07-28

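Identifying similar (but not duplicate) fields in a database

I'm working on a query problem that I'm struggling to solve. I have a database of names. What I want to do is find the people in the database who have multiple names attached to the same ID, where those names are very similar to one another: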

ID      Name
------  ------------------
123ABC  Joe Smith
123ABC  Joseph Smith
345XYZ  Michael Johnson
345XYZ  MikeJohnson
678LMN  Suzyjones
678LMN  Suzanne Mary Jones

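So I'm hoping to build a query that can identify these people. Does anyone have any suggestions or advice? Obviously this can be fairly tricky, because we're not dealing with exact duplicates but with small, subtle variations.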

+0

Edit the tags to name the actual database – dbmitch

+0

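Did you see http://stackoverflow.com/a/38513900/5221944? It handles these subtle differences (for BigQuery) and can easily be ported to your new example. By the way, were you able to implement it for your previous question? –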

+0

@dbmitch - What do you mean? – wizkids121

Answers

0

Do a self-join where the IDs match and the names do not:

select t1.ID, t1.NAME, t2.NAME 
from your_table t1 
join your_table t2 
    on t1.ID = t2.ID
    and t1.NAME <> t2.NAME
+0

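Right, but that's not what I'm looking for. This is about identifying IDs in the database that have two different names that are similar to each other. I know how to find those IDs; it's finding examples of that kind of name that I'm having trouble with. – wizkids121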

0

You can achieve this in several ways; I suggest going the GROUP BY route.

The query below assumes that you only have distinct records, so that each name is attached to an ID only once.

;WITH CTE AS
(
    -- IDs that appear on more than one row
    SELECT ID
    FROM <yourTable>
    GROUP BY ID
    HAVING COUNT(1) > 1
)
-- join back to the table to list every name for those IDs
SELECT T.*
FROM CTE C
JOIN <yourTable> T
    ON C.ID = T.ID

If you have multiple rows with the same name in the same table, you just need to apply a DISTINCT clause beforehand.
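To make the logic of that query concrete outside SQL, here is a minimal JavaScript sketch. It first collapses repeated (ID, name) rows, which plays the role of the DISTINCT step, and then keeps only the IDs that still have more than one name. The function name idsWithSeveralNames, the row shape {id, name}, and the 999QQQ rows are assumptions made for this illustration; they are not part of the answer.

```javascript
// Hypothetical helper: returns the rows whose ID carries more than one distinct name.
function idsWithSeveralNames(rows) {
  // Step 1: collect the distinct names per ID (the DISTINCT step).
  const namesById = new Map();
  for (const { id, name } of rows) {
    if (!namesById.has(id)) namesById.set(id, new Set());
    namesById.get(id).add(name);
  }
  // Step 2: keep IDs with more than one distinct name (the HAVING step),
  // then list each surviving (id, name) once (the join back to the table).
  const result = [];
  for (const [id, names] of namesById) {
    if (names.size > 1) {
      for (const name of names) result.push({ id, name });
    }
  }
  return result;
}

const sampleRows = [
  { id: "123ABC", name: "Joe Smith" },
  { id: "123ABC", name: "Joseph Smith" },
  { id: "123ABC", name: "Joe Smith" },  // exact repeat, collapsed in step 1
  { id: "999QQQ", name: "Only Name" },  // hypothetical ID with a single name
  { id: "999QQQ", name: "Only Name" },
];
console.log(idsWithSeveralNames(sampleRows));
```

Like the SQL above, this only finds IDs that carry several names; it does not measure how similar those names are.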

0

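Check the query below; it should work for you. The WHERE similarity > -1 condition at the end of the query is deliberately permissive: by setting a value other than -1 you control the similarity threshold. The closer the threshold is to 1, the more similar the captured pairs must be; the closer it is to 0, the more pairs are captured.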

SELECT ID, Name1, Name2, similarity FROM 
JS(// input table 
(
    SELECT one.ID AS ID, one.Name AS Name1, two.Name AS Name2 
    FROM YourTable AS one 
    JOIN YourTable AS two ON one.ID = two.ID 
    HAVING Name1 < Name2 
) , 
// input columns 
ID, Name1, Name2, 
// output schema 
"[{name: 'ID', type:'string'}, 
    {name: 'Name1', type:'string'}, 
    {name: 'Name2', type:'string'}, 
    {name: 'similarity', type:'float'}] 
", 
// function 
"function(r, emit) { 

    var _extend = function(dst) { 
    var sources = Array.prototype.slice.call(arguments, 1); 
    for (var i=0; i<sources.length; ++i) { 
     var src = sources[i]; 
     for (var p in src) { 
     if (src.hasOwnProperty(p)) dst[p] = src[p]; 
     } 
    } 
    return dst; 
    }; 

    var Levenshtein = { 
    /** 
    * Calculate levenshtein distance of the two strings. 
    * 
    * @param str1 String the first string. 
    * @param str2 String the second string. 
    * @return Integer the levenshtein distance (0 and above). 
    */ 
    get: function(str1, str2) { 
     // base cases 
     if (str1 === str2) return 0; 
     if (str1.length === 0) return str2.length; 
     if (str2.length === 0) return str1.length; 

     // two rows 
     var prevRow = new Array(str2.length + 1), 
      curCol, nextCol, i, j, tmp; 

     // initialise previous row 
     for (i=0; i<prevRow.length; ++i) { 
     prevRow[i] = i; 
     } 

     // calculate current row distance from previous row 
     for (i=0; i<str1.length; ++i) { 
     nextCol = i + 1; 

     for (j=0; j<str2.length; ++j) { 
      curCol = nextCol; 

      // substitution
      nextCol = prevRow[j] + ((str1.charAt(i) === str2.charAt(j)) ? 0 : 1); 
      // insertion 
      tmp = curCol + 1; 
      if (nextCol > tmp) { 
      nextCol = tmp; 
      } 
      // deletion 
      tmp = prevRow[j + 1] + 1; 
      if (nextCol > tmp) { 
      nextCol = tmp; 
      } 

      // copy current col value into previous (in preparation for next iteration) 
      prevRow[j] = curCol; 
     } 

     // copy last col value into previous (in preparation for next iteration) 
     prevRow[j] = nextCol; 
     } 

     return nextCol; 
    } 

    }; 

    var the_Name1, the_Name2;

    try { 
    the_Name1 = decodeURI(r.Name1).toLowerCase(); 
    } catch (ex) { 
    the_Name1 = r.Name1.toLowerCase(); 
    } 

    try { 
    the_Name2 = decodeURI(r.Name2).toLowerCase(); 
    } catch (ex) { 
    the_Name2 = r.Name2.toLowerCase(); 
    } 

    emit({ID: r.ID, Name1: the_Name1, Name2: the_Name2, 
     similarity: 1 - Levenshtein.get(the_Name1, the_Name2)/the_Name1.length}); 

    }" 
) 
WHERE similarity > -1 
ORDER BY similarity DESC 
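
The same idea can be tried outside BigQuery. The sketch below is plain JavaScript (runnable in Node.js), not the JS() UDF itself: it pairs the names that share an ID, scores each pair as 1 minus the Levenshtein distance divided by the length of the first name, as the query does, and then applies a threshold. The names levenshtein and similarPairs, the row shape {id, name}, and the 0.6 threshold are assumptions chosen for illustration.

```javascript
// Levenshtein distance, keeping only two rows of the dynamic-programming table.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 0; i < a.length; i++) {
    const cur = [i + 1];
    for (let j = 0; j < b.length; j++) {
      const cost = a[i] === b[j] ? 0 : 1;
      // substitution, insertion, deletion
      cur.push(Math.min(prev[j] + cost, cur[j] + 1, prev[j + 1] + 1));
    }
    prev = cur;
  }
  return prev[b.length];
}

// Pair up names sharing an ID and keep the pairs scoring above the threshold.
function similarPairs(rows, threshold) {
  const pairs = [];
  for (const one of rows) {
    for (const two of rows) {
      // Same ID, and name1 < name2 so that each unordered pair appears once.
      if (one.id !== two.id || !(one.name < two.name)) continue;
      const name1 = one.name.toLowerCase();
      const name2 = two.name.toLowerCase();
      const similarity = 1 - levenshtein(name1, name2) / name1.length;
      if (similarity > threshold) {
        pairs.push({ id: one.id, name1, name2, similarity });
      }
    }
  }
  // Most similar pairs first, like ORDER BY similarity DESC.
  return pairs.sort((x, y) => y.similarity - x.similarity);
}

const rows = [
  { id: "123ABC", name: "Joe Smith" },
  { id: "123ABC", name: "Joseph Smith" },
  { id: "678LMN", name: "Suzyjones" },
  { id: "678LMN", name: "Suzanne Mary Jones" },
];
console.log(similarPairs(rows, -1));   // permissive threshold: both pairs
console.log(similarPairs(rows, 0.6));  // stricter threshold: only the 123ABC pair
```

With these four rows the 123ABC pair scores about 0.67 and the 678LMN pair scores 0.5, so raising the threshold from -1 to 0.6 drops the second pair.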

You can test it with the example below:

SELECT ID, Name1, Name2, similarity FROM 
JS(// input table 
(
    SELECT one.ID AS ID, one.Name AS Name1, two.Name AS Name2 
    FROM (
    SELECT ID, Name FROM 
     (SELECT '123ABC' AS ID, 'Joe Smith' AS Name), 
     (SELECT '123ABC' AS ID, 'Joseph Smith' AS Name), 
     (SELECT '345XYZ' AS ID, 'Michael Johnson' AS Name), 
     (SELECT '345XYZ' AS ID, 'MikeJohnson' AS Name), 
     (SELECT '678LMN' AS ID, 'Suzyjones' AS Name), 
     (SELECT '678LMN' AS ID, 'Suzanne Mary Jones' AS Name), 
     (SELECT 'AAA' AS ID, 'Jordan Tigani' AS Name), 
     (SELECT 'AAA' AS ID, 'Felipe Hoffa' AS Name), 
     (SELECT 'BBB' AS ID, 'Mikhail Berlyant' AS Name), 
     (SELECT 'BBB' AS ID, 'Michael Sheldon' AS Name), 
) AS one 
    JOIN (
    SELECT ID, Name FROM 
     (SELECT '123ABC' AS ID, 'Joe Smith' AS Name), 
     (SELECT '123ABC' AS ID, 'Joseph Smith' AS Name), 
     (SELECT '345XYZ' AS ID, 'Michael Johnson' AS Name), 
     (SELECT '345XYZ' AS ID, 'MikeJohnson' AS Name), 
     (SELECT '678LMN' AS ID, 'Suzyjones' AS Name), 
     (SELECT '678LMN' AS ID, 'Suzanne Mary Jones' AS Name), 
     (SELECT 'AAA' AS ID, 'Jordan Tigani' AS Name), 
     (SELECT 'AAA' AS ID, 'Felipe Hoffa' AS Name), 
     (SELECT 'BBB' AS ID, 'Mikhail Berlyant' AS Name), 
     (SELECT 'BBB' AS ID, 'Michael Sheldon' AS Name), 
) AS two 
    ON one.ID = two.ID 
    HAVING Name1 < Name2 
) , 
// input columns 
ID, Name1, Name2, 
// output schema 
"[{name: 'ID', type:'string'}, 
    {name: 'Name1', type:'string'}, 
    {name: 'Name2', type:'string'}, 
    {name: 'similarity', type:'float'}] 
", 
// function 
"function(r, emit) { 

    var _extend = function(dst) { 
    var sources = Array.prototype.slice.call(arguments, 1); 
    for (var i=0; i<sources.length; ++i) { 
     var src = sources[i]; 
     for (var p in src) { 
     if (src.hasOwnProperty(p)) dst[p] = src[p]; 
     } 
    } 
    return dst; 
    }; 

    var Levenshtein = { 
    /** 
    * Calculate levenshtein distance of the two strings. 
    * 
    * @param str1 String the first string. 
    * @param str2 String the second string. 
    * @return Integer the levenshtein distance (0 and above). 
    */ 
    get: function(str1, str2) { 
     // base cases 
     if (str1 === str2) return 0; 
     if (str1.length === 0) return str2.length; 
     if (str2.length === 0) return str1.length; 

     // two rows 
     var prevRow = new Array(str2.length + 1), 
      curCol, nextCol, i, j, tmp; 

     // initialise previous row 
     for (i=0; i<prevRow.length; ++i) { 
     prevRow[i] = i; 
     } 

     // calculate current row distance from previous row 
     for (i=0; i<str1.length; ++i) { 
     nextCol = i + 1; 

     for (j=0; j<str2.length; ++j) { 
      curCol = nextCol; 

      // substitution
      nextCol = prevRow[j] + ((str1.charAt(i) === str2.charAt(j)) ? 0 : 1); 
      // insertion 
      tmp = curCol + 1; 
      if (nextCol > tmp) { 
      nextCol = tmp; 
      } 
      // deletion 
      tmp = prevRow[j + 1] + 1; 
      if (nextCol > tmp) { 
      nextCol = tmp; 
      } 

      // copy current col value into previous (in preparation for next iteration) 
      prevRow[j] = curCol; 
     } 

     // copy last col value into previous (in preparation for next iteration) 
     prevRow[j] = nextCol; 
     } 

     return nextCol; 
    } 

    }; 

    var the_Name1, the_Name2;

    try { 
    the_Name1 = decodeURI(r.Name1).toLowerCase(); 
    } catch (ex) { 
    the_Name1 = r.Name1.toLowerCase(); 
    } 

    try { 
    the_Name2 = decodeURI(r.Name2).toLowerCase(); 
    } catch (ex) { 
    the_Name2 = r.Name2.toLowerCase(); 
    } 

    emit({ID: r.ID, Name1: the_Name1, Name2: the_Name2, 
     similarity: 1 - Levenshtein.get(the_Name1, the_Name2)/the_Name1.length}); 

    }" 
) 
WHERE similarity > -1 
ORDER BY similarity DESC 

It produces the following result:

ID      Name1               Name2             similarity
123ABC  joe smith           joseph smith      0.6666666666666667
345XYZ  michael johnson     mikejohnson       0.6666666666666667
678LMN  suzanne mary jones  suzyjones         0.5
BBB     michael sheldon     mikhail berlyant  0.4666666666666667
AAA     felipe hoffa        jordan tigani     0.0
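
One property of these scores is worth keeping in mind when picking a threshold: the distance is divided by the length of Name1 only, so the score depends on which name sorts first and can drop below 0 when the distance exceeds that length. A common alternative, which is a substitution and not what the query above does, is to divide by the longer of the two lengths, which keeps the score between 0 and 1. The sketch below compares the two in plain JavaScript; the function names editDistance, scoreByFirst and scoreByLongest and the "al" / "alexander" pair are assumptions for illustration.

```javascript
// Levenshtein distance, keeping only the previous row of the DP table.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 0; i < a.length; i++) {
    const cur = [i + 1];
    for (let j = 0; j < b.length; j++) {
      const cost = a[i] === b[j] ? 0 : 1;
      cur.push(Math.min(prev[j] + cost, cur[j] + 1, prev[j + 1] + 1));
    }
    prev = cur;
  }
  return prev[b.length];
}

// Score as in the query above: normalised by the first name only.
function scoreByFirst(name1, name2) {
  return 1 - editDistance(name1, name2) / name1.length;
}

// Assumed alternative: normalised by the longer name, so 0 <= score <= 1.
function scoreByLongest(name1, name2) {
  const longest = Math.max(name1.length, name2.length);
  return longest === 0 ? 1 : 1 - editDistance(name1, name2) / longest;
}

// A short first name against a long second name: 7 insertions are needed.
console.log(scoreByFirst("al", "alexander"));    // 1 - 7/2, below the -1 cutoff
console.log(scoreByLongest("al", "alexander"));  // 1 - 7/9, stays within [0, 1]
```

With the first formula, a pair like this would be filtered out even by WHERE similarity > -1; with the second, thresholds between 0 and 1 behave the same regardless of name order.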