2011-12-15 69 views
13

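T-SQL: get the percentage of character match between 2 strings

Let's say I have a set of 2 words:

Alexander and Alecsander OR Alexander and Alegzander

Alexander and Aleaxnder, or any other combination. In general, we are talking about human error when typing a word or a set of words.

What I want to achieve is to get the percentage of characters that match between the two strings.

Here is what I have so far: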

DECLARE @table1 TABLE 
(
    nr INT 
    , ch CHAR 
) 

DECLARE @table2 TABLE 
(
    nr INT 
    , ch CHAR 
) 


INSERT INTO @table1 
SELECT nr,ch FROM [dbo].[SplitStringIntoCharacters] ('WORD w') --> return a table of characters(spaces included) 

INSERT INTO @table2 
SELECT nr,ch FROM [dbo].[SplitStringIntoCharacters] ('WORD 5') 

DECLARE @resultsTable TABLE 
( 
ch1 CHAR 
, ch2 CHAR 
) 
INSERT INTO @resultsTable 
SELECT DISTINCT t1.ch ch1, t2.ch ch2 FROM @table1 t1 
FULL JOIN @table2 t2 ON t1.ch = t2.ch --> returns both matches and mismatches 

SELECT * FROM @resultsTable 
DECLARE @nrOfMatches INT, @nrOfMismatches INT, @nrOfRowsInResultsTable INT 
SELECT @nrOfMatches = COUNT(1) FROM @resultsTable WHERE ch1 IS NOT NULL AND ch2 IS NOT NULL 
SELECT @nrOfMismatches = COUNT(1) FROM @resultsTable WHERE ch1 IS NULL OR ch2 IS NULL 


SELECT @nrOfRowsInResultsTable = COUNT(1) FROM @resultsTable 


SELECT @nrOfMatches * 100/@nrOfRowsInResultsTable 

SELECT * FROM @resultsTable returns the following:

ch1   ch2 
NULL  5 
[blank]  [blank] 
D   D 
O   O 
R   R 
W   W 
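
The helper [dbo].[SplitStringIntoCharacters] is not shown in the post. For readers who want to run the code above, here is a minimal sketch of what such a function might look like. The parameter name, the NVARCHAR(100) limit and the use of sys.all_objects as a row source are all assumptions, not the asker's actual implementation.

```sql
-- Hypothetical stand-in for the helper used in the question; not the asker's code.
CREATE FUNCTION [dbo].[SplitStringIntoCharacters] (@input NVARCHAR(100))
RETURNS TABLE
AS
RETURN
    -- One row per character position: nr is the 1-based position, ch the character.
    SELECT TOP (ISNULL(LEN(@input), 0))
           CAST(n.nr AS INT)          AS nr,
           SUBSTRING(@input, n.nr, 1) AS ch
    FROM (
        -- sys.all_objects is used only as a convenient source of more than 100 rows.
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS nr
        FROM sys.all_objects
    ) AS n
    ORDER BY n.nr;
```

Note that LEN ignores trailing spaces, so in this sketch trailing spaces in the input are not returned as rows.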
+0

What's wrong with it? Does the code work correctly? – 2011-12-15 11:12:25

+0

It's not accurate. – 2011-12-15 12:10:56

Answers

20

OK, here is my solution so far:

SELECT [dbo].[GetPercentageOfTwoStringMatching]('valentin123456' ,'valnetin123456') 

returns 86 (that is, an 86% match).
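
To see where the 86 comes from, the arithmetic can be reproduced by hand. This is only a sketch: it assumes [dbo].[LEVENSHTEIN] reports a distance of 2 for this pair (the swapped 'e' and 'n' count as two differences), and the variable names are made up for illustration.

```sql
DECLARE @distance  INT = 2;   -- assumed: value reported by [dbo].[LEVENSHTEIN] for this pair
DECLARE @maxLength INT = 14;  -- LEN('valentin123456'), both strings have the same length

SELECT @distance * 100 / @maxLength       AS percentageOfBadCharacters,   -- 14 (integer division of 200 by 14)
       100 - @distance * 100 / @maxLength AS percentageOfGoodCharacters;  -- 86
```

Because everything is INT, the division truncates, so the result is a whole-number percentage.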

CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching] 
(
    @string1 NVARCHAR(100) 
    ,@string2 NVARCHAR(100) 
) 
RETURNS INT 
AS 
BEGIN 

    DECLARE @levenShteinNumber INT 

    DECLARE @string1Length INT = LEN(@string1) 
    , @string2Length INT = LEN(@string2) 
    DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END 

    SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] ( @string1 ,@string2) 

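    -- NULLIF avoids a divide-by-zero error when both strings are empty (the function then returns NULL) 
    DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100/NULLIF(@maxLengthNumber, 0) 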

    DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters 

    -- Return the result of the function 
    RETURN @percentageOfGoodCharacters 

END 




-- =============================================  
-- Create date: 2011.12.14 
-- Description: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html 
-- ============================================= 

CREATE FUNCTION [dbo].[LEVENSHTEIN](@left VARCHAR(100), 
            @right VARCHAR(100)) 
returns INT 
AS 
    BEGIN 
     DECLARE @difference INT, 
       @lenRight  INT, 
       @lenLeft  INT, 
       @leftIndex  INT, 
       @rightIndex INT, 
       @left_char  CHAR(1), 
       @right_char CHAR(1), 
       @compareLength INT 

     SET @lenLeft = LEN(@left) 
     SET @lenRight = LEN(@right) 
     SET @difference = 0 

     IF @lenLeft = 0 
     BEGIN 
      SET @difference = @lenRight 

      GOTO done 
     END 

     IF @lenRight = 0 
     BEGIN 
      SET @difference = @lenLeft 

      GOTO done 
     END 

     GOTO comparison 

     COMPARISON: 

     IF (@lenLeft >= @lenRight) 
     SET @compareLength = @lenLeft 
     ELSE 
     SET @compareLength = @lenRight 

     SET @rightIndex = 1 
     SET @leftIndex = 1 

     WHILE @leftIndex <= @compareLength 
     BEGIN 
      SET @left_char = substring(@left, @leftIndex, 1) 
      SET @right_char = substring(@right, @rightIndex, 1) 

      IF @left_char <> @right_char 
       BEGIN -- Would an insertion make them re-align? 
        IF(@left_char = substring(@right, @rightIndex + 1, 1)) 
        SET @rightIndex = @rightIndex + 1 
        -- Would a deletion make them re-align? 
        ELSE IF(substring(@left, @leftIndex + 1, 1) = @right_char) 
        SET @leftIndex = @leftIndex + 1 

        SET @difference = @difference + 1 
       END 

      SET @leftIndex = @leftIndex + 1 
      SET @rightIndex = @rightIndex + 1 
     END 

     GOTO done 

     DONE: 

     RETURN @difference 
    END 
+0

So you posted a question without actually having a problem ^^ – 2011-12-15 11:25:33

8

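Ultimately, you appear to be trying to work out how likely it is that two strings are a "fuzzy" match for one another.

SQL Server provides efficient, optimized built-in functions that will do this for you, probably with better performance than code you write yourself. The two functions you are looking for are SOUNDEX and DIFFERENCE.

Neither of them solves exactly what you asked for (they do not return a percentage match), but I believe they solve what you are ultimately trying to achieve.

SOUNDEX returns a 4-character code: the first letter of the word followed by a 3-digit code that represents the sound pattern of the word. Consider the following: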

SELECT SOUNDEX('Alexander') 
SELECT SOUNDEX('Alegzander') 
SELECT SOUNDEX('Owleksanndurr') 
SELECT SOUNDEX('Ulikkksonnnderrr') 
SELECT SOUNDEX('Jones') 

/* Results: 

A425 
A425 
O425 
U425 
J520 

*/ 

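You'll notice that the three-digit number 425 is the same for all of the words that sound roughly alike. So you could easily match them up and say "You typed 'Owleksanndurr', did you mean 'Alexander'?"

In addition, there is the DIFFERENCE function, which compares the SOUNDEX values of two strings and gives the similarity a score.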

SELECT DIFFERENCE( 'Alexander','Alexsander') 
SELECT DIFFERENCE( 'Alexander','Owleksanndurr') 
SELECT DIFFERENCE( 'Alexander', 'Jones') 
SELECT DIFFERENCE( 'Alexander','ekdfgaskfalsdfkljasdfl;jl;asdj;a') 

/* Results: 

4 
3 
1 
1  

*/ 

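As you can see, the higher the score (between 0 and 4), the more likely the strings are a match: 4 means the SOUNDEX codes are identical, and 0 means they have little or nothing in common.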
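
Since DIFFERENCE gives 4 for the closest SOUNDEX match, one way to use it is to filter a list of candidates by score. The following is only a sketch: the candidate names and the threshold of 3 are made up for illustration, so tune both against your own data.

```sql
-- Candidate names and the threshold are illustrative assumptions.
DECLARE @typed VARCHAR(100) = 'Alegzander';

SELECT c.name,
       DIFFERENCE(c.name, @typed) AS score
FROM (VALUES ('Alexander'), ('Jones'), ('Alexandra')) AS c(name)
WHERE DIFFERENCE(c.name, @typed) >= 3   -- keep only the closer phonetic matches
ORDER BY score DESC;
```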

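The benefit of SOUNDEX over DIFFERENCE is that, if you really need to do frequent fuzzy matching, you can store the SOUNDEX value in a separate, indexable column and index it, whereas DIFFERENCE can only calculate the SOUNDEX values at comparison time.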
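
As a rough sketch of that idea, the SOUNDEX code can be kept in a persisted computed column with an index on it. The table, column and index names below are hypothetical and only illustrate the pattern.

```sql
-- Hypothetical table; all names are illustrative.
CREATE TABLE dbo.Person
(
    PersonId         INT IDENTITY(1,1) PRIMARY KEY,
    FirstName        VARCHAR(100) NOT NULL,
    -- Computed once when the row is written, instead of on every comparison.
    FirstNameSoundex AS SOUNDEX(FirstName) PERSISTED
);

CREATE INDEX IX_Person_FirstNameSoundex ON dbo.Person (FirstNameSoundex);

-- "Did you mean ...?" lookup: an equality search on the precomputed code.
SELECT FirstName
FROM dbo.Person
WHERE FirstNameSoundex = SOUNDEX('Alegzander');
```

The lookup only finds names whose code is identical to that of the input, so 'Owleksanndurr' (O425) would not find 'Alexander' (A425) this way; comparing just the three digits would need a second column or a different predicate.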