String similarity -> Levenshtein distance

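I am using the Levenshtein algorithm to find the similarity between two strings. This is a very important part of the program I am writing, so it needs to be effective. The problem is that the algorithm does not consider the following example similar: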

CONAIR
AIRCON

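The algorithm gives a distance of 6. So for this 6-letter word (you take the word with the most letters), the difference is 100% => the similarity is 0%.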
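To make the arithmetic above concrete, here is a minimal sketch of the classic Levenshtein distance and the normalization described in the question. The function names `levenshtein` and `similarity` are illustrative only and do not come from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string (1.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("CONAIR", "AIRCON"))  # 6
print(similarity("CONAIR", "AIRCON"))   # 0.0
```

Every matching letter sits three positions away from its partner, so no alignment is cheaper than six single-character edits.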

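I need a way to measure the similarity between two strings that also takes cases like the one above into account.

Is there a better algorithm I could use? Or what would you recommend?

Edit: I also looked at the Damerau-Levenshtein algorithm, which adds transpositions. The problem is that these transpositions only apply to adjacent characters (not to blocks of several characters).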

2012-07-26 Fede Lerner

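Before looking for a string distance algorithm, you need to define clearly what kinds of transformations you consider acceptable. What makes these strings more similar than two random 6-letter strings? Can you express it in such a way that you can get from one string to the other, with each step making them more similar? – 2012-07-27 05:54:04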

I would split the word into unigrams, bigrams and trigrams, and then compute the cosine similarity.
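A rough sketch of that idea, assuming character n-grams of sizes 1 to 3 are all counted into one profile with equal weight (the answer does not say how to weight them). The helper names `ngram_profile` and `cosine_similarity` are made up for this example.

```python
from collections import Counter
from math import sqrt


def ngram_profile(word: str, sizes=(1, 2, 3)) -> Counter:
    """Count every character n-gram of the given sizes in `word`."""
    return Counter(
        word[i:i + n]
        for n in sizes
        for i in range(len(word) - n + 1)
    )


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two n-gram count vectors."""
    pa, pb = ngram_profile(a), ngram_profile(b)
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm = sqrt(sum(c * c for c in pa.values())) * sqrt(sum(c * c for c in pb.values()))
    return dot / norm if norm else 0.0


print(round(cosine_similarity("CONAIR", "AIRCON"), 3))  # 0.8
```

For CONAIR and AIRCON, 12 of the 15 n-grams in each profile are shared (all six letters, four bigrams, two trigrams), which gives 12/15.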

2012-07-26 18:00:40 maniek

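For anyone who needs help with how to actually implement this: https://gist.github.com/darcy/2896009 – keithhackbarth 2014-04-14 22:13:19

**@keithhackbarth's solution depends heavily on MongoDB.** I would really appreciate a standalone solution, preferably a literate one. – 2016-02-11 12:12:28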

It sounds like you might want to try computing the Levenshtein distance over syllables or phonemes instead of letters.

2012-07-26 18:04:14

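I have already tried this approach, using syllables. The problem is that for two such words the syllables are split differently depending on where they fall in the word (I am not sure this is the correct way to split the words in English; I am actually doing this in Spanish): CO NAIR versus AIR CON. – 2012-07-26 18:48:21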

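In theory, the approach you are using is correct for the problem you are trying to solve. But Levenshtein only considers the individual characters of the two strings.

String similarity can also be found with the Longest Common Subsequence method, and you can then apply Levenshtein to the remaining unmatched parts.

If you want to take a clustering approach, the following answer seems to have some details, but it is obviously harder to implement.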

2012-07-26 18:04:23

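The longest common subsequence method is exactly the same as the Levenshtein method. The Levenshtein distance is the sum of the differences between the string lengths and the LCS length. – reinierpost 2014-05-15 08:45:42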
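A small sketch of the relation in this comment. As far as I can tell it holds exactly when only insertions and deletions are allowed; once substitutions are permitted, the Levenshtein distance can be smaller. For CONAIR and AIRCON both give 6. The names `lcs_length` and `indel_distance` are illustrative.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edits needed when only insertions and deletions are allowed."""
    lcs = lcs_length(a, b)
    return (len(a) - lcs) + (len(b) - lcs)


print(lcs_length("CONAIR", "AIRCON"))      # 3 ("CON" or "AIR")
print(indel_distance("CONAIR", "AIRCON"))  # 6
print(indel_distance("abc", "abd"))        # 2, although one substitution would do
```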

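Sorting the letters of each word and then computing Levenshtein will give a 100% match for your example, but it will also give a 100% match for, e.g.,

CONAIR
RCIAON

which is probably not what you want.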
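A tiny illustration of why sorting is too permissive: once the letters are sorted, every anagram becomes identical, so the distance between the sorted forms is 0. The helper name `sorted_letters_match` is made up for this sketch.

```python
def sorted_letters_match(a: str, b: str) -> bool:
    """True when both strings contain exactly the same letters (anagrams)."""
    return sorted(a) == sorted(b)


print("".join(sorted("CONAIR")))                 # ACINOR
print(sorted_letters_match("CONAIR", "AIRCON"))  # True
print(sorted_letters_match("CONAIR", "RCIAON"))  # True
```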

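Another way to define similarity is to find the common substrings of the two strings. You can build a suffix tree, find all the common substrings, and try to determine how similar the strings are. For your example, the suffix tree would give the common substrings CON and AIR, which cover the entire word (for your two strings), and you could therefore conclude that they are similar.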
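The sketch below is not a suffix tree. It is a brute-force stand-in that greedily peels off the longest shared substring, which is enough to show the idea on short words. The names `common_blocks` and `coverage`, the `min_len` cutoff of 2, and the use of `\x00` / `\x01` as separators (assumed not to occur in the input) are all choices made for this example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run shared by a and b (simple quadratic DP)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def common_blocks(a: str, b: str, min_len: int = 2) -> list:
    """Greedily collect shared substrings of at least min_len characters."""
    blocks = []
    while True:
        block = longest_common_substring(a, b)
        if len(block) < min_len:
            return blocks
        blocks.append(block)
        # Swap the used block for different separators in each string so it
        # cannot match again and its neighbours do not fuse together.
        a = a.replace(block, "\x00", 1)
        b = b.replace(block, "\x01", 1)


def coverage(a: str, b: str) -> float:
    """Fraction of the longer string covered by the shared blocks."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(len(block) for block in common_blocks(a, b)) / longest


print(common_blocks("CONAIR", "AIRCON"))  # ['CON', 'AIR']
print(coverage("CONAIR", "AIRCON"))       # 1.0
print(common_blocks("CONAIR", "RCIAON"))  # ['ON']
```

Unlike the sorted-letters check, the scrambled RCIAON scores low here because only "ON" survives as a shared block.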

2012-07-27 08:15:30 user1168577

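I think this can be solved by running the longest common substring/subsequence algorithm on one string (e.g. "conair") and the other string appended to itself once (e.g. "aircon" -> "airconaircon").

Sample code in C: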

#include <stdio.h> 
#include <string.h> 
#include <stdlib.h> 

// Returns the length of the longest common subsequence (LCS)
// between two given strings.
//
// This recursive implementation takes exponential time; it can be
// replaced by a more performant dynamic programming implementation.
size_t llcs(const char* s1, const char* s2) 
{ 
    size_t len[3]; 

    if (*s1 == '\0' || *s2 == '\0') return 0; 

    len[0] = (*s1 == *s2) + llcs(s1 + 1, s2 + 1); 
    len[1] = llcs(s1 + 1, s2); 
    len[2] = llcs(s1, s2 + 1); 

    if (len[0] < len[1]) len[0] = len[1]; 
    if (len[0] < len[2]) len[0] = len[2]; 

    return len[0]; 
} 

// Returns similarity of two given strings in the range 
// from 0.0 to 1.0 (1.0 for equal strings). 
double similarity(const char* s1, const char* s2) 
{ 
    size_t s1len = strlen(s1); 
    size_t s2len = strlen(s2); 
    double sim; 

    if (s1len == 0 && s2len == 0)
    {
        // Two empty strings are equal
        sim = 1;
    }
    else
    {
        size_t len;
        // Append s1 to itself in s1s1 (e.g. "aircon" -> "airconaircon")
        char* s1s1 = malloc(s1len * 2 + 1);
        if (s1s1 == NULL)
        {
            // Out of memory: nothing sensible to return
            perror("malloc");
            exit(EXIT_FAILURE);
        }
        strcpy(s1s1, s1);
        strcpy(s1s1 + s1len, s1);

        // Find the length of the LCS between s1s1 and s2
        // (e.g. between "airconaircon" and "conair")
        len = llcs(s1s1, s2);
        // We need it not longer than s1 (e.g. "aircon")
        // since we're actually comparing s1 and s2
        if (len > s1len) len = s1len;

        len *= 2;

        // Prevent 100% similarity between a string and its
        // cyclically shifted version (e.g. "aircon" and "conair")
        if (len == s1len + s2len && strcmp(s1, s2) != 0) len--;

        // Get the final measure of the similarity
        sim = (double)len/(s1len + s2len);

        free(s1s1);
    }

    return sim; 
} 

int main(int argc, char** argv) 
{ 
    if (argc == 3)
        printf("Similarity of \"%s\" and \"%s\" is %.2f%%\n",
               argv[1], argv[2], 100 * similarity(argv[1], argv[2]));
    else
        printf("Usage:\n %s string1 string2\n",
               argv[0]);
    return 0;
}

Sample output:

Similarity of "123" and "123" is 100.00% 
Similarity of "123" and "1234" is 85.71% 
Similarity of "0123" and "123" is 85.71% 
Similarity of "a" and "aa" is 66.67% 
Similarity of "aa" and "a" is 66.67% 
Similarity of "aaaaaaa" and "aaaaaa" is 92.31% 
Similarity of "aaaaaa" and "aaaaaaa" is 92.31% 
Similarity of "aircon" and "conair" is 91.67% 
Similarity of "spit" and "pits" is 87.50% 
Similarity of "pits" and "spit" is 87.50% 
Similarity of "spits" and "pits" is 88.89% 
Similarity of "pits" and "spits" is 88.89%
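The comment in the C code notes that the recursive `llcs` could be replaced by dynamic programming. Below is one possible sketch of that, written in Python for brevity and following the same steps as the C `similarity` function. The name `cyclic_similarity` is mine, not the answer's.

```python
def lcs_length(a: str, b: str) -> int:
    """Longest common subsequence length via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def cyclic_similarity(s1: str, s2: str) -> float:
    """Similarity in [0.0, 1.0] that tolerates cyclic shifts of s1."""
    if not s1 and not s2:
        return 1.0  # two empty strings are equal
    # Compare s2 against s1 doubled, so any rotation of s1 appears in full,
    # then cap the match at the length of s1 itself.
    length = 2 * min(lcs_length(s1 + s1, s2), len(s1))
    # A rotation should score high, but not as high as an exact match.
    if length == len(s1) + len(s2) and s1 != s2:
        length -= 1
    return length / (len(s1) + len(s2))


print(f"{100 * cyclic_similarity('aircon', 'conair'):.2f}%")  # 91.67%
print(f"{100 * cyclic_similarity('123', '1234'):.2f}%")       # 85.71%
```

The table-based version runs in time proportional to the product of the two lengths, so it stays usable on inputs where the recursive version would not finish.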

2012-07-27 13:52:29

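Thanks, I did implement this approach. I don't think it is, on its own, the best way to find the similarity between two strings (because it does not handle many cases correctly), but it is definitely a good one if you also use other methods. So I can add this as one rule and compute the similarity with another rule as well. – 2012-08-01 17:44:08

Adding transpositions is trivial. – 2012-08-01 18:06:59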

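Take a look at the EMBOSS package, or at the Smith-Waterman algorithm. They handle string matching with an edit distance suited to DNA sequences, where any kind of insertion, reversal or transposon may occur anywhere and at any length. That said, I should add that there is no optimal solution for a sufficiently long string. Don't forget that the edit costs depend on the context in which the algorithm is used (a semantic matter), whereas any algorithm is always a syntactic machine.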
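For reference, a bare-bones sketch of Smith-Waterman local alignment scoring with a linear gap penalty. The scores (match +2, mismatch -1, gap -1) are arbitrary values picked for the example, not the defaults of EMBOSS or any other tool.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            score = max(
                0,                                                # start a new local alignment
                prev[j - 1] + (match if ca == cb else mismatch),  # align ca with cb
                prev[j] + gap,                                    # skip ca (gap in b)
                curr[j - 1] + gap,                                # skip cb (gap in a)
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman("CONAIR", "AIRCON"))  # 6: three matched letters ("CON" or "AIR")
print(smith_waterman("AIRCON", "AIRCON"))  # 12
```

Plain local alignment finds only one of the two shared blocks here, so block moves still need separate handling.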

2014-01-11 22:50:26 monnoo

Try other similarity measures such as Sørensen, Jaccard and jaro_winkler.

Personally I am a big fan of Jaro-Winkler, since it has served my purpose many times.

In [1]: from Levenshtein import jaro_winkler
In [2]: jaro_winkler("conair","aircon")
Out[2]: 0.8333333333333334
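The other two measures mentioned can be sketched without any library. This version assumes they are computed over sets of character bigrams, which is one common choice; the answer does not specify the unit. The function names are illustrative.

```python
def bigrams(word: str) -> set:
    """Set of adjacent character pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard(a: str, b: str) -> float:
    """Shared bigrams divided by all distinct bigrams."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y) if x | y else 1.0


def sorensen_dice(a: str, b: str) -> float:
    """Twice the shared bigrams divided by the total bigram count."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 1.0


print(round(jaccard("conair", "aircon"), 3))        # 0.667
print(round(sorensen_dice("conair", "aircon"), 3))  # 0.8
```

Both words share the bigrams co, on, ai and ir, and each has one bigram the other lacks (na and rc).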

2016-09-12 05:57:53 Yaswanth
