2016-06-13 118 views

Synonyms by Levenshtein distance

This is my code:

public void SearchWordSynonymsByLevenstein()
{
    foreach (var eachWord in wordCounter)
    {
        foreach (var eachSecondWord in wordCounter)
        {
            if (eachWord.Key.Length > 3)
            {
                var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
                if (score < 2)
                {
                    if (!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
                    {
                        if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                        {
                            wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                        }
                        else
                        {
                            wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                        }
                    }
                }
            }
        }
    }
}

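wordCounter is a Dictionary<string, int>, where the key is each of my words and the value is the number of documents in which that word occurs, like a bag of words. For every eachWord I have to search for its synonyms among all the other eachSecondWord entries. This method takes too much time. The time grows exponentially. Is there another way to reduce the time?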

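Does 'wordSynonymsByLevenstein' really need to be a 'Dictionary<string, Dictionary<string, int>>'? Why not just a 'Dictionary<string, List<string>>'? You could use it to find the "synonyms" and then go to 'wordCounter' for the counts. – juharr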

Thanks. Later on I do this: 'if (wordSynonymsByLevenstein.TryGetValue(eachMainWord, out isThisWord)) { foreach (var eachWw in isThisWord) { mainWordWithSynonyms.Add(eachWw.Key); fullCounted = fullCounted + eachWw.Value; } var distinctedWord = mainWordWithSynonyms.DistinctBy(x => x).ToList(); (y => y == x)) && compFoundWords.Any(x => distinctedWord.Any(y => y == x))) { relationScore = relationScore + ((double)1 / (double)fullCounted); countingEqualWord++; } }' so 'wordSynonymsByLevenshtein' has to be a 'Dictionary<string, Dictionary<string, int>>' like this. – Sidron

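What I'm saying is that if 'wordSynonymsByLevenstein' is a 'Dictionary<string, List<string>>', then the 'isThisWord' you get out of it will be a list of words, so change 'eachWw.Key' to 'eachWw' and 'eachWw.Value' to 'wordCounter[eachWw]'. – juharr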

Answers

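First, I assume you don't want to associate a word with itself in the wordSynonymsByLevenstein collection. Second, you can compare the lengths of the words to skip the ones you already know cannot meet your score requirement: the Levenshtein distance between two words is never smaller than the difference of their lengths, so a pair whose lengths differ by 2 or more cannot score below 2.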

public void SearchWordSynonymsByLevenstein()
{
    foreach (var eachWord in wordCounter)
    {
        foreach (var eachSecondWord in wordCounter)
        {
            // Skip the word itself, short words, and pairs whose lengths
            // differ too much to ever score below 2.
            if (eachWord.Key == eachSecondWord.Key
                || eachWord.Key.Length <= 3
                || Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
            {
                continue;
            }

            var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
            if (score >= 2)
            {
                continue;
            }

            // Only associate a word that is not already someone's synonym.
            if (!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key)))
            {
                if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                {
                    wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                }
                else
                {
                    wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                }
            }
        }
    }
}
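The post never shows LevenshteinDistance.Compute, so the following is only a minimal sketch of what such a helper might look like, not the asker's implementation. The CanBeWithin method is a hypothetical name added here to illustrate the length bound that the Math.Abs check relies on.

```csharp
using System;

// Hypothetical stand-in for the LevenshteinDistance helper used in the post;
// the asker's actual implementation is not shown.
public static class LevenshteinDistance
{
    // Classic dynamic-programming edit distance, keeping only two rows.
    public static int Compute(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }

            // The row just filled becomes the previous row for the next pass.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // Cheap pre-check (hypothetical helper): the distance is at least the
    // difference in lengths, so a larger difference rules the pair out.
    public static bool CanBeWithin(string a, string b, int maxDistance)
    {
        return Math.Abs(a.Length - b.Length) <= maxDistance;
    }
}
```

With the post's threshold of score < 2, the largest accepted distance is 1, so CanBeWithin(a, b, 1) is false for exactly the pairs that the Math.Abs(...) >= 2 test skips. The pre-check costs a single subtraction, while Compute fills a table proportional to the product of the two word lengths.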

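The requirement you express with if(!wordSynonymsByLevenstein.Any(x => x.Value.ContainsKey(eachSecondWord.Key))) is not particularly obvious or straightforward, but if the goal is that a word should not be associated with more than one other word, you can additionally keep a HashSet<string>. Add each word to the HashSet as you associate it, and check whether the next word is already in the set before continuing, instead of iterating over the nested dictionaries.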

public void SearchWordSynonymsByLevenstein()
{
    // Words that have already been associated with some other word.
    var used = new HashSet<string>();
    foreach (var eachWord in wordCounter)
    {
        foreach (var eachSecondWord in wordCounter)
        {
            if (eachWord.Key == eachSecondWord.Key
                || eachWord.Key.Length <= 3
                || Math.Abs(eachWord.Key.Length - eachSecondWord.Key.Length) >= 2)
            {
                continue;
            }

            var score = LevenshteinDistance.Compute(eachWord.Key, eachSecondWord.Key);
            if (score >= 2)
            {
                continue;
            }

            // Add returns false when the word is already in the set.
            if (used.Add(eachSecondWord.Key))
            {
                if (!wordSynonymsByLevenstein.ContainsKey(eachWord.Key))
                {
                    wordSynonymsByLevenstein.Add(eachWord.Key, new Dictionary<string, int> { { eachSecondWord.Key, eachSecondWord.Value } });
                }
                else
                {
                    wordSynonymsByLevenstein[eachWord.Key].Add(eachSecondWord.Key, eachSecondWord.Value);
                }
            }
        }
    }
}

Here I use if (used.Add(eachSecondWord.Key)) because Add returns true if the word was added and false if it was already in the HashSet.
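As a small illustration of that return value, here is a sketch that keeps only the first occurrence of each word. The class and method names (HashSetAddDemo, FirstOccurrences) are made up for this example and do not appear in the post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class; not part of the original code.
public static class HashSetAddDemo
{
    // Returns the words in their original order, dropping repeats.
    public static List<string> FirstOccurrences(IEnumerable<string> words)
    {
        var used = new HashSet<string>();
        var result = new List<string>();
        foreach (var word in words)
        {
            // Add both records the word and reports whether it was new,
            // so no separate Contains call is needed.
            if (used.Add(word))
            {
                result.Add(word);
            }
        }
        return result;
    }
}
```

The same single call replaces the Any(...) scan over every nested dictionary: a HashSet lookup takes constant time on average, whereas the scan visits every dictionary already stored in wordSynonymsByLevenstein for each candidate pair.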

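Thanks for the great tips :) That 'Math.Abs' check really does help cut the time. I changed the dictionary to what you suggested and now get the count values from 'wordCounter'. Thanks :) – Sidron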
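The change described in the comments could look roughly like the sketch below, assuming the synonyms are stored as a Dictionary<string, List<string>> and the counts are read from wordCounter when needed. SynonymCounts and TotalCount are hypothetical names used only for this illustration; the asker's final code is not shown in the thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; names are illustrative only.
public static class SynonymCounts
{
    // Sums the document counts of every synonym recorded for mainWord,
    // looking each count up in wordCounter instead of storing a copy.
    public static int TotalCount(
        Dictionary<string, List<string>> synonyms,
        Dictionary<string, int> wordCounter,
        string mainWord)
    {
        int total = 0;
        List<string> words;
        if (synonyms.TryGetValue(mainWord, out words))
        {
            foreach (var word in words)
            {
                total += wordCounter[word];
            }
        }
        return total;
    }
}
```

Storing only the words keeps the counts in one place, so they cannot drift out of sync with wordCounter. This assumes every synonym is also a key of wordCounter, which holds in the post because both loops iterate over that dictionary.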