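Checking an array of objects for uniqueness

I am reading data from files (such as CSV and Excel) and need to ensure that every row in the file is unique.

Each row is represented as an object[]. Because of the current architecture, this cannot be changed. Each object in the array can have a different type (decimal, string, int, etc.).

A file can look like this: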

foo 1  5 // Not unique 
bar 1  5 
bar 2  5 
foo 1  5 // Not unique

A file may have 200,000+ rows and 4 to 91 columns.

The code I currently have looks like this:

IList<object[]> rows = new List<object[]>();

using (var reader = _deliveryObjectReaderFactory.CreateReader(deliveryObject))
{
    // Read the row.
    while (reader.Read())
    {
        // Get the values from the file.
        var values = reader.GetValues();

        // Check uniqueness for row
        foreach (var row in rows)
        {
            bool rowsAreDifferent = false;

            // Check uniqueness for column.
            for (int i = 0; i < row.Length; i++)
            {
                var earlierValue = row[i];
                var newValue = values[i];
                if (earlierValue.ToString() != newValue.ToString())
                {
                    rowsAreDifferent = true;
                    break;
                }
            }
            if (!rowsAreDifferent)
                throw new Exception("Rows are not unique");
        }
        rows.Add(values);
    }
}

So, my question: can this be done more efficiently? For example, by using a hash and checking the hashes for uniqueness?

2016-05-17 smoksnes

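You do realize that two objects can have the same hash and still be unequal, don't you? In other words, even if your hash is correct, a file may contain duplicate hashes and still have unique rows. – phoog

How about using a HashSet with a custom equality comparer? – Jehof

@phoog, yes, I am well aware of that. The solution would check the hash first, and if the hashes are equal, the other values have to be checked as well. But checking the hash first may be more efficient than always comparing all the values. – smoksnes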

You can use a HashSet<object[]> with a custom IEqualityComparer<object[]>, like this:

HashSet<object[]> rows = new HashSet<object[]>(new MyComparer()); 

while (reader.Read()) 
{ 
    // Get the values from the file. 
    var values = reader.GetValues();
    if (!rows.Add(values))
        throw new Exception("Rows are not unique");
}

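MyComparer could be implemented like this:

using System.Collections.Generic;
using System.Linq;

public class MyComparer : IEqualityComparer<object[]>
{
    public bool Equals(object[] x, object[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (ReferenceEquals(x, null) || ReferenceEquals(y, null) || x.Length != y.Length) return false;
        // Compare element by element; object.Equals compares by value and tolerates nulls.
        return x.Zip(y, (a, b) => object.Equals(a, b)).All(c => c);
    }
    public int GetHashCode(object[] obj)
    {
        unchecked
        {
            // this returns 0 if obj is null
            // otherwise it combines the hashes of all elements
            // like hash = (hash * 397)^nextHash
            // if an array element is null its hash is assumed as 0
            // (this is the ReSharper suggestion for GetHashCode implementations)
            return obj?.Aggregate(0, (hash, o) => (hash * 397)^(o?.GetHashCode() ?? 0)) ?? 0;
        }
    }
}

The element comparison uses object.Equals(a, b) rather than a == b. With operands typed as object, == compares references, so two separately boxed values such as 1 and 1 would be treated as different and duplicate rows would go undetected.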
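
To make the two less obvious points concrete, here is a minimal sketch. The names RowHashDemo and CombineHashes are made up for illustration and are not part of the answer's comparer. It shows how == and object.Equals behave on boxed values, and it writes the hash combination as a plain loop without the C# 6 ?. operator, which should produce the same result as the Aggregate expression. The multiplier 397 is simply the prime that ReSharper's generated code uses; I would expect another odd prime to work equally well.

```csharp
using System;

public static class RowHashDemo
{
    // Loop form of the hash combination, written without C# 6 operators.
    public static int CombineHashes(object[] row)
    {
        if (row == null) return 0;      // a null row hashes to 0

        unchecked                       // overflow wraps around instead of throwing
        {
            int hash = 0;
            foreach (object o in row)
            {
                // a null element contributes 0 to the hash
                int next = (o == null) ? 0 : o.GetHashCode();
                // multiplying first makes the result depend on element order
                hash = (hash * 397) ^ next;
            }
            return hash;
        }
    }

    public static void Main()
    {
        object a = 5, b = 5;                    // two separately boxed ints
        Console.WriteLine(a == b);              // False: reference comparison
        Console.WriteLine(object.Equals(a, b)); // True: value comparison

        var row1 = new object[] { "foo", 1, 5m };
        var row2 = new object[] { "foo", 1, 5m };
        // Rows with equal elements get equal hashes.
        Console.WriteLine(CombineHashes(row1) == CombineHashes(row2)); // True
    }
}
```

Equal hashes only mean the rows might be equal. HashSet still calls the comparer's Equals on candidates whose hashes match, which is exactly the "check the hash first, then the values" strategy discussed in the comments on the question.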

2016-05-17 06:34:37

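Oh, I just saw that @Jehof had already suggested this while I was writing, so you may already know how to do it... –

Yes, I am trying it out right now. But without the fancy C# 6 features. ;) – smoksnes

The final return statement looks scary. I would probably need a lot of coffee and 15 minutes to figure out why it does what it does. Would you mind adding a line or two of comments on the '?' operators and on why you multiply by 397? – Marco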
