2017-10-19 184 views

Multivariate outlier removal with Mahalanobis distance

I have this data, which contains outliers. How can I find the Mahalanobis distance and use it to remove the outliers?

Mahalanobis distance works for i.i.d. data (see [this post on outlier detection](http://kldavenport.com/mahalanobis-distance-and-outliers/)). But your data is not i.i.d. – Maxim

Answers

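In multivariate data, Euclidean distance fails when there is covariance between the variables (in your case, between X, Y and Z), because it treats every direction as equally likely and ignores how the variables move together.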
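To make that concrete, here is a small sketch on synthetic, strongly correlated 2-D data (the data, the seed and the helper names `euclidean` and `mahalanobis` are made up for illustration and are not the question's data). Two points sit at the same Euclidean distance from the mean, but one follows the correlation and the other goes against it.

```python
import numpy as np

# Hypothetical, strongly correlated 2-D data (not the data from the question).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)
samples = np.column_stack([x, y])  # shape (500, 2), one row per sample

mean = samples.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(samples, rowvar=False))

def euclidean(p):
    """Plain Euclidean distance from the sample mean."""
    return float(np.linalg.norm(p - mean))

def mahalanobis(p):
    """Mahalanobis distance from the sample mean."""
    d = p - mean
    return float(np.sqrt(d @ inv_cov @ d))

along = mean + np.array([1.0, 1.0])    # moves along the correlation
across = mean + np.array([1.0, -1.0])  # moves against the correlation

print("Euclidean:  ", euclidean(along), euclidean(across))
print("Mahalanobis:", mahalanobis(along), mahalanobis(across))
```

The Euclidean distances are identical, while the Mahalanobis distance of the point that goes against the correlation should come out much larger, which is what you want from an outlier score.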

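So what the Mahalanobis distance does is:

  1. It transforms the variables into an uncorrelated space.

  2. It scales each variable so that its variance equals 1.

  3. It then computes the ordinary Euclidean distance.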
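The three steps above can be checked numerically. The following is only a sketch: it assumes the same layout as the data used in this answer (rows are variables, columns are samples), and the variable names are illustrative. It decorrelates with the eigenvectors of the covariance matrix, rescales by the square root of the eigenvalues, takes the Euclidean norm, and compares the result with the direct formula.

```python
import numpy as np

# Rows are the variables X, Y, Z; columns are samples.
data = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
                 [1, 4, 9, 16, 25, 36, 49, 64],
                 [1, 4, 9, 16, 25, 16, 49, 64]], dtype=float)

centered = data - data.mean(axis=1, keepdims=True)
cov = np.cov(data)

# Steps 1 and 2: project onto the eigenvectors of the covariance matrix
# (decorrelates the variables), then divide each component by the square
# root of its eigenvalue (gives each component unit variance).
eigvals, eigvecs = np.linalg.eigh(cov)
whitened = (eigvecs.T @ centered) / np.sqrt(eigvals)[:, np.newaxis]

# Step 3: ordinary Euclidean norm of each whitened sample (one per column).
md_whitened = np.linalg.norm(whitened, axis=0)

# Direct formula for comparison: sqrt(d^T S^-1 d) for every sample d.
inv_cov = np.linalg.inv(cov)
md_direct = np.sqrt(np.einsum("ij,jk,ki->i", centered.T, inv_cov, centered))

print(md_whitened)
print(md_direct)
```

Both printed arrays should agree up to floating-point error, since the whitening is just a factorisation of the inverse covariance matrix.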

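We can compute the Mahalanobis distance of each data sample $x_i$ as follows, where $\mu$ is the vector of variable means and $S$ is the covariance matrix:

$$D_M(x_i) = \sqrt{(x_i - \mu)^{T} S^{-1} (x_i - \mu)}$$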

Here is the Python code, with comments added so that you can follow what it does.

import numpy as np

# Each row is one variable (X, Y, Z); each column is one sample.
data = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
                 [1, 4, 9, 16, 25, 36, 49, 64],
                 [1, 4, 9, 16, 25, 16, 49, 64]])

def MahalanobisDist(data):
    x, y, z = data  # unpack the three variables
    covariance_xyz = np.cov(data)  # calculate the covariance matrix
    inv_covariance_xyz = np.linalg.inv(covariance_xyz)  # take the inverse of the covariance matrix
    xyz_mean = np.mean(x), np.mean(y), np.mean(z)
    x_diff = x - xyz_mean[0]  # difference between each sample and the mean of X
    y_diff = y - xyz_mean[1]  # difference between each sample and the mean of Y
    z_diff = z - xyz_mean[2]  # difference between each sample and the mean of Z
    diff_xyz = np.transpose([x_diff, y_diff, z_diff])  # one row per sample

    md = []
    for i in range(len(diff_xyz)):
        # calculate the Mahalanobis distance for each data sample
        md.append(np.sqrt(np.dot(np.dot(diff_xyz[i], inv_covariance_xyz), diff_xyz[i])))
    return md

def MD_removeOutliers(data):
    MD = MahalanobisDist(data)
    threshold = np.mean(MD) * 1.5  # adjust 1.5 accordingly
    outliers = []
    for i in range(len(MD)):
        if MD[i] > threshold:
            outliers.append(i)  # index of the outlier
    return np.array(outliers)

# Prints the indices of the samples flagged as outliers.
print(MD_removeOutliers(data))
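The code above only reports which samples look like outliers. If you also want to drop them, one possible follow-up is sketched below. It is self-contained, `mahalanobis_distances` is a hypothetical helper name (a vectorised variant of the same computation), and it reuses the same "1.5 times the mean distance" rule of thumb, which is a tuning choice rather than a fixed rule.

```python
import numpy as np

# Rows are the variables X, Y, Z; columns are samples.
data = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
                 [1, 4, 9, 16, 25, 36, 49, 64],
                 [1, 4, 9, 16, 25, 16, 49, 64]], dtype=float)

def mahalanobis_distances(data):
    """Mahalanobis distance of every column (sample) from the mean.

    Hypothetical helper: a vectorised version of the per-sample loop.
    """
    centered = data - data.mean(axis=1, keepdims=True)
    inv_cov = np.linalg.inv(np.cov(data))
    # For each sample i: sqrt(d_i^T S^-1 d_i)
    return np.sqrt(np.einsum("ji,jk,ki->i", centered, inv_cov, centered))

md = mahalanobis_distances(data)
threshold = 1.5 * md.mean()   # adjust 1.5 to be more or less strict
keep = md <= threshold        # boolean mask of the samples to keep

cleaned = data[:, keep]       # drop the flagged columns
print("outlier indices:", np.flatnonzero(~keep))
print("cleaned shape:", cleaned.shape)
```

Whether anything is flagged depends on the multiplier. With only eight samples the distances cannot spread very far, so it is worth printing `md` and checking the threshold by eye.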

Hope this helps.

References:

  1. http://mccormickml.com/2014/07/21/mahalanobis-distance/

  2. ​​

  3. https://www.youtube.com/watch?v=3IdvoI8O9hU&t=540s

I can't find a library that provides MahalanobisDist. Which library is it from? An explanation would be very helpful.

I have edited the answer.

Awesome answer! Now can you tell me why OpenCV's Mahalanobis requires multiple sets of data? (data1, data2, inverted_covariance)