2016-03-08 93 views

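3D Euclidean distance to identify unknown samples

I have a data frame called mydf with three principal covariates (PCA.1, PCA.2, PCA.3). I want to compute the 3D distance matrix and find the shortest Euclidean distance across all pairwise comparisons. In another data frame called myref, some Samples have a known Identity and some are unknown. Using the shortest Euclidean distance computed from mydf, I want to assign a known Identity to each unknown sample. Could someone help me do this?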

Here is mydf:

mydf <- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8", 
"9", "10", "12"), PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161, 
-0.019594, -0.019728, -0.020356, 0.043339, -0.017559, -0.020657 
), PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245, 
-0.011751, -0.010299, -0.005801, -0.01, -0.011334), PCA.3 = c(-0.008787, 
0.001412, 0.003751, 0.00371, 0.004242, 0.003738, 0.000592, -0.037229, 
0.004307, 0.00339)), .Names = c("Sample", "PCA.1", "PCA.2", "PCA.3" 
), row.names = c(NA, 10L), class = "data.frame") 

And here is myref:

myref<- structure(list(Sample = c("1", "2", "4", "5", "6", "7", "8", 
"9", "10", "12"), Identity = c("apple", "unknown", "ball", "unknown", 
"unknown", "car", "unknown", "cat", "unknown", "dog")), .Names = c("Sample", 
"Identity"), row.names = c(NA, 10L), class = "data.frame") 

Answers

uIX = which(myref$Identity == "unknown") 
dMat = as.matrix(dist(mydf[, -1])) # Calculate the Euclidean distance matrix 
nn = apply(dMat, 1, order)[2, ] # For each row of dMat, order the sample indices by increasing distance. 
           # Select the nearest neighbor (row 2 of the result, because row 1 is the sample itself) 
myref$Identity[uIX] = myref$Identity[nn[uIX]] 

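Note that the code above will set some identities to unknown, because the nearest neighbor of an unknown sample can itself be unknown. If you want to match each unknown sample to its nearest neighbor with a known identity, add the line below right after the second line (the one that computes dMat). It also sets each unknown sample's distance to itself to Inf, so for those rows the nearest known neighbor becomes the first entry of the ordering: use [1, ] instead of [2, ] in the line that computes nn.

dMat[uIX, uIX] = Inf 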
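
Matching each unknown sample only against samples with a known identity can also be sketched by subsetting the distance matrix, so that rows are unknown samples and columns are known ones. This is a minimal sketch rather than the answerer's code: the names unknown_idx, known_idx, d_sub, result and the MatchedSample column are illustrative, and it assumes that row i of myref describes row i of mydf, as in the question's data.

```r
# Data copied from the question.
mydf <- data.frame(
  Sample = c("1", "2", "4", "5", "6", "7", "8", "9", "10", "12"),
  PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161, -0.019594,
            -0.019728, -0.020356, 0.043339, -0.017559, -0.020657),
  PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245,
            -0.011751, -0.010299, -0.005801, -0.01, -0.011334),
  PCA.3 = c(-0.008787, 0.001412, 0.003751, 0.00371, 0.004242,
            0.003738, 0.000592, -0.037229, 0.004307, 0.00339),
  stringsAsFactors = FALSE
)
myref <- data.frame(
  Sample = mydf$Sample,
  Identity = c("apple", "unknown", "ball", "unknown", "unknown",
               "car", "unknown", "cat", "unknown", "dog"),
  stringsAsFactors = FALSE
)

# Assumption: both data frames list the samples in the same order.
stopifnot(identical(mydf$Sample, myref$Sample))

unknown_idx <- which(myref$Identity == "unknown")
known_idx <- which(myref$Identity != "unknown")

# Full Euclidean distance matrix over the three PCA columns.
d_mat <- as.matrix(dist(mydf[, c("PCA.1", "PCA.2", "PCA.3")]))

# Keep only unknown-to-known distances (rows: unknown, columns: known).
d_sub <- d_mat[unknown_idx, known_idx, drop = FALSE]

# Column position of the closest known sample, mapped back to a row index.
nearest_col <- apply(d_sub, 1, which.min)
nearest_known <- known_idx[nearest_col]

result <- myref
result$Identity[unknown_idx] <- myref$Identity[nearest_known]
result$MatchedSample <- NA_character_
result$MatchedSample[unknown_idx] <- myref$Sample[nearest_known]
print(result)
```

Because self-distances and unknown-to-unknown distances never enter d_sub, no sample should be left labeled unknown, and MatchedSample records which known sample each label came from.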

Why does it set some of them to unknown? Could you explain your code? – MAPK

I have added some comments. Hopefully they explain the code. – jMathew

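If you compute the distances between the rows of 'mydf', you will see that some nearest neighbors are themselves "unknown". For example, the nearest neighbor of sample 2 is sample 8, which is "unknown". – jMathew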
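
One way to check this is to tabulate every sample's nearest neighbor together with that neighbor's identity. The sketch below is illustrative only: identity and nn_table are made-up names, and the identities are assumed to be listed in the same row order as mydf.

```r
# Data copied from the question.
mydf <- data.frame(
  Sample = c("1", "2", "4", "5", "6", "7", "8", "9", "10", "12"),
  PCA.1 = c(0.00338, -0.020373, -0.019842, -0.019161, -0.019594,
            -0.019728, -0.020356, 0.043339, -0.017559, -0.020657),
  PCA.2 = c(0.00047, -0.010116, -0.011532, -0.011582, -0.013245,
            -0.011751, -0.010299, -0.005801, -0.01, -0.011334),
  PCA.3 = c(-0.008787, 0.001412, 0.003751, 0.00371, 0.004242,
            0.003738, 0.000592, -0.037229, 0.004307, 0.00339),
  stringsAsFactors = FALSE
)
identity <- c("apple", "unknown", "ball", "unknown", "unknown",
              "car", "unknown", "cat", "unknown", "dog")

d_mat <- as.matrix(dist(mydf[, -1]))
diag(d_mat) <- Inf  # ignore each sample's zero distance to itself

# Row index of the closest other sample, for every sample.
nn <- apply(d_mat, 1, which.min)

nn_table <- data.frame(
  Sample = mydf$Sample,
  Identity = identity,
  NearestSample = mydf$Sample[nn],
  NearestIdentity = identity[nn],
  Distance = d_mat[cbind(seq_len(nrow(d_mat)), nn)],
  stringsAsFactors = FALSE
)
print(nn_table)
```

Rows where both Identity and NearestIdentity are "unknown" are the samples that the original code leaves unlabeled.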