這可能有點霰彈槍的方法,因爲我不知道真實數據中有多少個簇。我在這裏嘗試所有可能的組合:
df <- data.frame(Cluster1 = c("A","A", "B", "B", "C","C", "C"),
Cluster2 = c("1", "2", "3", "3", "2","1","3"))
require(gtools)
comb <- permutations(n = 3, r = 3, v = 1:3)
#try every combination and count the matches
nmatch <- apply(comb,1,function(x) sum(LETTERS[match(df$Cluster2,x)] == df$Cluster1))
#pick the best performing translation
best <- comb[which.max(nmatch),]
# generate translation table
data.frame(Cluster2 = 1:3, Cluster2new = LETTERS[best])
結果:
Cluster2 Cluster2new
1 1 A
2 2 C
3 3 B
新的示例數據:
set.seed(314)
df <- data.frame(Cluster1 = sample(LETTERS[1:6], 100, replace =T), Cluster2 = sample(1:6,100, replace =T))
require(gtools)
comb <- permutations(n = 6, r = 6, v = 1:6)
#try every combination and count the matches
nmatch <- apply(comb,1,function(x) sum(LETTERS[match(df$Cluster2,x)] == df$Cluster1))
#pick the best performing translation
best <- comb[which.max(nmatch),]
# generate translation table
data.frame(Cluster2 = 1:3, Cluster2new = LETTERS[best])
結果:
Cluster2 Cluster2new
1 1 B
2 2 D
3 3 C
4 1 A
5 2 E
6 3 F
計算排列似乎是限制因素。因此,我有一個替代解決方案,隨機抽樣檢查可能性,並計算匹配百分比。這種方法要快得多,但可能不包含問題的最佳解決方案。
set.seed(314)
c = 10000
n = 10
tries = 1000
df <- data.frame(Cluster1 = sample(LETTERS[1:n], c, replace =T), Cluster2 = sample(1:n,c, replace =T))
#try every combination and count the matches
nmatch <- sapply(1:tries,function(x) {
set.seed(x)
comb <- sample(1:n,n)
sum(LETTERS[match(df$Cluster2,comb)] == df$Cluster1)
})
#pick the best performing translation
best <- which.max(nmatch)
# generate translation table
set.seed(best)
data.frame(Cluster2 = 1:n, Cluster2new = LETTERS[sample(1:n,n)])
nmatch[best]/c
結果:
Cluster2 Cluster2new
1 1 B
2 2 J
3 3 D
4 4 C
5 5 A
6 6 G
7 7 E
8 8 F
9 9 I
10 10 H
>
> nmatch[best]/c
[1] 0.1099
或較慢的迭代PROCES:
solve <- function(start)
{
sol <- integer()
start <- sample(1:n)
left <- start
for(i in start){
nmatch <- sapply(left, function(x) {
cl <- df[df$Cluster2==x,]
sum(LETTERS[cl$Cluster2] == cl$Cluster1)
})
ix <- which.max(nmatch)
sol[i] <- left[ix]
left <- left[-ix]
}
sol
}
nmatch <- sapply(1:tries, function(x) {
set.seed(x)
sum(LETTERS[match(df$Cluster2,solve(sample(1:n)))] == df$Cluster1)
})
best <- which.max(nmatch)
data.frame(Cluster2 = 1:n, Cluster2new = LETTERS[sample(1:n,n)])
nmatch[best]/c
結果:
Cluster2 Cluster2new
1 1 D
2 2 G
3 3 C
4 4 I
5 5 E
6 6 A
7 7 B
8 8 J
9 9 F
10 10 H
>
> nmatch[best]/c
[1] 0.1121
作爲一個例證,第二隨機理線,當你看的nmatch
分佈每個方法可能是得到一個很好的解決方案,更好地:
表(DF)/ nrow(DF) – AidanGawronski
我的目標是通過分配以最大化百分比協議,B,C到第2組,所以1,2,3在簇2將成爲A,B,C以及。在這種情況下,3將是B,1變成A,3變成C.我可以使用表(Df)來查找最大匹配的成員資格,但有時候事情會因多重匹配而變得複雜。 –
Df $ Var2 < - Df $ Var1 ...大聲笑現在你有100%的協議!儘管如此,我不知道你在做什麼。 – AidanGawronski