Recoding two different clustering methods

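I used two different clustering methods to produce two clustering results, each containing 10 different groups. However, the two results are coded differently. The example below shows what the clustering results look like: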

set.seed(1)
Df <- data.frame(Var1 = sample(1:6, 100, replace = TRUE), Var2 = sample(1:6, 100, replace = TRUE))
table(Df)

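I want to find the percent agreement (or the number of agreeing cases) between the two methods, and recode the levels of Cluster2 to those of Cluster1 so that the two have the maximum percent agreement (or number of agreeing cases). I wrote some algorithms to do this, but they were not very successful once the number of clusters increased. My data set has more than 100,000 cases.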

Source

2017-02-09 Kevin Zheng

table(Df) / nrow(Df) – AidanGawronski

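My goal is to maximize the percent agreement by assigning A, B, C to Cluster2, so that 1, 2, 3 in Cluster2 become A, B, C as well. In this case, 3 would become B, 1 would become A, and 2 would become C. I can use table(Df) to find the memberships with the largest overlap, but sometimes things get complicated because of multiple matches. –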

Df$Var2 <- Df$Var1 ... lol, now you have 100% agreement! Seriously though, I have no idea what you are trying to do. – AidanGawronski

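After thinking about it, I believe I found a simple answer to my own question. I can simply use a loop that pares down the contingency table and finds the matches one at a time.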

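set.seed(1)
# n and c are not defined in this post; the values from the other answer are assumed here
n <- 10     # number of clusters
c <- 10000  # number of cases
df <- data.frame(Cluster1 = sample(LETTERS[1:n], c, replace = TRUE), Cluster2 = sample(1:n, c, replace = TRUE))

# Greedy matching: repeatedly take the largest cell of the contingency table,
# record its row/column pair, then drop that row and column.
findmatch <- function(df, group1 = "Cluster1", group2 = "Cluster2") {
    n <- length(unique(df[, group1]))
    matches <- matrix(NA, n, 2)
    for (i in 1:n) {
        if (i == 1) {
            table1 <- table(df[, group1], df[, group2])
        } else if (i < n) {
            table1 <- table1[-maxs[1], -maxs[2]]
        }
        # [1, ] keeps a single (row, column) pair when several cells tie for the maximum
        maxs <- which(table1 == max(table1), arr.ind = TRUE)[1, ]
        if (i < n) {
            matches[i, 1:2] <- c(rownames(table1)[maxs[1]], colnames(table1)[maxs[2]])
        } else {
            # last step: the table is still 2 x 2, so the leftover pair is the final match
            matches[i, 1:2] <- c(rownames(table1)[-maxs[1]], colnames(table1)[-maxs[2]])
        }
    }
    return(matches)
}
findmatch(df = df)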


     [,1] [,2] 
[1,] "J" "5" 
[2,] "I" "7" 
[3,] "A" "6" 
[4,] "E" "3" 
[5,] "D" "10" 
[6,] "C" "8" 
[7,] "B" "1" 
[8,] "F" "9" 
[9,] "H" "2" 
[10,] "G" "4"
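
Note that taking the largest remaining cell at each step is a greedy heuristic, so it is not guaranteed to maximize the total number of agreeing cases. Finding the best one-to-one recoding is an instance of the linear sum assignment problem on the contingency table. As a sketch of that alternative (not the method used above), the following assumes the `clue` package is installed and uses its `solve_LSAP()` function. `best_recode()` is a hypothetical helper name, and it assumes Cluster2 has at least as many distinct labels as Cluster1:

```r
# Sketch only: optimal one-to-one recoding via the linear sum assignment problem.
# Requires the 'clue' package: install.packages("clue")
best_recode <- function(cluster1, cluster2) {
  if (!requireNamespace("clue", quietly = TRUE)) {
    stop("this sketch needs the 'clue' package")
  }
  # rows: Cluster1 labels, columns: Cluster2 labels, cells: number of shared cases
  tab <- unclass(table(cluster1, cluster2))
  # one distinct column per row, chosen so that the sum of the selected cells is maximal
  a <- as.integer(clue::solve_LSAP(tab, maximum = TRUE))
  data.frame(Cluster1 = rownames(tab),
             Cluster2 = colnames(tab)[a],
             n_agree  = tab[cbind(seq_len(nrow(tab)), a)],
             stringsAsFactors = FALSE)
}

# Small toy data set, for illustration only
toy1 <- c("A", "A", "B", "B", "C", "C", "C")
toy2 <- c(1, 2, 3, 3, 2, 1, 3)
res <- best_recode(toy1, toy2)
print(res)
sum(res$n_agree) / length(toy1)   # share of agreeing cases under the best recoding
```

Only the n x n table is handed to the solver, so the number of cases should matter little once the table has been built.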

Source

2017-02-15 17:22:09

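This may be a bit of a shotgun approach, since I don't know how many clusters there are in the real data. Here I try all possible combinations: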

df <- data.frame(Cluster1 = c("A", "A", "B", "B", "C", "C", "C"),
                 Cluster2 = c("1", "2", "3", "3", "2", "1", "3"))

library(gtools)
# all 3! = 6 orderings of the Cluster2 labels
comb <- permutations(n = 3, r = 3, v = 1:3)

# try every combination and count the matches:
# the label at position j of a permutation is translated to LETTERS[j]
nmatch <- apply(comb, 1, function(x) sum(LETTERS[match(df$Cluster2, x)] == df$Cluster1))

# pick the best performing translation
best <- comb[which.max(nmatch), ]
# generate translation table (same match() lookup as used for scoring)
data.frame(Cluster2 = 1:3, Cluster2new = LETTERS[match(1:3, best)])

Result:

Cluster2 Cluster2new 
1  1   A 
2  2   C 
3  3   B
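
Once a translation table like the one above is available, applying it to the data is a plain lookup. A minimal sketch that rebuilds the toy data and types in the table shown above by hand (the object name `translation` is only illustrative):

```r
df <- data.frame(Cluster1 = c("A", "A", "B", "B", "C", "C", "C"),
                 Cluster2 = c("1", "2", "3", "3", "2", "1", "3"),
                 stringsAsFactors = FALSE)

# translation table copied from the result above
translation <- data.frame(Cluster2    = c("1", "2", "3"),
                          Cluster2new = c("A", "C", "B"),
                          stringsAsFactors = FALSE)

# look up the new label for every case
df$Cluster2new <- translation$Cluster2new[match(df$Cluster2, translation$Cluster2)]

# agreeing cases sit on the diagonal of the cross table
table(df$Cluster1, df$Cluster2new)

# share of cases on which the two clusterings now agree
agreement <- mean(df$Cluster1 == df$Cluster2new)
agreement
```

For this toy data 4 of the 7 cases agree after recoding.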

New example data:

set.seed(314)
df <- data.frame(Cluster1 = sample(LETTERS[1:6], 100, replace = TRUE), Cluster2 = sample(1:6, 100, replace = TRUE))

library(gtools)
# all 6! = 720 orderings of the Cluster2 labels
comb <- permutations(n = 6, r = 6, v = 1:6)

# try every combination and count the matches
nmatch <- apply(comb, 1, function(x) sum(LETTERS[match(df$Cluster2, x)] == df$Cluster1))

# pick the best performing translation
best <- comb[which.max(nmatch), ]
# generate translation table (same match() lookup as used for scoring)
data.frame(Cluster2 = 1:6, Cluster2new = LETTERS[match(1:6, best)])

Result:

Cluster2 Cluster2new 
1  1   D 
2  2   A 
3  3   C 
4  4   B 
5  5   E 
6  6   F

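Computing the permutations seems to be the limiting factor, since their number grows as n! (already about 3.6 million for 10 clusters). So I have an alternative solution that checks a random sample of the possibilities and calculates the percentage of matches. This approach is much faster, but the sample may not contain the optimal solution.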

set.seed(314)

c <- 10000     # number of cases
n <- 10        # number of clusters
tries <- 1000  # number of random permutations to test

df <- data.frame(Cluster1 = sample(LETTERS[1:n], c, replace = TRUE), Cluster2 = sample(1:n, c, replace = TRUE))

# try `tries` random permutations and count the matches
nmatch <- sapply(1:tries, function(x) {
    set.seed(x)  # seed = try number, so the best permutation can be re-created later
    comb <- sample(1:n, n)
    sum(LETTERS[match(df$Cluster2, comb)] == df$Cluster1)
})

# pick the best performing translation
best <- which.max(nmatch)
# generate translation table: re-create the winning permutation from its seed
set.seed(best)
comb <- sample(1:n, n)
data.frame(Cluster2 = 1:n, Cluster2new = LETTERS[match(1:n, comb)])

# share of agreeing cases
nmatch[best] / c

Result:

Cluster2 Cluster2new 
1   1   E 
2   2   A 
3   3   D 
4   4   C 
5   5   G 
6   6   H 
7   7   F 
8   8   J 
9   9   I 
10  10   B 
> nmatch[best]/c 
[1] 0.1099
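
To put the trade-off in numbers: with n clusters there are n! possible recodings, so exhaustive search stops being practical quickly, and 1000 random tries cover only a tiny share of the possibilities. A quick illustrative calculation for the cluster counts mentioned in this thread (`share_tested` is just an illustrative column name):

```r
n_clusters <- c(3, 6, 10, 15)
tries <- 1000

# number of one-to-one recodings for each cluster count
n_perm <- factorial(n_clusters)

# upper bound on the share of recodings that `tries` random draws can cover
coverage <- data.frame(clusters     = n_clusters,
                       permutations = n_perm,
                       share_tested = pmin(1, tries / n_perm))
print(coverage)
```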

Or a slower, iterative process:

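# Greedy assignment: visit the letters in the (random) order given by `start`
# and give each letter the still-unassigned Cluster2 label that overlaps it most.
# sol[i] is the Cluster2 label that gets translated to LETTERS[i].
solve <- function(start)
{
    sol <- integer(length(start))
    left <- start
    for (i in start) {
        # overlap of every remaining Cluster2 label with letter i
        nmatch <- sapply(left, function(x) {
            sum(df$Cluster2 == x & df$Cluster1 == LETTERS[i])
        })
        ix <- which.max(nmatch)
        sol[i] <- left[ix]
        left <- left[-ix]
    }
    sol
}

# repeat the greedy assignment from `tries` random starting orders
nmatch <- sapply(1:tries, function(x) {
    set.seed(x)
    sum(LETTERS[match(df$Cluster2, solve(sample(1:n)))] == df$Cluster1)
})

best <- which.max(nmatch)

# re-create the best solution from its seed and build the translation table
set.seed(best)
sol <- solve(sample(1:n))
data.frame(Cluster2 = 1:n, Cluster2new = LETTERS[match(1:n, sol)])

nmatch[best] / c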

Result:

Cluster2 Cluster2new 
1   1   D 
2   2   G 
3   3   C 
4   4   I 
5   5   E 
6   6   A 
7   7   B 
8   8   J 
9   9   F 
10  10   H 
> nmatch[best]/c 
[1] 0.1121

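As an illustration, looking at the distribution of nmatch for each method suggests that the second randomized approach is more likely to arrive at a good solution.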

Source

2017-02-09 19:52:49 Wietze314

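This approach is promising and works for a small number of clusters. My problem is that with 15 clusters in each of the two clustering methods, the computation takes far too long! –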

I added another approach that tries to be faster, at the cost of being less accurate. – Wietze314

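Added yet another approach. I wonder whether someone with more of a mathematical background than me could solve this in a better way? – Wietze314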
