K意味着條件

1

符號：
- K是簇
的數量 - 讓我們說，前兩個變量是點coordinnates（X，Y）
- V表示第三可變
- 次：總和伏多個每個簇i的
- S的總和（和Ci）
- 和閾值T

問題定義：
從我所瞭解的目標是運行一種算法，在尊重約束的同時保持kmeans的精神。

任務1 - 基團由鄰近點到質心[k均值]
任務2 - 爲每個羣集I，Cl> T * [約束]

普通k均值限制用於約束問題：
常規kmeans，通過以任意順序將它們分配給質心。在我們的情況下，這會導致詞的增長失控，同時增加點數。

例如，K = 2，T = 40和4點，第三個變量等於V1 = 50，V2 = 1，V3 = 50，V4 = 50。也假設點P1，P3，P4更接近質心1點P2靠近質心2.

現在運行規則的k均值的assignement一步，並保持詞的軌跡：
1--起飛點P1，將其分配到簇1. C1 = 50> T
2 - 取點P2，將其分配到簇2 C2 = 1
3 - 取點P3，將其分配到簇1。 => C1增長太多！
4--取P4點，分配給簇1. C1 = 150> T => !!!

修改k均值：
在前面，我們要防止增長過多C1和C2幫助成長。

這就像將香檳倒入幾杯：如果你看到一杯香檳較少的杯子，你就去填充它。你這樣做是因爲你有約束：有限的champaigne（S有界）和因爲你想要每個玻璃都有足夠的香檳（Ci> T）。

當然這只是一個比喻。我們修改後的kmeans將以最小的Ci爲集羣添加新的分支，直到達到約束（Task2）。現在我們應該添加點數？靠近質心（Task1）。在所有集羣i完成所有約束條件後，我們可以在剩餘的未分配點上運行常規kmeans。

執行：
接下來，我給出了修改算法的python實現。圖1顯示了使用透明度來渲染大VS低值的第三個變量的復現。圖2顯示了使用顏色的演化羣集。

你可以玩accept_thresh參數。請注意：
對於accept_thresh = 0 =>常規kmeans（立即達到約束）
對於accept_thresh = third_var.sum（）。sum（）/（2 * K），您可能會觀察到某些點由於約束原因，更接近給定的質心會受到另一個質心的影響。

CODE：

import numpy as np 
import matplotlib.pyplot as plt 
from sklearn import datasets 
import time 

nb_samples = 1000 
K = 3 # for demo purpose, used to generate cloud points 
c_std = 1.2 

# Generate test samples : 
points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples, \ 
             centers=K, cluster_std=c_std) 

third_var_distribution = 'cubic_bycluster' # 'uniform' 

if third_var_distribution == 'uniform': 
    third_var = np.random.random((nb_samples)) 
elif third_var_distribution == 'linear_bycluster': 
    third_var = np.random.random((nb_samples)) 
    third_var = third_var * classes 
elif third_var_distribution == 'cubic_bycluster': 
    third_var = np.random.random((nb_samples)) 
    third_var = third_var * classes 


# Threshold parameters : 
# Try with K=3 and : 
# T = K => one cluster reach cosntraint, two clusters won't converge 
# T = 2K => 
accept_thresh = third_var.sum().sum()/(2*K) 


def dist2centroids(points, centroids): 
    '''return arrays of ordered points to each centroids 
     first array is index of points 
     second array is distance to centroid 
     dim 0 : centroid 
     dim 1 : distance or point index 
    ''' 
    dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2)) 
    ord_dist_indices = np.argsort(dist, axis=1) 

    ord_dist_indices = ord_dist_indices.transpose() 
    dist = dist.transpose() 

    return ord_dist_indices, dist 


def assign_points_with_constraints(inds, dists, tv, accept_thresh): 
    assigned = [False] * nb_samples 
    assignements = np.ones(nb_samples, dtype=int) * (-1) 
    cumul_third_var = np.zeros(K, dtype=float) 
    current_inds = np.zeros(K, dtype=int) 

    max_round = nb_samples * K 

    for round in range(0, max_round): # we'll break anyway 
     # worst advanced cluster in terms of cumulated third_var : 
     cluster = np.argmin(cumul_third_var) 

     if cumul_third_var[cluster] > accept_thresh: 
      continue # cluster had enough samples 

     while current_inds[cluster] < nb_samples: 
      # add points to increase cumulated third_var on this cluster 
      i_inds = current_inds[cluster] 
      closest_pt_index = inds[i_inds][cluster] 

      if assigned[closest_pt_index] == True: 
       current_inds[cluster] += 1 
       continue # pt already assigned to a cluster 

      assignements[closest_pt_index] = cluster 
      cumul_third_var[cluster] += tv[closest_pt_index] 
      assigned[closest_pt_index] = True 
      current_inds[cluster] += 1 

      new_cluster = np.argmin(cumul_third_var) 
      if new_cluster != cluster: 
       break 

    return assignements, cumul_third_var 


def assign_points_with_kmeans(points, centroids, assignements): 
    new_assignements = np.array(assignements, copy=True) 

    count = -1 
    for asg in assignements: 
     count += 1 

     if asg > -1: 
      continue 

     pt = points[count, :] 

     distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1)) 
     centroid = np.argmin(distances) 

     new_assignements[count] = centroid 

    return new_assignements 


def move_centroids(points, labels): 
    centroids = np.zeros((K, 2), dtype=float) 

    for k in range(0, K): 
     centroids[k] = points[assignements == k].mean(axis=0) 

    return centroids 


rgba_colors = np.zeros((third_var.size, 4)) 
rgba_colors[:, 0] = 1.0 
rgba_colors[:, 3] = 0.1 + (third_var/max(third_var))/1.12 
plt.figure(1, figsize=(14, 14)) 
plt.title("Three blobs", fontsize='small') 
plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors) 

# Initialize centroids 
centroids = np.random.random((K, 2)) * 10 
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red') 

# Step 1 : order points by distance to centroid : 
inds, dists = dist2centroids(points, centroids) 

# Check if clustering is theoriticaly possible : 
tv_sum = third_var.sum() 
tv_max = third_var.max() 
if (tv_max > 1/3 * tv_sum): 
    print("No solution to the clustering problem !\n") 
    print("For one point : third variable is too high.") 
    sys.exit(0) 

stop_criter_eps = 0.001 
epsilon = 100000 
prev_cumdist = 100000 

plt.figure(2, figsize=(14, 14)) 
ln, = plt.plot([]) 
plt.ion() 
plt.show() 

while epsilon > stop_criter_eps: 

    # Modified kmeans assignment : 
    assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh) 

    # Kmeans on remaining points : 
    assignements = assign_points_with_kmeans(points, centroids, assignements) 

    centroids = move_centroids(points, assignements) 

    inds, dists = dist2centroids(points, centroids) 

    epsilon = np.abs(prev_cumdist - dists.sum().sum()) 

    print("Delta on error :", epsilon) 

    prev_cumdist = dists.sum().sum() 

    plt.clf() 
    plt.title("Current Assignements", fontsize='small') 
    plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements) 
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10) 
    plt.text(0,0,"THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large') 
    for k in range(0, K): 
     plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k])) 
    plt.show() 
    plt.pause(1)

改進：
- 使用第三可變的工作分配的分佈。
- 管理算法
的分歧 - 更好的初始化（k均值++）

來源

2017-05-12 15:47:59 MHICV

+0

非常感謝您的支持！這對我幫助很大。我也想第二個解決方案：基本上，它是這樣的： - 沒有約束應用KMEANS - 留着備用滿足約束集羣 - 組羣不滿足約束 - 運行KMEANS - 迭代 –

+0

不客氣。您的解決方案可能會取決於您在每次迭代中選擇的k的值。但是請注意，對於算法的第3步，如果有N個不滿足約束條件的集羣，則無法在滿足約束的N個集羣中找到剩餘點的新分區，因爲全局數量SUM_over_points（third_variable）< N x閾值。因此，在步驟4（'運行Kmeans'）：你必須選擇K嚴格地低於N的kmeans。 – MHICV

+0

的確，在每次迭代中，我選擇k = int（sum_over_points（third_value）/ threshold）。最後，我自然有一個不尊重約束的集羣，但這是我可以做的最好的，以保持其他集羣的顯着性 –

0

處理此問題的一種方法是在羣集之前過濾數據。

>>> cluster_data = df.loc[df['third_variable'] > some_value] >>> from sklearn.cluster import KMeans >>> y_pred = KMeans(n_clusters=2).fit_predict(cluster_data)

如果和你的意思是每個集羣的第三個變量的總和，那麼你可以使用RandomSearchCV找到KMeans該做或不符合條件的超參數。

來源

2017-05-10 14:34:08 spies006

+0

事實上，和我的意思是每個集羣的第三個變量的總和，可以請您解釋一點如何在這種情況下使用RandomSearchCV？ –

+0

@YahyaMortassim在我這樣做之前，讓我問你一個問題。您是否想要更改KMeans算法，或者您是否可以搜索一系列hyperparams並找到符合您的條件的算法？ – spies006

+0

我正在尋找改變KMeans算法 –

0

K-means本身就是一個優化問題。

您的附加約束也是一個相當常見的優化約束。

所以我寧願用一個優化求解器來解決這個問題。

來源

2017-05-10 19:41:41

K意味着條件

回答

相關問題