2016-03-08 95 views

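Display cluster labels for a scipy dendrogram

I am using hierarchical clustering to cluster word vectors, and I want the user to be able to display a dendrogram showing the clusters. However, since there can be thousands of words, I want the dendrogram truncated to some reasonable size, with each leaf labeled by a string of the most significant words in that cluster.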

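My problem is that, according to the docs, "The labels[i] value is the text to put under the i-th leaf node only if it corresponds to an original observation and not a non-singleton cluster." I take this to mean that I can't label clusters, only singletons?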

To illustrate, here is a short Python script that generates a simple labeled dendrogram:

import numpy as np 
from scipy.cluster.hierarchy import dendrogram, linkage 
from matplotlib import pyplot as plt 

randomMatrix = np.random.uniform(-10,10,size=(20,3)) 
linked = linkage(randomMatrix, 'ward') 

labelList = ["foo" for i in range(0, 20)] 

plt.figure(figsize=(15, 12)) 
dendrogram(
      linked, 
      orientation='right', 
      labels=labelList, 
      distance_sort='descending', 
      show_leaf_counts=False 
     ) 
plt.show() 

a dendrogram of randomly generated points

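Now suppose I want to truncate to just 5 leaves and label each leaf like "foo, foo, foo...", i.e. with the words that make up that cluster. (Note: generating these labels is not the issue here.) I truncate the dendrogram and supply a label list of matching length: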

labelList = ["foo, foo, foo..." for i in range(0, 5)] 
dendrogram(
      linked, 
      orientation='right', 
      p=5, 
      truncate_mode='lastp', 
      labels=labelList, 
      distance_sort='descending', 
      show_leaf_counts=False 
     ) 

And here is the problem: the leaves are drawn with no labels.


I think the 'leaf_label_func' parameter might be useful here, but I'm not sure how to use it.

Answer


You are right about using the leaf_label_func parameter.

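In addition to creating a plot, the dendrogram function returns a dictionary containing several lists (the docs call it R). The leaf_label_func you create must take a value from R["leaves"] and return the desired label. The easiest way to set the labels is to run dendrogram twice: once with no_plot=True to get the dictionary used to build your label map, and then again to create the plot.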

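import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

randomMatrix = np.random.uniform(-10,10,size=(20,3))
linked = linkage(randomMatrix, 'ward')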

labels = ["A", "B", "C", "D"] 
p = len(labels) 

plt.figure(figsize=(8,4)) 
plt.title('Hierarchical Clustering Dendrogram (truncated)', fontsize=20) 
plt.xlabel('Look at my fancy labels!', fontsize=16) 
plt.ylabel('distance', fontsize=16) 

# call dendrogram to get the returned dictionary 
# (plotting parameters can be ignored at this point) 
R = dendrogram(
       linked, 
       truncate_mode='lastp', # show only the last p merged clusters 
       p=p, # show only the last p merged clusters 
       no_plot=True, 
       ) 

print("values passed to leaf_label_func\nleaves : ", R["leaves"]) 

# create a label dictionary 
temp = {R["leaves"][ii]: labels[ii] for ii in range(len(R["leaves"]))} 
def llf(xx): 
    return "{} - custom label!".format(temp[xx]) 

## This version gives you your label AND the count 
# temp = {R["leaves"][ii]:(labels[ii], R["ivl"][ii]) for ii in range(len(R["leaves"]))} 
# def llf(xx): 
#  return "{} - {}".format(*temp[xx]) 


dendrogram(
      linked, 
      truncate_mode='lastp', # show only the last p merged clusters 
      p=p, # show only the last p merged clusters 
      leaf_label_func=llf, 
      leaf_rotation=60., 
      leaf_font_size=12., 
      show_contracted=True, # to get a distribution impression in truncated branches 
      ) 
plt.show()
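
The lookup table above pairs R["leaves"] with a hand-written label list. Because leaf_label_func is called with the cluster id of each leaf, the label could also be built from the members of that cluster, which is closer to what the question asks for. The following is only a sketch under some assumptions: `make_leaf_label_func`, `words` and `max_words` are hypothetical names introduced here, and taking the first few members of a cluster is a placeholder for whatever "most significant words" ranking is actually wanted. It relies on `scipy.cluster.hierarchy.to_tree` with `rd=True` to look up a node by id and on `ClusterNode.pre_order()` to list the original observations under it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree


def make_leaf_label_func(Z, words, max_words=3):
    """Return a leaf_label_func that labels a leaf with member words.

    Hypothetical helper: `words[i]` is assumed to be the word for
    observation i in the data that produced the linkage matrix Z.
    """
    # nodes[k] is the ClusterNode whose id is k (0 .. 2n-2)
    _, nodes = to_tree(Z, rd=True)

    def llf(cluster_id):
        # ids of the original observations contained in this cluster
        members = nodes[cluster_id].pre_order()
        # placeholder choice: just take the first few members
        shown = [words[i] for i in members[:max_words]]
        suffix = "..." if len(members) > max_words else ""
        return ", ".join(shown) + suffix

    return llf


rng = np.random.default_rng(0)
data = rng.uniform(-10, 10, size=(20, 3))
Z = linkage(data, "ward")

# made-up vocabulary standing in for the real word list
words = ["word{}".format(i) for i in range(len(data))]
llf = make_leaf_label_func(Z, words)

# no_plot=True only computes the layout; drop it to draw the figure
R = dendrogram(
    Z,
    truncate_mode="lastp",
    p=5,
    leaf_label_func=llf,
    no_plot=True,
)
for leaf in R["leaves"]:
    print(leaf, "->", llf(leaf))
```

With this approach a single call to dendrogram is enough for plotting, since the label function no longer depends on the order of R["leaves"].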