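Plotting a 2D graph of document TF-IDF

I want to plot a 2D graph with the terms on the x-axis and the TF-IDF score (or the document id) on the y-axis for my list of sentences. I used scikit-learn's fit_transform() to get the scipy matrix, but I don't know how to use that matrix to draw the graph. I am trying to get a plot so I can see how well my sentences can be classified with kmeans.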

Here is the output of fit_transform(sentence_list):

(document id, term number) TF-IDF score

(0, 1023) 0.209291711271
(0, 924) 0.174405532933
(0, 914) 0.174405532933
(0, 821) 0.15579574484
(0, 770) 0.174405532933
(0, 763) 0.159719994016
(0, 689) 0.135518787598

Here is my code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list = ["Hi how are you", "Good morning" ...]
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
num_samples, num_features = vectorized.shape
print("num_samples: %d, num_features: %d" % (num_samples, num_features))
num_clusters = 10
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
print(km.labels_)  # Returns an array of cluster labels ranging from 0 to 9

Thanks,


2015-01-26 jxn

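Does the following work for you? It should if you are only looking for a simple 2D plot. http://matplotlib.org/examples/pylab_examples/simple_plot.html – 2015-01-26 23:35:41

When you use Bag of Words, each of your sentences is represented in a high-dimensional space whose length equals the size of the vocabulary. If you want to represent this in 2D, you need to reduce the dimension, for example using PCA with two components: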

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# Load two newsgroups as example documents
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
# Bag-of-words counts followed by TF-IDF weighting
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# Convert the sparse TF-IDF matrix to a dense array before running PCA
X = pipeline.fit_transform(newsgroups_train.data).toarray()

# Project the documents onto the first two principal components
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
# Color each point by its newsgroup label
plt.scatter(data2D[:, 0], data2D[:, 1], c=newsgroups_train.target)
plt.show()  # not required if using ipython notebook

[Image: scatter plot of data2D]

Now you can, for example, compute the cluster centers and plot them on this data:

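from sklearn.cluster import KMeans

# Cluster in the full TF-IDF space, then project the centers into the same 2D space
kmeans = KMeans(n_clusters=2).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)

# Redraw the documents and overlay the cluster centers as red crosses
plt.scatter(data2D[:, 0], data2D[:, 1], c=newsgroups_train.target)
plt.scatter(centers2D[:, 0], centers2D[:, 1],
            marker='x', s=200, linewidths=3, c='r')
plt.show()  # not required if using ipython notebook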
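
If converting the whole TF-IDF matrix to a dense array takes too much memory, one possible alternative is scikit-learn's TruncatedSVD, which accepts sparse input directly. This is a different technique from the PCA used above: it does not center the data, so the 2D coordinates will generally not match the PCA projection. A minimal sketch, with hypothetical documents standing in for the newsgroups data:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, used only for illustration.
docs = [
    "the shuttle reached orbit after launch",
    "nasa plans a new mission to the moon",
    "the debate about belief and reason continues",
    "arguments about religion and atheism",
    "a rocket launch was delayed by weather",
]

# Keep the TF-IDF matrix sparse; no dense conversion is needed here.
X_sparse = TfidfVectorizer().fit_transform(docs)

# Reduce to two components directly on the sparse matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
data2D_svd = svd.fit_transform(X_sparse)

print(data2D_svd.shape)
```

The resulting array has one row per document and two columns, so it can be passed to plt.scatter in the same way as data2D.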


2015-01-29 01:12:17 elyase

Yes, that's it. Thank you! – jxn 2015-01-29 01:46:33

Can I just use TfidfVectorizer instead of CountVectorizer followed by TfidfTransformer? Would the pipeline code then look like this: 'pipeline = Pipeline([('tfidf', TfidfVectorizer())])'? – jxn 2015-01-29 20:18:13
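
The scikit-learn documentation describes TfidfVectorizer as equivalent to CountVectorizer followed by TfidfTransformer, so a one-step pipeline like the one in the comment should produce the same matrix when both use default parameters. The sketch below checks this on a few hypothetical documents; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Hypothetical documents, used only for illustration.
docs = [
    "Hi how are you",
    "Good morning to you",
    "How is the weather this morning",
]

# Two-step version: raw counts first, then TF-IDF weighting.
two_step = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
# One-step version suggested in the comment.
one_step = Pipeline([("tfidf", TfidfVectorizer())])

X_two = two_step.fit_transform(docs)
X_one = one_step.fit_transform(docs)

# With default parameters both routes should give the same values.
same = np.allclose(X_two.toarray(), X_one.toarray())
print(same)
```

With a single step the Pipeline wrapper is optional; calling TfidfVectorizer().fit_transform(docs) directly gives the same matrix.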

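I'm getting an error on 'plt.scatter(data2D[:,0], data2D[:,1], c=data.target)', specifically on 'c=data.target'. If I want the colors of the scatter plot to follow the clusters found by kmeans, what should I use in place of 'data.target'? 'kmeans.label_'? # this returns a list. – jxn 2015-01-29 22:24:23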
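
On the coloring question: the fitted KMeans attribute is named labels_ (with an "s"), and it holds one integer cluster id per document, which plt.scatter accepts as its c argument. Below is a minimal sketch under a few assumptions: the documents are hypothetical, the Agg backend is selected only so that the code runs without a display, and the output filename is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, used only for illustration.
docs = [
    "the shuttle reached orbit after launch",
    "nasa plans a new mission to the moon",
    "a rocket launch was delayed by weather",
    "the debate about belief and reason continues",
    "arguments about religion and atheism",
    "belief and religion in modern debate",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Cluster in the full TF-IDF space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Project documents and cluster centers into the same 2D space.
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
centers2D = pca.transform(kmeans.cluster_centers_)

fig, ax = plt.subplots()
# Color each document by the cluster that KMeans assigned to it.
ax.scatter(data2D[:, 0], data2D[:, 1], c=kmeans.labels_)
ax.scatter(centers2D[:, 0], centers2D[:, 1],
           marker="x", s=200, linewidths=3, c="r")
fig.savefig("tfidf_clusters.png")  # hypothetical output filename
```

Because the cluster ids are arbitrary integers, the colors only show which documents were grouped together; they carry no meaning beyond that.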
