2015-01-26 111 views
5

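Plotting a 2D graph of document TF-IDF

I want to plot a 2D graph for my list of sentences, with the terms on the x-axis and the TF-IDF score (or document ID) on the y-axis. I use scikit-learn's fit_transform() to get a scipy matrix, but I don't know how to use that matrix to draw the plot. I'm trying to get a plot that shows how well my sentences can be grouped with kmeans.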

Here is the output of fit_transform(sentence_list):

(document ID, term index) TF-IDF score

(0, 1023) 0.209291711271
(0, 924) 0.174405532933
(0, 914) 0.174405532933
(0, 821) 0.15579574484
(0, 770) 0.174405532933
(0, 763) 0.159719994016
(0, 689) 0.135518787598
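For readers unsure how to read that output, here is a minimal sketch of turning the sparse matrix back into (document ID, term, score) triples. The toy sentences and the names `index_to_term` and `triples` are made up for illustration; they are not from the original post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, invented for illustration only.
sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats play",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)  # scipy sparse matrix, shape (n_docs, n_terms)

# vocabulary_ maps term -> column index; invert it to look terms up by index.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# COO format exposes parallel arrays of row indices, column indices, and values.
coo = tfidf.tocoo()
triples = [
    (int(doc), index_to_term[int(col)], float(score))
    for doc, col, score in zip(coo.row, coo.col, coo.data)
]

for doc_id, term, score in triples:
    print(doc_id, term, round(score, 3))
```

Each printed row corresponds to one `(document ID, term index) score` entry above, with the term index replaced by the term itself.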

Here is my code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentence_list=["Hi how are you", "Good morning" ...]
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
num_samples, num_features = vectorized.shape
print("num_samples: %d, num_features: %d" % (num_samples, num_features))
num_clusters = 10
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
print(km.labels_)  # array with one cluster label per sentence, ranging from 0 to 9

Thanks,

+0

Does the following work for you? It should if you are only looking for a simple 2D plot. http://matplotlib.org/examples/pylab_examples/simple_plot.html – 2015-01-26 23:35:41

Answers

15

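When you use a bag-of-words representation, each of your sentences is represented in a high-dimensional space whose dimension equals the size of the vocabulary. If you want to show this in 2D, you need to reduce the dimensionality, for example using PCA with two components: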

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
# PCA needs dense input; toarray() returns a plain ndarray
X = pipeline.fit_transform(newsgroups_train.data).toarray()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
# color each document by its newsgroup label
plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)
plt.show()    # not required if using ipython notebook

[Figure: data2D scatter plot]
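Converting the TF-IDF matrix to a dense array can use a lot of memory on a large corpus. One possible alternative, which is a different technique from the PCA used in this answer (it does not center the data), is `TruncatedSVD`, which accepts the sparse matrix directly. A minimal sketch on invented toy documents:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
docs = [
    "the rocket reached orbit",
    "the satellite left orbit",
    "the church held a service",
    "the priest led the service",
]

X_sparse = TfidfVectorizer().fit_transform(docs)  # stays a scipy sparse matrix

# TruncatedSVD works on sparse input, so no dense conversion is needed.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X_sparse)  # dense array, shape (n_docs, 2)

print(coords.shape)
```

The two columns of `coords` can be passed to `plt.scatter` in the same way as `data2D`.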

Now you can, for example, compute the clusters this data falls into and plot their centers:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)

# redraw the documents, then overlay the cluster centers on the same axes
plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)
plt.scatter(centers2D[:,0], centers2D[:,1],
            marker='x', s=200, linewidths=3, c='r')
plt.show()    # not required if using ipython notebook


+0

Yes, that's it. Thank you! – jxn 2015-01-29 01:46:33

+0

Can I just use TfidfVectorizer instead of CountVectorizer followed by TfidfTransformer? Would the pipeline code look like this: 'pipeline = Pipeline([('tfidf', TfidfVectorizer())])'? – jxn 2015-01-29 20:18:13
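The scikit-learn documentation describes `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`. A small sketch that checks this with default settings on invented toy documents (the variable names are illustrative only):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Toy documents, invented for illustration only.
docs = [
    "the rocket reached orbit",
    "the satellite left orbit",
    "the church held a service",
]

# Two-step version, as in the answer above.
two_step = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])
X_two_step = two_step.fit_transform(docs)

# One-step version using TfidfVectorizer alone.
X_one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two matrices element by element.
same = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print(same)
```

Since `TfidfVectorizer` is itself a transformer, wrapping it in a one-step `Pipeline` should also work, though calling `fit_transform` on it directly is shorter.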

+3

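I'm getting an error on 'plt.scatter(data2D[:,0], data2D[:,1], c=data.target)', specifically at 'c=data.target'. If I want the colors of the scatter plot to follow the clusters found by kmeans, what should I use instead of 'data.target'? 'kmeans.label_'? # this returns a list. – jxn 2015-01-29 22:24:23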
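One way to color the points by cluster is to pass the fitted model's `labels_` attribute (note the plural) as `c`. The sketch below assumes invented toy sentences and a fixed `random_state`, neither of which comes from the thread:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, invented for illustration only.
docs = [
    "the rocket reached orbit",
    "the satellite left orbit",
    "the rocket carried a satellite",
    "the church held a service",
    "the priest led the service",
    "the church welcomed the priest",
]

X = TfidfVectorizer().fit_transform(docs).toarray()  # PCA needs a dense array
data2D = PCA(n_components=2).fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # NumPy array with one integer cluster label per document

# Color each point by the cluster that kmeans assigned to it.
fig, ax = plt.subplots()
ax.scatter(data2D[:, 0], data2D[:, 1], c=labels)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```

Call `plt.show()` afterwards to display the figure outside a notebook.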