AttributeError when using scikit-learn

I am trying to find similar questions using cosine similarity with scikit-learn. I am trying out this sample code that is available on the internet (Link1 and Link2).

from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer 
from nltk.corpus import stopwords 
import numpy as np 
import numpy.linalg as LA 

train_set = ["The sky is blue.", "The sun is bright."] 
test_set = ["The sun in the sky is bright."] 
stopWords = stopwords.words('english') 

vectorizer = CountVectorizer(stop_words = stopWords) 
transformer = TfidfTransformer() 

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray() 
print 'Fit Vectorizer to train set', trainVectorizerArray 
print 'Transform Vectorizer to test set', testVectorizerArray 
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3) 

for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine

transformer.fit(trainVectorizerArray) 
print transformer.transform(trainVectorizerArray).toarray() 

transformer.fit(testVectorizerArray) 
tfidf = transformer.transform(testVectorizerArray) 
print tfidf.todense()

I always get this error:

Traceback (most recent call last):
  File "C:\Users\Animesh\Desktop\NLP\ngrams2.py", line 14, in <module>
    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
  File "C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 740, in fit_transform
    raise ValueError("empty vocabulary; training set may have"
ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).

I also checked the code at this link. There I got the error AttributeError: 'CountVectorizer' object has no attribute 'vocabulary'.

How can I solve this?

I am using Python 2.7.3 on Windows 7 32-bit with scikit-learn 0.13.1.

Source

2013-03-05 Animesh Pandey

I am running the development (pre-0.14) version, in which the feature_extraction.text module was overhauled, so I don't get the same error message. But I suspect you can fix this with:

vectorizer = CountVectorizer(stop_words=stopWords, min_df=1)

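The min_df parameter makes CountVectorizer throw away any terms that occur in too few documents (because they would have no predictive value). In your version it defaults to 2, and each of your terms occurs in only one document, so all of them get thrown away and you end up with an empty vocabulary.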
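To make the effect of the threshold concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is not scikit-learn's implementation: the build_vocabulary helper and the three-word STOP_WORDS set are hypothetical stand-ins for CountVectorizer and the NLTK stop-word list, and the tokenizer is a simple regular expression.

```python
import re

# Hypothetical, much shorter stop-word list than NLTK's, for illustration only.
STOP_WORDS = {"the", "is", "in"}


def build_vocabulary(documents, min_df):
    """Keep only the terms that appear in at least `min_df` documents."""
    doc_freq = {}
    for doc in documents:
        # Lowercase, keep tokens of two or more word characters, drop stop words.
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower())) - STOP_WORDS
        for token in tokens:
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


train_set = ["The sky is blue.", "The sun is bright."]

# Every remaining term occurs in exactly one document, so a threshold of 2
# removes all of them, while a threshold of 1 keeps all four.
vocab_min_df_2 = build_vocabulary(train_set, min_df=2)
vocab_min_df_1 = build_vocabulary(train_set, min_df=1)
print(vocab_min_df_2)
print(vocab_min_df_1)
```

With only two short training sentences no term is shared between documents, which is why lowering the threshold to 1 is enough here.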

Source

2013-03-05 10:11:02

Oh! That solved the problem. But can you tell me what the vocabulary attribute is, and why I get an attribute error when I try to use it? – 2013-03-05 10:33:08

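@AnimeshPandey: The error message has it right: "empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low)." As I explained, the default setting "min_df=2" is too high, because you only have two documents. (Note that tf-idf does not work well with so few documents.) – 2013-03-05 10:33:45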

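The attribute is 'vocabulary_', with a trailing '_'. It is extracted when the fit method is called (unless the user supplied a vocabulary as a constructor argument). See the [documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). – ogrisel 2013-03-05 10:34:00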
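A short sketch of the point about the trailing underscore, written for Python 3 and a recent scikit-learn release rather than the Python 2.7 / 0.13.1 setup in the question. As an assumption made to keep the example self-contained, scikit-learn's built-in "english" stop-word list replaces the NLTK list.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]

# Assumption: the built-in English stop-word list stands in for NLTK's list.
vectorizer = CountVectorizer(stop_words="english", min_df=1)

# Before fitting, the learned attribute does not exist yet.
fitted_before = hasattr(vectorizer, "vocabulary_")

vectorizer.fit(train_set)

# After fitting, vocabulary_ is a dict mapping each term to its column index.
vocabulary = vectorizer.vocabulary_
print(fitted_before)
print(sorted(vocabulary))
```

Attributes that end in an underscore are learned during fit, so reading them on an unfitted vectorizer raises an AttributeError.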
