2014-12-02 83 views

Using counts and tfidf as features with scikit-learn

I am trying to use counts and tfidf as features for a multinomial NB model. Here is my code:

text = ["this is spam", "this isn't spam"] 
labels = [0,1] 
count_vectorizer = CountVectorizer(stop_words="english", min_df=3) 

tf_transformer = TfidfTransformer(use_idf=True) 
combined_features = FeatureUnion([("counts", self.count_vectorizer), ("tfidf", tf_transformer)]).fit(self.text) 

classifier = MultinomialNB() 
classifier.fit(combined_features, labels) 

But I get an error involving FeatureUnion and TFIDF:

TypeError: no supported conversion for types: (dtype('S18413'),) 

Any idea why this might be happening? Is it not possible to use both counts and tfidf as features?

Answer


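The error does not come from FeatureUnion; it comes from TfidfTransformer.

You should use TfidfVectorizer instead of TfidfTransformer. The transformer expects a matrix of term counts (for example, the output of CountVectorizer) as input, not raw text, hence the TypeError.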
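
To make the difference concrete, here is a minimal sketch (the three toy sentences and all variable names are made up for illustration). It feeds TfidfTransformer the count matrix it expects and checks that the result matches what TfidfVectorizer produces directly from raw text with the same default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# TfidfTransformer works on term counts, so vectorize the text first.
counts = CountVectorizer().fit_transform(docs)
tfidf_from_counts = TfidfTransformer(use_idf=True).fit_transform(counts)

# TfidfVectorizer performs both steps on raw text in one go.
tfidf_direct = TfidfVectorizer(use_idf=True).fit_transform(docs)

# With matching default settings the two routes give the same matrix.
same = np.allclose(tfidf_from_counts.toarray(), tfidf_direct.toarray())
print(same)
```

So if you want to keep TfidfTransformer, it has to sit after a CountVectorizer rather than receive the strings itself.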

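Your test sentences are also too few for this setup: with min_df=3 a term has to appear in at least three documents, which a two-sentence corpus cannot satisfy. Try a larger sample, for example: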

from nltk.corpus import brown 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.pipeline import FeatureUnion 
from sklearn.naive_bayes import MultinomialNB 

# Let's get more text from NLTK (run nltk.download("brown") once if the corpus is missing).
text = [" ".join(sent) for sent in brown.sents()[:100]]
# I'm just going to assign arbitrary labels.
labels = ['yes']*50 + ['no']*50
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# fit_transform (not just fit) returns the stacked feature matrix the classifier needs.
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]).fit_transform(text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)
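
If you also want to classify new text later, one option is to wrap the FeatureUnion and the classifier in a Pipeline, so the same fitted vectorizers are applied at prediction time. The sketch below is only an illustration under assumptions: the toy corpus, the "spam"/"ham" labels and min_df=1 (chosen because the corpus is tiny) are invented here and do not come from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical toy data, only for illustration.
train_text = [
    "win cash prize now",
    "cheap pills win prize",
    "claim free cash now",
    "meeting agenda for monday",
    "lunch meeting with the team",
    "monday project agenda",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = Pipeline([
    # Stack raw counts and tf-idf weights side by side as features.
    ("features", FeatureUnion([
        ("counts", CountVectorizer(min_df=1)),
        ("tfidf", TfidfVectorizer(use_idf=True)),
    ])),
    ("nb", MultinomialNB()),
])
model.fit(train_text, train_labels)

# The pipeline vectorizes unseen text with the already fitted vocabularies.
new_text = ["free cash prize", "agenda for the team meeting"]
predictions = model.predict(new_text)
print(predictions)

# The stacked matrix has one block of columns per vectorizer.
n_columns = model.named_steps["features"].transform(new_text).shape[1]
print(n_columns)
```

Both blocks of features are non-negative, which is what MultinomialNB requires. Whether duplicating the same terms as counts and as tf-idf weights actually helps accuracy is something to check on your own data.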