Machine learning text classification

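I am working on a small text classification project in Python.
The idea is simple: we have a corpus of sentences, each belonging to either J. Chirac or Mitterrand (two former presidents of the French Republic), along with the corresponding labels.
The goal is to build a model that predicts which of the two a sentence belongs to. The class labels are "M" for Mitterrand and "C" for Chirac; in my program I encode them as M ==> -1 and C ==> 1.
Finally, I applied a classification algorithm to my data set and made predictions on new (test) data.
The problem is that when I evaluate the system I get a very low score, even though I have tried several techniques to improve it (stop words, bigrams, smoothing, ...).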

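If anyone has another idea or suggestion for improving my system's performance, I would be very grateful.

I attach some of my code below.

In the code below I define my stop list, remove the words that are not very important, set the splitters used to tokenize my corpus, and build bigrams: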

stoplist = set('le la les de des à un une en au ne ce d l c s je tu il que qui mais quand'.split())
stoplist.add('')
splitters = u'; |, |\*|\. | |\'|'
liste = (re.split(splitters, doc.lower()) for doc in alltxts) # generator: nothing is held in memory
dictionary = corpora.Dictionary([u"{0}_{1}".format(l[i],l[i+1]) for i in xrange(len(l)-1)] for l in liste) # bigrams
print len(dictionary)
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq < 10 ]
dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and tokens that appear in fewer than 10 documents
dictionary.compactify() # remove gaps in id sequence after words that were removed
print len(dictionary)
liste = (re.split(splitters, doc.lower()) for doc in alltxts) # CAUTION: once a generator has been consumed it does not rewind, so re-create it to be safe
alltxtsBig = ([u"{0}_{1}".format(l[i],l[i+1]) for i in xrange(len(l)-1)] for l in liste)
corpusBig = [dictionary.doc2bow(text) for text in alltxtsBig]
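
One thing that may be worth checking in the snippet above: the dictionary holds bigram tokens of the form word1_word2, so looking up single stop words in dictionary.token2id will probably find nothing, and stop_ids may well come out empty. A minimal sketch of one alternative is to filter stop words out of each token list before the bigrams are built. It assumes Python 3 and uses only the standard library. The helper names tokenize and bigrams are hypothetical and not part of gensim, the example sentence is made up, and the trailing empty alternative in the splitter pattern is dropped here because newer Python versions would otherwise split on empty matches.

```python
import re

# Same stop list as in the question.
STOPLIST = set(
    "le la les de des à un une en au ne ce d l c s je tu il que qui mais quand".split()
)
# Same separators as in the question, without the trailing empty alternative.
SPLITTERS = r"; |, |\*|\. | |'"


def tokenize(doc):
    """Lowercase and split a sentence, dropping stop words and empty strings."""
    return [tok for tok in re.split(SPLITTERS, doc.lower()) if tok and tok not in STOPLIST]


def bigrams(tokens):
    """Join each pair of neighbouring tokens as 'first_second'."""
    return ["{0}_{1}".format(a, b) for a, b in zip(tokens, tokens[1:])]


# Made-up example sentence, for illustration only.
example = bigrams(tokenize("Je crois que la France est un grand pays"))
print(example)
```

Note the design trade-off: removing stop words first makes bigrams span the removed words (here "crois" and "france" become neighbours), which may or may not be what is wanted.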

Here I build the corpus for my test data set:

liste_test = (re.split(splitters, doc.lower()) for doc in alltxts_test) 
alltxtsBig_test = ([u"{0}_{1}".format(l[i],l[i+1]) for i in xrange(len(l)-1)] for l in liste_test) 
corpusBig_test = [dictionary.doc2bow(text) for text in alltxtsBig_test]

And here I convert the data to a matrix, fit the algorithm on the training data, and make predictions on the test data:


dataSparse = gensim.matutils.corpus2csc(corpusBig) 
dataSparse_test = gensim.matutils.corpus2csc(corpusBig_test) 
import sklearn.feature_extraction.text as txtTools #.TfidfTransformer 
t = txtTools.TfidfTransformer() 
t.fit(dataSparse.T) 
data2 = t.transform(dataSparse.T) 
data_test = t.transform(dataSparse_test.T) 
nb_classifier = MultinomialNB().fit(data2, labs) 
y_nb_predicted = nb_classifier.predict(data_test)

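Edit:
My system's performance score is 0.28. Normally, an effective system would score above 0.6.
I am working on a file of sentences. I did declare gensim; I did not paste all of the code because it is very long. My question is whether there are other ways to improve the system's performance beyond what I already use (bigrams, smoothing, and so on).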

Source

2014-10-10 ANAS89

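Welcome to stackoverflow. First, are you sure your performance is poor? You don't even say what kind of performance you got, but if (as you seem to say) you are trying to identify the author on the basis of a single sentence, I wouldn't expect any kind of reliability. Authorship attribution is usually done on longer texts.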
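
To judge whether a score really is poor, it can help to compare it with a trivial baseline. The sketch below assumes Python 3, labels encoded as -1 and 1 as in the question, and plain accuracy as the metric (the question does not say which metric produced its score). It computes the accuracy of a set of predictions and the accuracy of always predicting the most frequent training class. The function names and the toy labels are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))


# Toy labels, invented for this example (1 = Chirac, -1 = Mitterrand).
y_train = [1, 1, 1, -1, 1, 1, -1, 1]
y_test = [1, -1, 1, 1]
y_pred = [1, 1, -1, 1]

model_acc = accuracy(y_test, y_pred)
base_acc = majority_baseline(y_train, y_test)
print(model_acc, base_acc)
```

If the classes are unbalanced, a model has to beat the majority baseline, not just 0.5, before it can be said to have learned anything.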

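I'm afraid your code is incomplete (where is gensim defined? what do all these library functions do?) and too hard to follow. But are you using all the (non-stopword) bigrams in the text as features for the classifier? That is a lot of features, and they are all of the same kind (bigrams). You could try adding some different kinds of features to the mix, and/or using the bigram features more selectively to avoid overtraining. You should read up on what kinds of things are likely to work: authorship attribution is not a new task.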
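
As one possible illustration of mixing feature types and being more selective, here is a minimal sketch that combines unigrams with bigrams and keeps only the k features most strongly associated with the label. It assumes Python 3 and the standard library only. Chi-square scoring on feature presence is one common choice picked here for the example, not a method named above; the function names are hypothetical and the toy sentences are made up.

```python
from collections import Counter


def features(tokens):
    """Unigrams plus bigrams for one tokenized sentence."""
    return set(tokens) | {a + "_" + b for a, b in zip(tokens, tokens[1:])}


def chi2_scores(docs, labels, positive=1):
    """Chi-square score of each feature's presence against a binary label."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()  # document frequencies per class
    for feats, y in zip(docs, labels):
        (pos_df if y == positive else neg_df).update(feats)
    scores = {}
    for f in set(pos_df) | set(neg_df):
        a, b = pos_df[f], neg_df[f]   # feature present: positive / negative docs
        c, d = n_pos - a, n_neg - b   # feature absent: positive / negative docs
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[f] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(docs, labels, k):
    """Names of the k highest-scoring features (ties broken alphabetically)."""
    scores = chi2_scores(docs, labels)
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


# Toy tokenized sentences and labels, invented for this example.
toy_tokens = [["france", "forte"], ["france", "unie"], ["europe", "forte"], ["europe", "unie"]]
toy_labels = [1, -1, 1, -1]
toy_docs = [features(t) for t in toy_tokens]
selected = select_top_k(toy_docs, toy_labels, k=2)
print(selected)
```

The selection should be computed on the training data only and then applied unchanged to the test data; otherwise the evaluation score is optimistic.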

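Your question is a bit too broad to answer usefully, because there are too many possible answers. But stick with it, and ask more specific questions as you work through these problems. Good luck!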

Source

2014-10-11 00:03:01 alexis
