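Text [multi-level] classification with many outputs

I want to classify a text document into the category it belongs to, and also classify it into the two levels that make up that category.

Sample training set: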

Description                                           Category          Level1     Level2
The gun shooting that happened in Vegas killed two    Crime | High      Crime      High
Donald Trump elected as President of America          Politics | High   Politics   High
Rian won in football qualifier                        Sports | Low      Sports     Low
Brazil won in football final                          Sports | High     Sports     High
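
In the sample rows, Level1 and Level2 look like the two parts of Category on either side of the pipe. Assuming that holds for every row, the two level columns could be derived from Category as in the sketch below. The helper name split_category is hypothetical and only used for illustration.

```python
def split_category(category):
    """Split a 'Level1 | Level2' string into its two parts.

    Assumes the category contains exactly one pipe character.
    """
    level1, level2 = (part.strip() for part in category.split("|"))
    return level1, level2


# Category values taken from the sample training set above.
categories = ["Crime | High", "Politics | High", "Sports | Low", "Sports | High"]
levels = [split_category(category) for category in categories]
print(levels)
```

If a category ever contained more or fewer than two parts, the unpacking would raise a ValueError, which makes such rows easy to spot.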

Initial attempt:

I tried to build a classification model that predicts the category using a random forest, and it gave me 90% overall accuracy.

Code 1:

import codecs
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.naive_bayes import BernoulliNB 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import train_test_split 
#from stemming.porter2 import stem 

from nltk.corpus import stopwords 

from sklearn.model_selection import cross_val_score 

stop = stopwords.words('english') 
data_file = "Training_dataset_70k" 

#Reading the input/ dataset 
data = pd.read_csv(data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8") 
data = data.dropna() 

#Removing stopwords, punctuation and stemming 
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)])) 
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
#data['Description'] = data['Description'].apply(lambda x: ' '.join([stem(word) for word in x.split()])) 

train_data, test_data, train_label, test_label = train_test_split(data.Description, data.Category, test_size=0.3, random_state=100) 

RF = RandomForestClassifier(n_estimators=10) 
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1,3), sublinear_tf = True) 
data_features = vectorizer.fit_transform(train_data) 
RF.fit(data_features, train_label) 
test_data_feature = vectorizer.transform(test_data) 
Output_predict = RF.predict(test_data_feature) 
print("Overall_Accuracy: " + str(np.mean(Output_predict == test_label)))
with codecs.open("out_Category.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except UnicodeError:
            # Skip rows that cannot be encoded
            continue

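Question:

I want to add two more levels to the model, Level1 and Level2. The reason for adding them is that when I ran the classification for Level1 alone I got 96% accuracy. I am stuck on splitting the training and test datasets and on training a model that has three classifications.

Is it possible to create one model with three classifications, or should I create three models? How do I split the train and test data?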

EDIT1:

import string
import codecs
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.naive_bayes import BernoulliNB 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import train_test_split 
from stemming.porter2 import stem 

from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords 

from sklearn.model_selection import cross_val_score 


stop = stopwords.words('english') 

data_file = "Training_dataset_70k" 
#Reading the input/ dataset 
data = pd.read_csv(data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8") 
data = data.dropna() 
#Removing stopwords, punctuation and stemming 
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)])) 
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)

train_data, test_data, train_label, test_label = train_test_split(data.Description, data[["Category", "Level1", "Level2"]], test_size=0.3, random_state=100) 
RF = RandomForestClassifier(n_estimators=2) 
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1,3), sublinear_tf = True) 
data_features = vectorizer.fit_transform(train_data) 
print(len(train_data), len(train_label))
print(train_label)
RF.fit(data_features, train_label) 
test_data_feature = vectorizer.transform(test_data) 
#print test_data_feature 
Output_predict = RF.predict(test_data_feature) 
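# Accuracy per target column (Category, Level1, Level2)
print("BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label.values, axis=0)))
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
    # test_label.values yields rows; iterating the DataFrame itself would yield column names
    for inp, pred, act in zip(test_data, Output_predict, test_label.values):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except UnicodeError:
            # Skip rows that cannot be encoded
            continue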

Source

2017-08-25 The6thSense

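Could you clarify what the two levels are supposed to be? In the sample training set you provided, your category is something like "Crime | High", and your levels are just the first and second words of the category (so they do not provide any new information). Also, just to make sure: does the category always consist of two words? –

@MiriamFarber Yes, the category always consists of two words separated by a pipe. The reason for adding Level1 and Level2 is that I get higher accuracy for Level1, so even if the category is wrong, it still reduces the work down the line. – The6thSense

OK, just to make sure: when you run a model with one target, you get 90% accuracy if that target is the Category column and 96% if that target is the Level1 column, and you want to build one model with three targets (the Category, Level1 and Level2 columns), right? –

The scikit-learn random forest classifier natively supports multi-output (see this example). Therefore, you do not need to create three separate models.

From the RandomForestClassifier.fit documentation, the inputs to the fit function are: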

X : array-like or sparse matrix of shape = [n_samples, n_features]

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Therefore, you need to pass an array y (your labels) of size N×3 to your RandomForestClassifier. To split your training and test sets, you can do:

train_data, test_data, train_label, test_label = train_test_split(data.Description, data[['Category','Level1','Level2']], test_size=0.3, random_state=100)

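Your train_label and test_label should then be arrays of size N×3, which you can use to fit your model and to compare against your predictions (NB: I have not tested this here, so you may need to do some conversion).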
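
As a rough illustration of the idea, here is a minimal sketch that fits one forest on three targets at once. The four rows are copied from the sample in the question, the variable names are made up for this example, and predictions are made on the training texts only to keep it short, so the printed accuracies say nothing about real performance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data copied from the sample rows in the question.
descriptions = [
    "The gun shooting that happened in Vegas killed two",
    "Donald Trump elected as President of America",
    "Rian won in football qualifier",
    "Brazil won in football final",
]
# One row per document, one column per target: Category, Level1, Level2.
labels = np.array([
    ["Crime | High", "Crime", "High"],
    ["Politics | High", "Politics", "High"],
    ["Sports | Low", "Sports", "Low"],
    ["Sports | High", "Sports", "High"],
])

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(descriptions)

# Passing a 2-D y makes the forest fit all three outputs together.
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(features, labels)

# For brevity this predicts on the training texts; a real run would use a held-out split.
predicted = forest.predict(vectorizer.transform(descriptions))

# Accuracy per output column, and the share of rows where all three outputs match.
per_output_accuracy = (predicted == labels).mean(axis=0)
exact_match_accuracy = (predicted == labels).all(axis=1).mean()
print(predicted.shape, per_output_accuracy, exact_match_accuracy)
```

Reporting accuracy per column keeps the Category and Level1 figures comparable with the single-target runs, while the exact-match figure shows how often all three outputs are right for the same row.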

Source

2017-08-31 05:44:54 nbeuchat

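I will check this with my program and let you know – The6thSense

@The6thSense Did it work? – nbeuchat

I am very sorry, I have not tried it yet; I am not near my system. I will definitely check it tomorrow and let you know as soon as possible. Thanks – The6thSense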
