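Problem statement: multi-level text classification with multiple outputs
I want to classify a text document into the category it belongs to, and also classify it on the two levels of that category.
Sample training set: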
Description                                          Category          Level1    Level2
The gun shooting that happened in Vegas killed two   Crime | High      Crime     High
Donald Trump elected as President of America         Politics | High   Politics  High
Rian won in football qualifier                       Sports | Low      Sports    Low
Brazil won in football final                         Sports | High     Sports    High
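In the sample, Level1 and Level2 look like the two pipe-separated parts of Category. Assuming that always holds, a minimal sketch of deriving both columns with pandas could look like this (the DataFrame below is a toy copy of the sample rows, not the real dataset):

```python
import pandas as pd

# Toy rows copied from the sample training set.
data = pd.DataFrame({
    "Description": [
        "The gun shooting that happened in Vegas killed two",
        "Donald Trump elected as President of America",
        "Rian won in football qualifier",
        "Brazil won in football final",
    ],
    "Category": ["Crime | High", "Politics | High", "Sports | Low", "Sports | High"],
})

# Split "Crime | High" at the pipe, then strip the surrounding spaces.
levels = data["Category"].str.split("|", n=1, expand=True)
data["Level1"] = levels[0].str.strip()
data["Level2"] = levels[1].str.strip()

print(data[["Category", "Level1", "Level2"]])
```

If some Category values do not contain exactly one pipe, the split would need extra handling.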
Initial attempt:
I tried to build a classification model that predicts the Category using a random forest, and it gave me 90% overall accuracy.
Code 1:
import codecs
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from stemming.porter2 import stem
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score
# A set makes the stopword lookup faster than a list
stop = set(stopwords.words('english'))
data_file = "Training_dataset_70k"
# Reading the input dataset (tab-separated, quoting disabled)
data = pd.read_csv(data_file, header=0, delimiter="\t", quoting=3, encoding="utf8")
data = data.dropna()
# Removing stopwords and punctuation (stemming is currently disabled)
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
# Both replacements must go through .str with regex=True to act on the text
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
#data['Description'] = data['Description'].apply(lambda x: ' '.join([stem(word) for word in x.split()]))
train_data, test_data, train_label, test_label = train_test_split(data.Description, data.Category, test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
# Fit the vocabulary on the training text only, then reuse it for the test text
data_features = vectorizer.fit_transform(train_data)
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print("Overall_Accuracy: " + str(np.mean(Output_predict == test_label)))
# Write input text, predicted label and actual label, one row per document
with codecs.open("out_Category.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except Exception:
            continue
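Problem:
I want to add two more levels to the model, Level1 and Level2. The reason for adding them is that when I ran the classification for Level1 alone, I got 96% accuracy. I am stuck at splitting the train and test datasets and at training a model that has three classifications.
Is it possible to create one model with three classifications, or should I create three models? How do I split the train and test data?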
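To make the "three models" option concrete: `train_test_split` accepts a DataFrame of labels, so one split can keep all three label columns aligned with the same text rows, and one classifier can then be fitted per column. The following is only a sketch under assumptions: the eight rows are made-up toy data, and the hyperparameters are arbitrary rather than tuned.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real training file.
data = pd.DataFrame({
    "Description": [
        "The gun shooting that happened in Vegas killed two",
        "Donald Trump elected as President of America",
        "Rian won in football qualifier",
        "Brazil won in football final",
        "Bank robbery suspect arrested downtown",
        "Senate passes the new budget bill",
        "Local team loses friendly football match",
        "Mayor elected in small town vote",
    ],
    "Category": ["Crime | High", "Politics | High", "Sports | Low", "Sports | High",
                 "Crime | High", "Politics | High", "Sports | Low", "Politics | Low"],
    "Level1": ["Crime", "Politics", "Sports", "Sports",
               "Crime", "Politics", "Sports", "Politics"],
    "Level2": ["High", "High", "Low", "High", "High", "High", "Low", "Low"],
})
targets = ["Category", "Level1", "Level2"]

# One split for everything: the label DataFrame is cut on the same rows as the text.
train_text, test_text, train_labels, test_labels = train_test_split(
    data["Description"], data[targets], test_size=0.25, random_state=100)

# Fit the vectorizer on the training text only and reuse it for the test text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
train_features = vectorizer.fit_transform(train_text)
test_features = vectorizer.transform(test_text)

# Three independent models, one per label column, sharing the same features.
models = {}
predictions = pd.DataFrame(index=test_labels.index)
for target in targets:
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(train_features, train_labels[target])
    models[target] = model
    predictions[target] = model.predict(test_features)

print(predictions)
```

Separate models can disagree with each other (for example Level1 "Crime" next to Category "Sports | Low"), so whether this beats a single multi-output model is something to check on the real data.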
EDIT1:
import string
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from stemming.porter2 import stem
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import cross_val_score
stop = stopwords.words('english')
data_file = "Training_dataset_70k"
# Reading the input dataset (tab-separated, quoting disabled)
data = pd.read_csv(data_file, header=0, delimiter="\t", quoting=3, encoding="utf8")
data = data.dropna()
# Removing stopwords and punctuation
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# Both replacements must go through .str with regex=True to act on the text
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
# The label side is now a DataFrame with three target columns
train_data, test_data, train_label, test_label = train_test_split(data.Description, data[["Category", "Level1", "Level2"]], test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
data_features = vectorizer.fit_transform(train_data)
print(len(train_data), len(train_label))
print(train_label)
# With a 2-D label array the forest is fitted as one multi-output model
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
#print(test_data_feature)
Output_predict = RF.predict(test_data_feature)
# Compare against the raw values; axis=0 gives one accuracy per target column
print("BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label.values, axis=0)))
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
    # Iterating a DataFrame yields column names, so iterate its rows via .values
    for inp, pred, act in zip(test_data, Output_predict, test_label.values):
        try:
            out.write("{}\t{}\t{}\n".format(inp, "\t".join(pred), "\t".join(act)))
        except Exception:
            continue
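When a model predicts all three columns at once, a single accuracy number hides which target is failing. A small sketch of scoring each target separately, plus the share of rows where all three are right, is shown below. The two arrays are hypothetical stand-ins for the prediction output and the test labels, with columns ordered Category, Level1, Level2.

```python
import numpy as np

# Hypothetical predictions, one row per document.
predicted = np.array([
    ["Crime | High", "Crime", "High"],
    ["Sports | Low", "Sports", "Low"],
    ["Politics | High", "Politics", "High"],
    ["Sports | High", "Sports", "High"],
])
# Hypothetical true labels for the same documents.
actual = np.array([
    ["Crime | High", "Crime", "High"],
    ["Sports | High", "Sports", "High"],
    ["Politics | High", "Politics", "High"],
    ["Sports | High", "Sports", "High"],
])

# Element-wise comparison: True where a single cell was predicted correctly.
matches = predicted == actual

# Accuracy per target column (Category, Level1, Level2).
per_target_accuracy = matches.mean(axis=0)

# Share of documents where all three targets are correct at once.
exact_match_accuracy = matches.all(axis=1).mean()

print(per_target_accuracy, exact_match_accuracy)
```

In this toy case the second document has the wrong Category but the right Level1, which is the situation where a separate Level1 prediction is still useful.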
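Could you clarify what the two levels are supposed to be? In the sample training set you provided, your category looks like "Crime | High", and your levels are just the first and second word of the category (so they don't provide any new information). Also, just to make sure: does the category always consist of two words? –
@MiriamFarber Yes, the category always consists of two words separated by a pipe. The reason for adding Level1 and Level2 is that I get higher accuracy on Level1, so even when the category is wrong, it still reduces the drill-down effort. – The6thSense
OK, just to make sure: when you run the model with one target, you get 90% success if the target is the Category column and 96% success if the target is the Level1 column, and you want to build a model that has 3 targets (the three columns Category, Level1 and Level2), right? –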