Sparse matrix and DataFrame in Python pandas

I am trying to replicate the "Binary Classification: Twitter sentiment analysis" project in Python.

The steps of this project are:

Step 1: Get data 
Step 2: Text preprocessing using R 
Step 3: Feature engineering 
Step 4: Split the data into train and test 
Step 5: Train prediction model 
Step 6: Evaluate model performance 
Step 7: Publish prediction web service 

I am currently at Step 4, but I don't think I can continue.

import pandas 
import re 
from sklearn.feature_extraction import FeatureHasher 

from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 

from sklearn import cross_validation 

#read the dataset of tweets 

header_row=['sentiment','tweetid','date','query', 'user', 'text'] 
train = pandas.read_csv("training.1600000.processed.noemoticon.csv",names=header_row) 

#keep only the right columns 

train = train[["sentiment","text"]] 

#remove punctuation, special characters, numbers and lower case the text 

def remove_spch(text): 

    return re.sub("[^a-z]", ' ', text.lower()) 

train['text'] = train['text'].apply(remove_spch) 


#Feature Hashing 

def tokens(doc): 
    """Extract tokens from doc. 

    This uses a simple regex to break strings into tokens. 
    """ 
    return (tok.lower() for tok in re.findall(r"\w+", doc)) 

n_features = 2**18 
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True) 
X = hasher.transform(tokens(d) for d in train['text']) 

#Feature selection: choose the best 20,000 features using chi-square 

X_new = SelectKBest(chi2, k=20000).fit_transform(X, train['sentiment']) 

#Using Stratified KFold, split my data to train and test 

skf = cross_validation.StratifiedKFold(X_new, n_folds=2) 

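I believe the last line is wrong, because X_new only contains the 20,000 selected features and not the sentiment column from the pandas DataFrame. How can I "join" the sparse matrix X_new with the DataFrame train, so that the labels are included in cross_validation and the result can then be used for a classifier?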

Answer

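You should pass your class labels to StratifiedKFold and then use skf as an iterator. On each iteration it yields the indices of the training set and the test set, which you can use to split both the feature matrix and the labels.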

See the code example in the official scikit-learn documentation: StratifiedKFold
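
The idea above can be sketched roughly as follows. This is only a minimal illustration, assuming a recent scikit-learn in which StratifiedKFold lives in sklearn.model_selection and receives the labels through split(X, y); in the older cross_validation module used in the question, the labels go to the constructor instead. The small X and y here are hypothetical stand-ins for X_new and train['sentiment'].

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data standing in for X_new (sparse features)
# and train['sentiment'] (class labels). Row i of X belongs to label i of y,
# so no "join" between the matrix and the DataFrame is needed.
X = csr_matrix(np.arange(16, dtype=float).reshape(8, 2))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2)

folds = []
for train_idx, test_idx in skf.split(X, y):
    # Use the same row indices for the sparse matrix and the label array.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    folds.append((X_train, X_test, y_train, y_test))
```

If the labels are kept as a pandas Series, converting them with .values (or indexing with .iloc) keeps the positional indices returned by the splitter aligned with the rows of the sparse matrix.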

Thanks to your answer, I found that the problem was somewhere else, so I opened another question. – Tasos