
I want to solve a classification problem on a given dataset with logistic regression (that part is not the problem). To avoid overfitting, I am trying to implement it with cross-validation (and that is the problem): there is something I am missing to complete the program. My goal is to determine the accuracy of a logistic regression model with cross-validation in Python (using sklearn).

But let me be specific. This is what I have done:

  1. I split the data into a training set and a test set
  2. I defined the logistic regression model to use
  3. I used the cross_val_predict method (from sklearn.cross_validation) to make predictions
  4. Finally, I measured the accuracy

Here is the code:

import pandas as pd 
import numpy as np 
import seaborn as sns 
from sklearn.cross_validation import train_test_split 
from sklearn import metrics, cross_validation 
from sklearn.linear_model import LogisticRegression 

# read training data in pandas dataframe 
data = pd.read_csv("./dataset.csv", delimiter=';') 
# last column is target, store in array t 
t = data['TARGET'] 
# list of features, including target 
features = data.columns 
# item feature matrix in X 
X = data[features[:-1]].as_matrix() 
# remove first column because it is not necessary in the analysis 
X = np.delete(X,0,axis=1) 
# divide in training and test set 
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0) 

# define method 
logreg=LogisticRegression() 

# cross validation prediction 
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10) 
print(metrics.accuracy_score(t_train, predicted)) 

My questions:

  • From my understanding, the test set should not be touched until the very end, and the cross-validation should be performed on the training set. That is why I passed X_train and t_train to the cross_val_predict method. However, I get an error saying:

    ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]

    where 6016 is the number of samples in the whole dataset, while 4812 is the number of samples in the training set after the split.

  • After that, I do not know what to do. I mean: when do X_test and t_test come into play? I do not understand how I should use them after the cross-validation, and how to obtain the final accuracy.

Bonus question: I would also like to perform scaling and dimensionality reduction (via feature selection or PCA) within each step of the cross-validation. How can I do this? I have seen that defining a pipeline can help with scaling, but I do not know how to apply it to the second problem.

I would really appreciate any help :-)

Answers

Answer 1

Here is working code, tested on a sample dataframe. The first problem in your code is that the target array is not an np.array. You should also not have the target data among your features. Below I illustrate how to split training and test data manually with train_test_split. I also show how to use the wrapper cross_val_score to automatically split, fit, and score.

import string 
import numpy as np 
import pandas as pd 
from sklearn import linear_model, model_selection 

# Seed numpy's generator (random.seed would not affect np.random) for reproducibility. 
np.random.seed(42) 
# Create example df with alphabetic col names. 
alphabet_cols = list(string.ascii_uppercase)[:26] 
df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)), 
        columns=alphabet_cols) 
df['Target'] = df['A'] 
df.drop(['A'], axis=1, inplace=True) 
print(df.head()) 
y = df.Target.values # df['Target'] is a pandas Series, not an np.array. 
feature_cols = [i for i in list(df.columns) if i != 'Target'] 
X = df.loc[:, feature_cols].as_matrix() 
# Illustrated here for manual splitting of training and testing data. 
X_train, X_test, y_train, y_test = \ 
    model_selection.train_test_split(X, y, test_size=0.2, random_state=0) 

# Initialize model. Note: this is LinearRegression, which is why the scores 
# below are R^2 values; use linear_model.LogisticRegression() for the 
# classification case in the question. 
logreg = linear_model.LinearRegression() 

# Use cross_val_score to automatically split, fit, and score. 
scores = model_selection.cross_val_score(logreg, X, y, cv=10) 
print(scores) 
print('average score: {}'.format(scores.mean())) 

Output

 B C D E F G H I J K ... Target 
0 20 33 451 0 420 657 954 156 200 935 ... 253 
1 427 533 801 183 894 822 303 623 455 668 ... 421 
2 148 681 339 450 376 482 834 90 82 684 ... 903 
3 289 612 472 105 515 845 752 389 532 306 ... 639 
4 556 103 132 823 149 974 161 632 153 782 ... 347 

[5 rows x 26 columns] 
[-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399 0.0328 
-0.0409] 
average score: -0.04258093018969249 



Thanks a lot, man! I fixed the code and now it works. The target being among the features was not really a problem, because the -1 in my code removed it, since it is the last column. So the real issue was indeed that the target was not an np.array, as you pointed out (although, I must say, I do not really understand its mysterious relation to the size error the machine returned). Do you have any idea how to complete the procedure, i.e. how to perform the final test? I am a bit confused about what I should do now. – Harnak


I edited my answer to include a complete procedure using 'model_selection.cross_val_score'. As for the size error, working between pd.dataframes and np.ndarrays can be painful. You can print each ambiguous object with 'x.shape' to troubleshoot. The best way to learn this stuff is to dig into the sklearn documentation and tutorials. – 2017-02-17 20:45:04
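
As a minimal sketch of that kind of check (using the variable names from the question; the two mismatched numbers from the ValueError should show up in these printouts):

# Print the shape of every object that goes into cross_val_predict; 
# a length mismatch between X and t explains the "inconsistent numbers 
# of samples" error. 
print(X_train.shape, np.asarray(t_train).shape) 
print(X_test.shape, np.asarray(t_test).shape) 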


I am not sure I understand correctly. So, does using cross_val_score make the previous split unnecessary? I mean: shouldn't the cross-validation be done only on the training set rather than on the whole set? Or maybe I am missing the point of cross-validation. – Harnak

Answer 2

Please have a look at the documentation of cross-validation at scikit to understand it better.

Also, you are using cross_val_predict incorrectly. What it will do internally is use the cv you supplied (cv=10) to split the supplied data (i.e. X_train, t_train in your case) again into training and test parts, fit the estimator on the training part, and predict on the data that remains in the test part.

Now, for prediction on X_test and t_test, you should first fit your estimator on the training data (cross_val_predict will not fit it), then use it to predict on the test data, and then calculate the accuracy.

A simple code snippet to describe the above (borrowing from your code; please read the comments and ask if you do not understand something):

# item feature matrix in X 
X = data[features[:-1]].as_matrix() 
# remove first column because it is not necessary in the analysis 
X = np.delete(X,0,axis=1) 
# divide in training and test set 
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0) 

# Until here everything is good 
# You keep away 20% of data for testing (test_size=0.2) 
# This test data should be unseen by any of the below methods 

# define method 
logreg=LogisticRegression() 

# Ideally what you are doing here should be correct, unless you did something wrong in the dataframe operations (which apparently has been solved) 
# cross validation prediction 
# This cross validation prediction will print the predicted values of 't_train' 
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10) 
# internal working of cross_val_predict: 
    #1. Get the data and estimator (logreg, X_train, t_train) 
    #2. From here on, we will use X_train as X_cv and t_train as t_cv (because cross_val_predict doesn't know that it's our training data) - Doubts?? 
    #3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv 
    #4. Use X_cv_train, t_cv_train for fitting 'logreg' 
    #5. Predict on X_cv_test (No use of t_cv_test) 
    #6. Repeat steps 3 to 5 for cv=10 iterations, each time using different data for training and different data for testing. 

# So here you are correctly comparing 'predicted' and 't_train' 
print(metrics.accuracy_score(t_train, predicted)) 

# The above metrics will show you how our estimator 'logreg' works on 'X_train' data. If the accuracies are very high it may be because of overfitting. 

# Now what to do with the X_test and t_test above. 
# Actually, the correct data for the final metric is this X_test with t_test 
# If you are satisfied with the accuracies on the training data, then you should fit the entire training data to the estimator and then predict on X_test 

logreg.fit(X_train, t_train) 
t_pred = logreg.predict(X_test) 

# Here is the final accuracy 
print(metrics.accuracy_score(t_test, t_pred)) 
# If this accuracy is good, then your model is good. 

If you have little data, or do not want to split the data into training and testing sets, then you should use the approach suggested by @fuzzyhedge:

# Use cross_val_score on your all data 
scores = model_selection.cross_val_score(logreg, X, y, cv=10) 

# 'cross_val_score' works almost the same as steps 1 to 4 above 
    #5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test. 
    #6. Repeat steps 1 to 5 for cv_iterations = 10 
    #7. Return array of accuracies calculated in step 5. 

# Find the average of the returned accuracies to see the model performance 
mean_score = scores.mean() 

A note of advice - cross_validation is also best used together with gridsearch to find the parameters of the estimator that perform best for the given data. For example, LogisticRegression defines many parameters. But if you use

logreg = LogisticRegression() 

the model will be initialized with only the default parameters. Perhaps a different set of parameter values, such as

logreg = LogisticRegression(penalty='l1', solver='liblinear') 

might perform better on your data. This search for better parameters is gridsearch.
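
As a minimal sketch of such a grid search (the grid values are only illustrative, and X_train, t_train are the ones from your code):

from sklearn.model_selection import GridSearchCV 

# Candidate parameter values to try (illustrative, not a recommendation). 
param_grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]} 

# GridSearchCV runs an internal cross-validation for every combination in the grid. 
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=10) 
grid.fit(X_train, t_train) 

print(grid.best_params_) # best combination found 
print(grid.best_score_)  # mean cross-validated accuracy of that combination 

# grid.best_estimator_ is refit on all of X_train by default, 
# so it can be used directly to predict on the held-out test set. 
t_pred = grid.best_estimator_.predict(X_test) 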

Now, as for the second part of your question about scaling, dimensionality reduction etc., use pipelines. You can refer to the documentation of pipeline and its examples.
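
As a minimal sketch of that idea (assuming the X_train, t_train from your code; the number of PCA components is only an example), the whole pipeline is cross-validated, so the scaler and the PCA are re-fit on the training part of each fold:

from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA 

# Each step is a (name, transformer) pair; the last step must be the estimator. 
pipe = Pipeline([('scale', StandardScaler()), 
                 ('pca', PCA(n_components=5)), # 5 components is only an example 
                 ('logreg', LogisticRegression())]) 

# cross_val_score clones the whole pipeline for each fold, so scaling and PCA 
# are fit on each training split and only applied to the matching test split. 
scores = model_selection.cross_val_score(pipe, X_train, t_train, cv=10) 
print(scores.mean()) 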

Feel free to ping me if you need any help.


Good point about the grid search. – 2017-02-18 04:54:27


Thank you. A very complete and helpful answer! Yes, I was trying to figure some things out from the sklearn documentation, but I was still confused about how to combine the initial split with cross-validation. Now it is much clearer – Harnak