Reproducing R's LASSO / logistic regression results in Python with the Iris dataset

I am trying to reproduce the following R results in Python. In this particular case the R predictive skill is lower than the Python skill, but in my experience this is usually not so (hence the wish to reproduce the results in Python), so please ignore that detail here.
The goal is to predict the flower species ('versicolor' = 0 or 'virginica' = 1). We have 100 labeled samples, each with four flower features: sepal length, sepal width, petal length, and petal width. I split the data into a training set (60% of the data) and a test set (40% of the data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (the parameter optimized in scikit-learn is 'C').
In R I use glmnet with alpha set to 1 (for the LASSO penalty); in Python, scikit-learn's LogisticRegressionCV function with the 'liblinear' solver (the only solver that supports an L1 penalty). The scoring metric used in cross-validation is the same in both languages. Somehow, though, the model results differ (the intercept and the coefficients found for each feature vary considerably).
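One thing worth keeping in mind before comparing the outputs: glmnet's lambda and scikit-learn's C are not the same quantity. Assuming glmnet's default objective (average log-loss plus lambda times the L1 norm) and liblinear's (C times the summed log-loss plus the L1 norm), the two are roughly related by C ≈ 1 / (n_train * lambda). A minimal sketch of that conversion, using the numbers reported below:

# Rough lambda <-> C conversion, under the objective scalings assumed above.
n_train = 60                # training-set size used in both scripts
lambda_1se = 0.04136537     # lambda.1se reported by cv.glmnet below
C_equiv = 1.0 / (n_train * lambda_1se)
print(C_equiv)              # ~0.40, same ballpark as clf.C_ ~0.36 below

Even so, the two values will not match exactly, because LogisticRegressionCV searches its own default grid of ten C values spaced logarithmically between 1e-4 and 1e4 rather than glmnet's lambda path.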
R code:
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2
n_sample = NROW(X)
w = .6
# R is 1-indexed: 1:(w * n_sample) selects the first 60 rows
X_train = X[1:(w * n_sample), ]              # (60, 4)
y_train = y[1:(w * n_sample)]                # (60,)
X_test = X[((w * n_sample) + 1):n_sample, ]  # (40, 4)
y_test = y[((w * n_sample) + 1):n_sample]    # (40,)
# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
nfolds = 10, alpha=1, family="binomial", type.measure="class")
best_s <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class", s=best_s))
# best lambda
print(best_s)
# 0.04136537
# fraction correct
print(sum(y_test==pred)/NROW(pred))
# 0.75
# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept) -14.680479
#Sepal.Length 0
#Sepal.Width 0
#Petal.Length 1.181747
#Petal.Width 4.592025
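Note that cv.glmnet is asked for lambda.1se here: the most regularized model whose mean CV score is within one standard error of the best one. LogisticRegressionCV instead simply takes the C with the best mean CV score, which is closer in spirit to lambda.min. A hedged sketch of how a 1-SE-style rule could be applied on the Python side, given a (n_folds, n_penalties) matrix of scores where higher is better (such as the accuracy matrix LogisticRegressionCV stores per class in clf.scores_):

import numpy as np

def one_se_rule(cv_scores, penalties):
    # cv_scores: (n_folds, n_penalties); penalties ordered from the
    # strongest regularization to the weakest.
    mean = cv_scores.mean(axis=0)
    se = cv_scores.std(axis=0, ddof=1) / np.sqrt(cv_scores.shape[0])
    best = mean.argmax()
    within = np.where(mean >= mean[best] - se[best])[0]
    return penalties[within[0]]  # strongest penalty still within 1 SE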
Python code:
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]      # drop 'setosa'; keep the four features
y = y[y != 0] - 1  # two species left: 'versicolor' (0), 'virginica' (1)
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)] # (60, 4)
y_train = y[:int(w * n_sample)] # (60,)
X_test = X[int(w * n_sample):] # (40, 4)
y_test = y[int(w * n_sample):] # (40,)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(clf.score(X_train_fit.transform(X_test), y_test))  # score is 0.775
print(clf.intercept_)  # -1.83569557
print(clf.coef_)  # [ 0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print(clf.C_)  # optimal C: 0.35938137 (C is the inverse regularization strength, not lambda itself)
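A further difference that affects the printed coefficients directly: glmnet standardizes internally (standardize=TRUE by default) but reports coefficients back on the original feature scale, while clf above was fit on the StandardScaler output, so its coefficients live on the standardized scale. A minimal sketch for mapping them back before comparing with the glmnet output, using the scaler fit above:

# If z = (x - mean_) / scale_, then w.z + b == (w / scale_).x + (b - sum(w * mean_ / scale_)).
import numpy as np
w = clf.coef_.ravel()
coef_orig = w / X_train_fit.scale_
intercept_orig = clf.intercept_ - np.sum(w * X_train_fit.mean_ / X_train_fit.scale_)
print(coef_orig, intercept_orig)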
Thanks a lot. That said, the train_test_split function does seem convenient (see my reply to Grr), and I am not sure whether the split is the cause of the difference between the two languages. I will try to implement a balanced split in both R and Python (a sketch follows below) and then update my original post. –
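A hedged sketch of the balanced split mentioned above, on the Python side. stratify=y keeps the 50/50 class ratio in both halves; the sequential slicing in the original scripts puts all 50 'versicolor' samples plus only 10 'virginica' samples into the training set, because the rows are ordered by species:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)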
I would suggest creating two files, one for your training set and another for your test set, and reading those files into both Python and R. That is the safest way to make sure your data is split identically. – ncfirth
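A minimal sketch of that suggestion on the Python side (the file names are illustrative; the R script would then read the same files with read.csv):

import numpy as np
# Persist the exact split once; both languages then load identical data.
np.savetxt("iris_train.csv", np.column_stack([X_train, y_train]), delimiter=",")
np.savetxt("iris_test.csv", np.column_stack([X_test, y_test]), delimiter=",")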