How to use LinearRegression with categorical variables in sklearn


I want to run some speed comparison tests of Python vs. R, and I am struggling with one issue: linear regression in sklearn with categorical variables.

R code:

# Start the clock! 
ptm <- proc.time() 
ptm 

test_data = read.csv("clean_hold.out.csv") 

# Regression Model 
model_liner = lm(test_data$HH_F ~ ., data = test_data) 

# Stop the clock 
new_ptm <- proc.time() - ptm

Python code:

import pandas as pd 
import time 

from sklearn.linear_model import LinearRegression 
from sklearn.feature_extraction import DictVectorizer 

start = time.time() 

test_data = pd.read_csv("./clean_hold.out.csv") 

x_train = [col for col in test_data.columns[1:] if col != 'HH_F'] 
y_train = ['HH_F'] 

model_linear = LinearRegression(normalize=False) 
model_linear.fit(test_data[x_train], test_data[y_train])

But it does not work for me:

return X.astype(np.float32 if X.dtype == np.int32 else np.float64) ValueError: could not convert string to float: Bee True

As a workaround, I tried another approach:

test_data = pd.read_csv("./clean_hold.out.csv").to_dict() 
v = DictVectorizer(sparse=False) 
X = v.fit_transform(test_data)

However, I ran into another error:

File "C:\Anaconda32\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 258, in transform Xa[i, vocab[f]] = dtype(v) TypeError: float() argument must be a string or a number

I don't understand how to solve this problem in Python.

Example of the data: http://screencast.com/t/hYyyu7nU9hQm

Source

2015-10-05 SpanishBoy

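Without your source data, it will be hard for us to debug anything. You might also consider whether CrossValidated is a better fit. – TARehman

Added a very small part of the data. Please advise. – SpanishBoy

I think you might benefit from http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example - in our case, a screenshot of the data is not super useful. – TARehman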

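Some encoding has to be done before calling fit, because it cannot convert strings to floats (that is what the ValueError is saying).

There are several classes that can be used:

LabelEncoder : turns each distinct string into an incremental integer value
OneHotEncoder : uses the one-of-K scheme to turn each distinct string into its own 0/1 column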
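To make the difference between the two encoders concrete, here is a minimal sketch. The `animals` array is made up for illustration (only the value "Bee" appears in the question's error message), and passing strings directly to OneHotEncoder assumes scikit-learn 0.20 or newer.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column; the values are invented for illustration.
animals = np.array(["Bee", "Cat", "Bee", "Dog"])

# LabelEncoder: one integer per distinct string, assigned in sorted order.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(animals)

# OneHotEncoder: one 0/1 column per distinct string.
# It expects 2-D input, hence the reshape; the result is sparse by default,
# so .toarray() is used here only to make it easy to print.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(animals.reshape(-1, 1)).toarray()

print(labels)
print(onehot)
```

Note that the integer codes from LabelEncoder imply an ordering (Bee < Cat < Dog) that a linear model will treat as meaningful, and the scikit-learn documentation describes LabelEncoder as intended for target values rather than input features. For unordered input categories, one-hot columns are usually the safer choice.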

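I wanted a scalable solution but did not get any answers. I chose OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have many distinct strings the matrix grows very quickly and needs a lot of memory.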
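A possible way to wire this into the regression is sketched below. It is not the code from the question: the DataFrame is a made-up stand-in for clean_hold.out.csv (only the column name HH_F comes from the question; `animal` and `size` are hypothetical), and it assumes a scikit-learn version that provides ColumnTransformer (0.20 or newer). OneHotEncoder returns a sparse matrix by default, which is one way to limit the memory growth mentioned above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for clean_hold.out.csv; only HH_F is a name from the question.
test_data = pd.DataFrame({
    "HH_F": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "animal": ["Bee", "Cat", "Bee", "Dog", "Cat", "Dog"],
    "size": [0.5, 1.5, 1.0, 2.5, 2.0, 3.5],
})

X = test_data.drop(columns="HH_F")
y = test_data["HH_F"]

# Every non-numeric column is treated as categorical.
categorical = X.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the categorical columns and pass numeric columns through.
# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = make_pipeline(preprocess, LinearRegression())
model.fit(X, y)

preds = model.predict(X)
print(preds)
```

Keeping the encoder inside the pipeline means the same category-to-column mapping learned during fit is reused at predict time, so the encoded columns stay aligned between training and new data.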

Source

2015-10-23 10:01:09 SpanishBoy

Can you show some example code for converting strings with OneHotEncoder? – LKM
