我想將填充字符串的列轉換爲分類變量,以便我可以運行統計信息。但是,我對這種轉換有困難,因爲我對Python相當陌生。將列中的字符串轉換爲分類變量
這裏是我的代碼示例:
# Open txt file and provide column names
data = pd.read_csv('sample.txt', sep="\t", header = None,
names = ["Label", "I1", "I2", "C1", "C2"])
# Convert I1 and I2 to continuous, numeric variables
data = data.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# Convert Label, C1, and C2 to categorical variables
data["Label"] = pd.factorize(data.Label)[0]
data["C1"] = pd.factorize(data.C1)[0]
data["C2"] = pd.factorize(data.C2)[0]
# Split the predictors into training/testing sets
predictors = data.drop('Label', 1)
msk = np.random.rand(len(predictors)) < 0.8
predictors_train = predictors[msk]
predictors_test = predictors[~msk]
# Split the response variable into training/testing sets
response = data['Label']
ksm = np.random.rand(len(response)) < 0.8
response_train = response[ksm]
response_test = response[~ksm]
# Logistic Regression
from sklearn import linear_model
# Create logistic regression object
lr = linear_model.LogisticRegression()
# Train the model using the training sets
lr.fit(predictors_train, response_train)
不過,我得到這個錯誤:
ValueError: could not convert string to float: 'ec26ad35'
的ec26ad35
值從分類變量C1
和C2
的字符串。我不確定發生了什麼:我沒有將字符串轉換爲分類變量嗎?爲什麼錯誤說他們仍然是字符串?
使用data.head(30)
,這是我的數據:
>> data[["Label", "I1", "I2", "C1", "C2"]].head(30)
Label I1 I2 C1 C2
0 0 1.0 1 68fd1e64 80e26c9b
1 0 2.0 0 68fd1e64 f0cf0024
2 0 2.0 0 287e684f 0a519c5c
3 0 NaN 893 68fd1e64 2c16a946
4 0 3.0 -1 8cf07265 ae46a29d
5 0 NaN -1 05db9164 6c9c9cf3
6 0 NaN 1 439a44a4 ad4527a2
7 1 1.0 4 68fd1e64 2c16a946
8 0 NaN 44 05db9164 d833535f
9 0 NaN 35 05db9164 510b40a5
10 0 NaN 2 05db9164 0468d672
11 0 0.0 6 05db9164 9b5fd12f
12 1 0.0 -1 241546e0 38a947a1
13 1 NaN 2 be589b51 287130e0
14 0 0.0 51 5a9ed9b0 80e26c9b
15 0 NaN 2 05db9164 bc6e3dc1
16 1 1.0 987 68fd1e64 38d50e09
17 0 0.0 1 8cf07265 7cd19acc
18 0 0.0 24 05db9164 f0cf0024
19 0 7.0 102 3c9d8785 b0660259
20 1 NaN 47 1464facd 38a947a1
21 0 0.0 1 05db9164 09e68b86
22 0 NaN 0 05db9164 38a947a1
23 0 NaN 9 05db9164 08d6d899
24 0 0.0 1 5a9ed9b0 3df44d94
25 0 NaN 4 5a9ed9b0 09e68b86
26 1 0.0 1 8cf07265 942f9a8d
27 1 0.0 20 68fd1e64 38a947a1
28 1 0.0 78 68fd1e64 1287a654
29 1 3.0 0 05db9164 90081f33
編輯:從分裂到dataframes訓練和測試數據集後,填充缺失的數據有錯誤了。不知道這裏發生了什麼。
# Impute missing data
>> from sklearn.preprocessing import Imputer
>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>> predictors_train = imp.fit_transform(predictors_train)
TypeError: float() argument must be a string or a number, not 'function'
我不知道該變量是什麼,但對於分類變量,需要在線性迴歸中使用[虛擬變量](http://stackoverflow.com/a/37144372/2285236)。 – ayhan
如果您從數據框中發佈樣本,我也可以爲其提供熊貓解決方案。 – ayhan
@Ayhan它已經結束了。 –