2017-07-01 72 views
1
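from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np

a = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])

# Columns 0 and 1 are categorical; column 2 is numeric.
# Note: categorical_features was deprecated in scikit-learn 0.20 and
# removed in 0.22, so this call requires an older release.
oh = OHE(categorical_features=[0, 1])
a = oh.fit_transform(a).toarray()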

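Let's assume the first and second columns are categorical data. This code performs one-hot encoding, but for a regression problem I want to drop the first column of each feature after encoding the categorical data. In this example there are only two categorical features, so I could do it manually. But how would you solve this if you had many categorical features?

Dropping columns with sklearn's OneHotEncoder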

Answers

0

You can use NumPy slicing to cut off the first column:

>>> a
array([[  1.,   0.,   0.,   1.,   0.,   0., 100.],
       [  0.,   1.,   0.,   0.,   1.,   0., 200.],
       [  0.,   0.,   1.,   0.,   0.,   1., 400.]])
>>> a[:, 1:]
array([[  0.,   0.,   1.,   0.,   0., 100.],
       [  1.,   0.,   0.,   1.,   0., 200.],
       [  0.,   1.,   0.,   0.,   1., 400.]])

If you have a list of columns to delete, here is how you would do it with fancy indexing:

>>> idx_to_delete = [0, 3]
>>> indices = [i for i in range(a.shape[-1]) if i not in idx_to_delete]
>>> indices
[1, 2, 4, 5, 6]
>>> a[:, indices]
array([[  0.,   0.,   0.,   0., 100.],
       [  1.,   0.,   1.,   0., 200.],
       [  0.,   1.,   0.,   1., 400.]])
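
As a side note to the list-based approach above, `np.delete` removes a list of columns in a single call. This is only a minimal sketch that reuses the encoded array and the `idx_to_delete` list from this answer; the name `reduced` is chosen here for illustration.

```python
import numpy as np

# One-hot encoded array from the answer above.
a = np.array([[1., 0., 0., 1., 0., 0., 100.],
              [0., 1., 0., 0., 1., 0., 200.],
              [0., 0., 1., 0., 0., 1., 400.]])

idx_to_delete = [0, 3]

# np.delete returns a new array with the listed columns (axis=1) removed;
# the original array is left unchanged.
reduced = np.delete(a, idx_to_delete, axis=1)
print(reduced)
```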
+0

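Yes, this removes the first column of the first categorical set. But what if I have 1000 categorical features and need to drop the first column of each one after one-hot encoding? – Makaroniiii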

+0

The concept is still the same; you can extend it to a third dimension like this: `a[:, :, 1:]` –

+0

Sorry again, but I get this error: builtins.IndexError: too many indices for array – Makaroniiii
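
The error in the last comment is expected: the encoded array is two-dimensional, so indexing it with three indices fails. One way to drop the first column of every categorical feature, however many there are, is to compute where each feature's block of dummy columns starts. The sketch below assumes the number of levels per feature is known (the list `levels_per_feature` is a hypothetical input, for example obtained with `np.unique` on each original column) and that the encoded columns are laid out feature by feature.

```python
import numpy as np

# One-hot encoded array from the first answer: two categorical features
# with three levels each, followed by one numeric column.
encoded = np.array([[1., 0., 0., 1., 0., 0., 100.],
                    [0., 1., 0., 0., 1., 0., 200.],
                    [0., 0., 1., 0., 0., 1., 400.]])

# A 2-D array accepts at most two indices, so a third one raises IndexError.
try:
    encoded[:, :, 1:]
    raised = False
except IndexError:
    raised = True

# Hypothetical input: number of levels of each categorical feature,
# in encoded column order.
levels_per_feature = [3, 3]

# Each feature's block starts where the previous blocks end.
first_columns = np.cumsum([0] + levels_per_feature[:-1])

# Boolean mask that keeps every column except the first one of each block.
keep = np.ones(encoded.shape[1], dtype=bool)
keep[first_columns] = False
reduced = encoded[:, keep]
print(reduced)
```

The same code works unchanged for 1000 features, since only `levels_per_feature` grows.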

0

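To automate this, we obtain the list of indices to drop before applying one-hot encoding, by identifying the most common level of each categorical feature. The most common level serves best as the base level, which allows the importance of the other levels to be evaluated.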

After applying one-hot encoding, we build the list of indices to keep and use it to drop the previously identified columns.

from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np

a = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])

def get_indices_to_drop(X_before_OH, categorical_indices_list):
    # Returns the list of column indices to drop after one-hot encoding.
    # Drops the most common level of each categorical variable, because
    # the most common level serves best as the base level, allowing the
    # importance of the other levels to be evaluated.
    # Assumes categorical_indices_list is sorted in ascending order.
    indices_to_drop = []
    indices_accum = 0  # offset of the current feature's first encoded column
    for i in categorical_indices_list:
        levels, counts = np.unique(X_before_OH[:, i], return_counts=True)
        # Encoded columns follow the sorted levels, so the position of the
        # most common level (not its value) gives the column to drop.
        indices_to_drop.append(indices_accum + int(np.argmax(counts)))
        indices_accum += len(levels)
    return indices_to_drop

indices_to_drop = get_indices_to_drop(a, [0, 1])

# categorical_features requires scikit-learn < 0.22 (it was removed in 0.22).
oh = OHE(categorical_features=[0, 1])
a = oh.fit_transform(a).toarray()

def get_indices_to_keep(X_after_OH, index_to_drop_list):
    return [i for i in range(X_after_OH.shape[-1]) if i not in index_to_drop_list]

indices_to_keep = get_indices_to_keep(a, indices_to_drop)
a = a[:, indices_to_keep]
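
On newer scikit-learn releases the encoder can drop a level by itself, which may make the index bookkeeping unnecessary. The following is a minimal sketch, assuming scikit-learn 0.21 or later (where `OneHotEncoder` accepts `drop`) and using `ColumnTransformer` in place of the removed `categorical_features` argument. The transformer name `"onehot"` and the variable names are arbitrary choices for this example. Note that `drop='first'` removes the lowest-sorted level of each feature, not the most common one as in the answer above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])

# Encode columns 0 and 1, dropping the first level of each feature;
# pass the remaining numeric column through unchanged.
# sparse_threshold=0 asks ColumnTransformer to return a dense array.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), [0, 1])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = ct.fit_transform(X)
print(encoded)
```

The encoded columns come first and the passthrough column last, so the layout matches the arrays shown earlier in this thread.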