Removing columns with sklearn's OneHotEncoder

from sklearn.preprocessing import LabelEncoder as LE, OneHotEncoder as OHE 
import numpy as np 

a = np.array([[0,1,100],[1,2,200],[2,3,400]]) 


oh = OHE(categorical_features=[0,1]) 
a = oh.fit_transform(a).toarray()

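Let's assume the first and second columns are categorical data. This code performs one-hot encoding, but for a regression problem I want to drop the first column of each encoded feature after encoding the categorical data. In this example there are only two categorical features, so I could do it by hand. But how would you solve this if you have many categorical features?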

Source

2017-07-01 Makaroniiii

You can use NumPy's fancy indexing and slice off the first column:

>>> a
array([[ 1., 0., 0., 1., 0., 0., 100.],
     [ 0., 1., 0., 0., 1., 0., 200.],
     [ 0., 0., 1., 0., 0., 1., 400.]])
>>> a[:, 1:]
array([[ 0., 0., 1., 0., 0., 100.],
     [ 1., 0., 0., 1., 0., 200.],
     [ 0., 1., 0., 0., 1., 400.]])

If you have a list of columns to delete, here is how you would do it:

>>> idx_to_delete = [0, 3] 
>>> indices = [i for i in range(a.shape[-1]) if i not in idx_to_delete] 
>>> indices 
[1, 2, 4, 5, 6] 
>>> a[:, indices] 
array([[ 0., 0., 0., 0., 100.], 
     [ 1., 0., 1., 0., 200.], 
     [ 0., 1., 0., 1., 400.]])
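
With many categorical features, the list of indices to delete does not have to be typed by hand. The sketch below is one possible way to compute it, assuming you know how many levels each categorical feature has (for example from np.unique on the original columns) and that the encoded groups sit side by side at the start of the array, as in the output above. The helper name first_column_of_each_group is made up for illustration.

```python
import numpy as np

def first_column_of_each_group(levels_per_feature):
    """Return the index of the first one-hot column of every categorical feature."""
    sizes = np.asarray(levels_per_feature)
    # Group k starts where the previous groups end: 0, n0, n0 + n1, ...
    return np.concatenate(([0], np.cumsum(sizes)[:-1]))

# Encoded array from the example above: two features with three levels each,
# followed by one numeric column.
a = np.array([[1., 0., 0., 1., 0., 0., 100.],
              [0., 1., 0., 0., 1., 0., 200.],
              [0., 0., 1., 0., 0., 1., 400.]])

idx_to_delete = first_column_of_each_group([3, 3])
# Remove those columns in one call; the numeric column is left untouched.
reduced = np.delete(a, idx_to_delete, axis=1)
```

Here idx_to_delete comes out as [0, 3], the same list used above, and np.delete is simply a shorter alternative to building the list of indices to keep.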

Source

2017-07-01 19:14:15

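Yes, this removes the first column of the first categorical set. But what if I have 1000 categories and need to drop the first column of each one after one-hot encoding? – Makaroniiii

The concept is still the same; you can extend it to a third dimension like this: 'a[:, :, 1:]' –

Sorry again, but I get this error: builtins.IndexError: too many indices for array – Makaroniiii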

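To automate this, we get the list of indices to drop by identifying the most common level of each categorical feature before applying one-hot encoding. The most common level serves best as the base level, because it allows the importance of the other levels to be evaluated against it.

After applying one-hot encoding, we build the list of indices to keep and use it to drop the columns identified earlier.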

from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np
import pandas as pd

a = np.array([[0,1,100],[1,2,200],[2,3,400]])

def get_indices_to_drop(X_before_OH, categorical_indices_list):
    # Returns the list of column indices to drop after one-hot encoding.
    # Drops the most common level of each categorical variable, because
    # the most common level serves best as the base level, allowing the
    # importance of the other levels to be evaluated.
    indices_to_drop = []
    indices_accum = 0  # index of the first encoded column of the current feature
    for i in categorical_indices_list:
        levels = np.unique(X_before_OH[:, i])  # sorted, same order as the encoded columns
        most_common = pd.Series(X_before_OH[:, i]).value_counts().index[0]
        # Position of the most common level inside this feature's block of columns
        indices_to_drop.append(indices_accum + int(np.searchsorted(levels, most_common)))
        indices_accum += len(levels)  # the next feature starts after all of this one's levels
    return indices_to_drop

indices_to_drop = get_indices_to_drop(a, [0, 1])

# Note: categorical_features was removed from OneHotEncoder in scikit-learn 0.22,
# so this call needs an older release.
oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

def get_indices_to_keep(X_after_OH, index_to_drop_list):
    return [i for i in range(X_after_OH.shape[-1]) if i not in index_to_drop_list]

indices_to_keep = get_indices_to_keep(a, indices_to_drop)
a = a[:, indices_to_keep]
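
On scikit-learn releases where OneHotEncoder has a drop parameter (0.21 and later, as far as I know), the same idea can probably be written without tracking column indices by hand. The following is only a sketch under that assumption: it uses ColumnTransformer to select the categorical columns, the helper most_common_levels is hypothetical, and a fourth row was added to the example data so that each feature has a single most common level.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Example data; the fourth row is an addition for illustration only.
X = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400], [0, 3, 500]])
categorical = [0, 1]

def most_common_levels(X, columns):
    """Return the most frequent value of each listed column (ties: smallest value)."""
    levels = []
    for col in columns:
        values, counts = np.unique(X[:, col], return_counts=True)
        levels.append(values[np.argmax(counts)])
    return levels

base_levels = most_common_levels(X, categorical)

# drop takes one category per encoded feature; that category gets no output column.
encoder = OneHotEncoder(drop=base_levels)
ct = ColumnTransformer(
    [("onehot", encoder, categorical)],
    remainder="passthrough",  # keep the numeric column as it is
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(X)
```

Passing drop="first" instead of a list would drop the first level of every feature, which is what the question originally asked for.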

Source

2017-09-18 01:11:23 tnbalankura
