2015-12-02

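Forming a sparse feature matrix data frame in pandas

I want to explode the "features" column of this data frame to create a new data frame in which those features become the column names.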

For example, from this:

Raw matrix

to this:

Features matrix

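My solution works, but I don't think it is very good because it uses a lot of for loops. Is there a better way that takes advantage of the features of the pandas.DataFrame class?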

The code that generates the features matrix is as follows:

import pandas as pd

def feature_data_frame_by_exploding_column(input_df, col_name):

    # Create data frame with same columns minus the column you want to explode
    df = input_df.copy()
    del df[col_name]

    # The items that you want to become new features
    all_new_features = []
    new_feature_list = input_df[col_name].values
    for ingred_list in new_feature_list:
        all_new_features.extend(ingred_list) # Extend vs append!

    # Add new features as columns of zeros
    for feature in all_new_features:
        df[feature] = 0

    # For each row in data frame set values that need to be 1
    # (pair each index label with its row so a non-default index also works)
    for index, ingreds_arr in zip(df.index, new_feature_list):
        df.loc[index, ingreds_arr] = 1

    return df

df = pd.DataFrame(columns = ["id", "features"]) 
df['id'] = [0,1] 
df['features'] = [["A", "B"], ["C", "D"]] 
df 

feature_data_frame_by_exploding_column(df,"features") 

Answer

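scikit-learn's MultiLabelBinarizer creates a binary matrix from lists of labels. By specifying MultiLabelBinarizer(sparse_output=True) you get a truly sparse output, which is useful if the number of distinct features is large.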

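from sklearn.preprocessing import MultiLabelBinarizer
feature = df["features"]  # the column of label lists from the question's data frame
mlb = MultiLabelBinarizer()
new_array = mlb.fit_transform(feature)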
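As a rough illustration of the sparse_output=True option mentioned above, the sketch below reuses the two toy label lists from the question. The variable names labels and sparse_matrix are only illustrative, and the SciPy import is there just to check the result type.

```python
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer

# Same toy label lists as in the question (illustrative data)
labels = [["A", "B"], ["C", "D"]]

# sparse_output=True asks for a SciPy sparse matrix instead of a dense array
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_matrix = mlb.fit_transform(labels)

print(sparse.issparse(sparse_matrix))  # whether the result is a SciPy sparse matrix
print(list(mlb.classes_))              # column order of the indicator matrix
print(sparse_matrix.toarray())         # densified only for display
```

Densifying with toarray() defeats the purpose on large data; it is used here only to make the small example readable.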

Additionally, you can extract the features column from the pandas data frame and pass it to fit_transform.


Example output:

>>> MultiLabelBinarizer().fit_transform([["A", "B"], ["C", "D"]])
array([[1, 1, 0, 0],
       [0, 0, 1, 1]])
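To get from that array back to the data frame the question asks for, one possible approach is to wrap the result in a DataFrame labelled with mlb.classes_ and join it to the remaining columns. This is a minimal sketch that assumes the question's toy data; the names indicators and result are illustrative, and it assumes no feature name collides with an existing column such as "id".

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data frame from the question
df = pd.DataFrame({"id": [0, 1], "features": [["A", "B"], ["C", "D"]]})

mlb = MultiLabelBinarizer()
indicators = pd.DataFrame(
    mlb.fit_transform(df["features"]),  # dense 0/1 indicator matrix
    columns=mlb.classes_,               # one column per distinct feature
    index=df.index,                     # keep row alignment with df
)

# Drop the exploded column and attach the indicator columns
result = df.drop(columns="features").join(indicators)
print(result)
```

Passing index=df.index keeps the rows aligned even when the data frame does not use the default integer index.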