How to one-hot encode a DataFrame where each row contains a list

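I want to feed data that contains lists into a machine learning algorithm.

For example, a patient may take several drugs and have several reactions to them, and each patient also has a name. If a patient takes more than one drug, the drugs arrive as a list of two or more entries. Each patient has only one name.

I believe one-hot encoding is the right approach.

This is what I have done so far:

I have a DataFrame: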

import pandas
df = pandas.DataFrame([{'drug': ['drugA','drugB'], 'patient': 'john'}, {'drug': ['drugC','drugD'], 'patient': 'angel'}])

             drug patient
0  [drugA, drugB]    john
1  [drugC, drugD]   angel

I would like to get something like:

   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel

I tried this:

pandas.get_dummies(df.apply(pandas.Series).stack()).sum(level=0)

but got:

TypeError: unhashable type: 'list'


2017-04-23 Kevin
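
The error in the question can be reproduced in isolation. The sketch below assumes the cause is that `DataFrame.apply` works column by column, so wrapping each column in `pd.Series` leaves the list cells untouched, and `get_dummies` then fails when it tries to hash them. The names `stacked`, `has_lists`, and `error` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([
    {'drug': ['drugA', 'drugB'], 'patient': 'john'},
    {'drug': ['drugC', 'drugD'], 'patient': 'angel'},
])

# apply() hands whole columns to the function, so pd.Series(column)
# returns the column unchanged and the cells are still Python lists.
stacked = df.apply(pd.Series).stack()
has_lists = any(isinstance(value, list) for value in stacked)

# get_dummies must hash every value to find the distinct categories,
# and lists are not hashable.
try:
    pd.get_dummies(stacked)
    error = None
except TypeError as exc:
    error = exc

print(has_lists)
print(type(error).__name__)
```

If this reading is right, the lists have to be expanded into separate rows or columns before `get_dummies` is called.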

Drawing on this answer, here is one approach:

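import pandas as pd

df = pd.DataFrame([{'drug': ['drugA', 'drugB'], 'patient': 'john'},
                   {'drug': ['drugC', 'drugD'], 'patient': 'angel'}])
# Expand each list into columns, then unstack into one long Series of drugs
s = (df.drug
       .apply(lambda x: pd.Series(x))
       .unstack())
# Attach one drug per row; each patient's index label is now repeated
df2 = (df.join(pd.DataFrame(s.reset_index(level=0, drop=True)))
         .drop(columns='drug')
         .rename(columns={0: 'drug'}))
# Merging on the repeated index pairs every row of a patient with every dummy row
(df2.merge(pd.get_dummies(df2.drug, dtype=float), left_index=True, right_index=True)
    .drop(columns='drug'))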

Output:

  patient  drugA  drugB  drugC  drugD
0    john    1.0    0.0    0.0    0.0
0    john    0.0    1.0    0.0    0.0
0    john    1.0    0.0    0.0    0.0
0    john    0.0    1.0    0.0    0.0
1   angel    0.0    0.0    1.0    0.0
1   angel    0.0    0.0    0.0    1.0
1   angel    0.0    0.0    1.0    0.0
1   angel    0.0    0.0    0.0    1.0
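
The output above still has several rows per patient, whereas the question asks for one row each. One possible way to collapse it is sketched below. It swaps in `DataFrame.explode` (which assumes pandas 0.25 or later) in place of the apply/unstack step, and the names `long`, `dummies`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'drug': ['drugA', 'drugB'], 'patient': 'john'},
    {'drug': ['drugC', 'drugD'], 'patient': 'angel'},
])

# One row per (patient, drug) pair; the original row labels are repeated.
long = df.explode('drug')

# Encode the drug column, then take the maximum per original row label
# so that each patient ends up with a single 0/1 vector.
dummies = pd.get_dummies(long['drug'], dtype=int).groupby(level=0).max()

# Both frames are indexed by the original row labels, so a plain join lines them up.
result = df[['patient']].join(dummies)
print(result)
```

Taking the maximum rather than the sum keeps the values at 0 or 1 even if a drug appears twice in one list.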


2017-04-23 02:12:23

Use:

pop to extract the column (or omit it and use drop instead)
a new DataFrame built with values and numpy.ndarray.tolist
pandas.get_dummies
groupby + max
concat to the original

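# Expand the lists into columns, one-hot encode them, then merge duplicate
# column names (transposed, since grouping along columns is deprecated in recent pandas)
df1 = (pd.get_dummies(pd.DataFrame(df.pop('drug').values.tolist()),
                      prefix='', prefix_sep='', dtype=int)
         .T.groupby(level=0).max().T)

# pop already removed 'drug' from df, so only 'patient' is appended
df1 = pd.concat([df1, df], axis=1)
print(df1)
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel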

The same approach without pop, dropping the column afterwards:

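# Same encoding as before, but df keeps its 'drug' column
df1 = (pd.get_dummies(pd.DataFrame(df['drug'].values.tolist()),
                      prefix='', prefix_sep='', dtype=int)
         .T.groupby(level=0).max().T)

# Drop the list column explicitly before appending the remaining columns
df1 = pd.concat([df1, df.drop('drug', axis=1)], axis=1)
print(df1)
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel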

replace + str.get_dummies
concat to the original

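# Convert each list to its string form, strip brackets, quotes and whitespace,
# then split the remaining text on commas
df1 = (df.pop('drug').astype(str)
         .replace([r'\[', r'\]', "'", r'\s+'], '', regex=True)
         .str.get_dummies(','))
df1 = pd.concat([df1, df], axis=1)
print(df1)
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel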

# Same idea without pop: df keeps 'drug', so it is dropped before concat
df1 = (df['drug'].astype(str)
         .replace([r'\[', r'\]', "'", r'\s+'], '', regex=True)
         .str.get_dummies(','))
df1 = pd.concat([df1, df.drop('drug', axis=1)], axis=1)
print(df1)
   drugA  drugB  drugC  drugD patient
0      1      1      0      0    john
1      0      0      1      1   angel
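
The regex-based variant depends on how a list prints as a string. A possible alternative, sketched here, joins each list with `Series.str.join` and feeds the result straight to `str.get_dummies`. This assumes every element is a string and that the separator `'|'` never occurs inside a drug name. The names `joined`, `dummies`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {'drug': ['drugA', 'drugB'], 'patient': 'john'},
    {'drug': ['drugC', 'drugD'], 'patient': 'angel'},
])

# Join each list into one delimited string, e.g. 'drugA|drugB'.
# Assumption: '|' does not appear in any drug name.
joined = df['drug'].str.join('|')

# Split on the same separator and build one 0/1 column per distinct drug.
dummies = joined.str.get_dummies(sep='|')

# Append the remaining columns, leaving the original df untouched.
result = pd.concat([dummies, df.drop(columns='drug')], axis=1)
print(result)
```

This avoids the regular expressions entirely, at the cost of having to pick a separator that is safe for the data.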


2017-04-23 05:55:26 jezrael
