2017-07-14

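Python - Add new columns with mapped values from a dictionary containing lists of values

I am trying to add one or more columns to a DataFrame from a mapped dictionary. I have a dictionary keyed by product catalog number, where each value is the standardized hierarchical naming list for that product number. Example below.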

import pandas as pd

dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame({"product": [1, 2, 3]}) 
df['catagory'] = df['product'].map(dict) 
print(df) 

I get the following result:

product  catagory 
0  1 [a, b, c, d] 
1  2 [w, x, y, z] 
2  3   NaN 

I would like to get the following:

 product  cat1  cat2  cat3  cat4 
0  1   a  b  c   d 
1  2   w  x  y   z 
2  3   NaN  NaN  NaN  NaN 

Or even better:

 product  category 
0  1   d 
1  2   z 
2  3   NaN 

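I have been struggling just to parse one of the items out of the list in the dictionary and append it to the DataFrame, but I have only found suggestions for mapping dictionaries whose lists contain a single item, as in this EXAMPLE.

Any help is appreciated.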

This may help: https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows/32470490#32470490 – Alexander

Answers

Another take, using apply, add_prefix, and reset_index:

df_out = (df.set_index('product')['catagory'] 
    .apply(lambda x:pd.Series(x))) 

df_out.columns = df_out.columns + 1 

df_out.add_prefix('cat').reset_index() 

Output:

product cat1 cat2 cat3 cat4 
0  1 a b c d 
1  2 w x y z 
2  3 NaN NaN NaN NaN 

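To get toward the "even better" result, stack the wide columns into a single category column (this yields one row per list element rather than only the last element):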

(df.set_index('product')['catagory'] 
    .apply(lambda x:pd.Series(x)) 
    .stack(dropna=False) 
    .rename('category') 
    .reset_index() 
    .drop('level_1',axis=1) 
    .drop_duplicates() 
) 

Output:

product category 
0  1  a 
1  1  b 
2  1  c 
3  1  d 
4  2  w 
5  2  x 
6  2  y 
7  2  z 
8  3  NaN 
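
If only the last element of each list is wanted, as in the "even better" table of the question, one possible sketch is to index into the mapped lists with the .str accessor. This is not part of the answer above; it assumes the mapped column holds Python lists or NaN, and the names d, mapped, and out are only illustrative.

```python
import pandas as pd

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame({"product": [1, 2, 3]})

# Map each product to its list; products missing from d become NaN
mapped = df['product'].map(d)

# .str[-1] takes the last element of each list and leaves NaN untouched
out = df.assign(category=mapped.str[-1])
print(out)
```

Because the lookup stays vectorized, no row-wise apply is needed for this case.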

Note

Don't use names such as list, type, or dict as variable names, because doing so masks the built-in functions.

So if you use:

# dict is used as a variable name here
dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
# creating a dictionary with dict() is no longer possible, because dict now refers to the dictionary above
print (dict(a=1, b=2))

You get the error:

TypeError: 'dict' object is not callable

and this is very hard to debug. (After testing, restart the IDE.)

So use a different variable name, such as d or categories:

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']} 
print (dict(a=1, b=2)) 
{'a': 1, 'b': 2} 
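
The shadowing described above can be reproduced in isolation. The following is a minimal sketch, not code from the answer: the function name shadowing_demo is hypothetical, and the shadowing is confined to a function so that it does not leak into the rest of the session.

```python
import builtins


def shadowing_demo():
    # Inside this function, dict is rebound to a plain dictionary object
    dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
    try:
        dict(a=1, b=2)  # the name no longer refers to the built-in type
    except TypeError as exc:
        message = str(exc)
    # The built-in type is still reachable through the builtins module
    recovered = builtins.dict(a=1, b=2)
    return message, recovered


message, recovered = shadowing_demo()
print(message)
print(recovered)
```

In an interactive session the rebinding would persist at module level until the name is deleted or the session is restarted, which is presumably why the answer suggests restarting the IDE.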

I think you need DataFrame.from_dict with join:

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']} 
df = pd.DataFrame({"product": [1, 2, 3]}) 
print (df) 
    product 
0  1 
1  2 
2  3 

df1 = pd.DataFrame.from_dict(d, orient='index') 
df1.columns = ['cat' + (str(i+1)) for i in df1.columns] 
print(df1) 
    cat1 cat2 cat3 cat4 
1 a b c d 
2 w x y z 

df2 = df.join(df1, on='product') 
print (df2) 
    product cat1 cat2 cat3 cat4 
0  1 a b c d 
1  2 w x y z 
2  3 NaN NaN NaN NaN 
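
The from_dict and join steps above could be wrapped in a small helper. This is only a sketch: the function name add_category_columns and its key and prefix parameters are hypothetical, not part of pandas or of the answer.

```python
import pandas as pd


def add_category_columns(df, mapping, key="product", prefix="cat"):
    """Join one column per list position in `mapping` onto `df`."""
    # One row per dictionary key, one column per list position (0, 1, 2, ...)
    wide = pd.DataFrame.from_dict(mapping, orient="index")
    # Rename the positional columns to cat1, cat2, ...
    wide.columns = [f"{prefix}{i + 1}" for i in wide.columns]
    # Left join on the key column; keys missing from `mapping` get NaN
    return df.join(wide, on=key)


d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame({"product": [1, 2, 3]})
result = add_category_columns(df, d)
print(result)
```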

Then you can use melt or stack:

df3 = df2.melt('product', value_name='category').drop('variable', axis=1) 
print (df3) 
    product category 
0   1  a 
1   2  w 
2   3  NaN 
3   1  b 
4   2  x 
5   3  NaN 
6   1  c 
7   2  y 
8   3  NaN 
9   1  d 
10  2  z 
11  3  NaN 

df2 = (df.set_index('product').join(df1)
         .stack(dropna=False)
         .reset_index(level=1, drop=True)
         .rename('category')
         .reset_index())
print (df2) 
    product category 
0   1  a 
1   1  b 
2   1  c 
3   1  d 
4   2  w 
5   2  x 
6   2  y 
7   2  z 
8   3  NaN 
9   3  NaN 
10  3  NaN 
11  3  NaN 
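
A similar long format can also be sketched with DataFrame.explode, which was added to pandas (version 0.25) after this answer was written. This is a different technique from the melt and stack calls above, and it keeps a single NaN row for a product that has no entry in d rather than four.

```python
import pandas as pd

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame({"product": [1, 2, 3]})

long_df = (df.assign(category=df['product'].map(d))  # lists, or NaN if unmapped
             .explode('category')                    # one row per list element
             .reset_index(drop=True))                # replace the repeated index
print(long_df)
```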

If the category column is already in df, the solution is similar; it is only necessary to first remove the rows with NaN using DataFrame.dropna:

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']} 
df = pd.DataFrame({"product": [1, 2, 3]}) 
df['category'] = df['product'].map(d) 
print(df) 

df1 = df.dropna(subset=['category']) 
df1 = pd.DataFrame(df1['category'].values.tolist(), index=df1['product']) 
df1.columns = ['cat' + (str(i+1)) for i in df1.columns] 
print(df1) 
     cat1 cat2 cat3 cat4 
product      
1   a b c d 
2   w x y z 

df2 = df[['product']].join(df1, on='product') 
print (df2) 
    product cat1 cat2 cat3 cat4 
0  1 a b c d 
1  2 w x y z 
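2  3 NaN NaN NaN NaN

import numpy as np

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}

# Split the mapped list into 4 columns (df is the product DataFrame from the question)
df[['product']].join(
    df.apply(lambda x: pd.Series(d.get(x['product'], [np.nan])), axis=1)
      # rename relabels the columns; rename_axis no longer accepts a function for this
      .rename(columns=lambda x: 'cat{}'.format(x + 1))
    )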
Out[187]: 
    product cat1 cat2 cat3 cat4 
0  1 a b c d 
1  2 w x y z 
2  3 NaN NaN NaN NaN 

#only take the last element 
df['catagory'] = df.apply(lambda x: d.get(x['product'],[np.nan])[-1],axis=1) 

df 
Out[171]: 
    product catagory 
0  1  d 
1  2  z 
2  3  NaN
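
As a possible alternative to the row-wise apply, the dictionary can be reduced to its last elements first and then mapped directly. This is a sketch under the assumption that every list in d is non-empty; the name last_item is illustrative.

```python
import pandas as pd

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame({"product": [1, 2, 3]})

# Keep only the last element of each list (assumes no list is empty)
last_item = {key: values[-1] for key, values in d.items()}

# Products without an entry in last_item become NaN
df['category'] = df['product'].map(last_item)
print(df)
```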