2016-12-14 94 views
3

Python - pandas DataFrame with tuples

I have a DataFrame like this:

 A  B  C  D 
0 (a,b) (c,d) (e,f) (g,h) 
1 (a,b) (c,d) (e,f) NaN 
2 (a,b) NaN (e,f) NaN 
3 (a,b) NaN  NaN  NaN 

So each cell holds a tuple, and I would like it to look like this:

| A  |  B  |  C  |  D 
0 | a | b | c | d | e | f | g | h 
1 | a | b | c | d | e | f | NaN | NaN 
2 | a | b | NaN | NaN | e | f | NaN | NaN 
3 | a | b | NaN | NaN | NaN | NaN | NaN | NaN 

That is, column A would contain two sub-columns.

Thanks.

+0

Why don't you want to create two columns per letter (e.g. 'A1' and 'A2')? – MMF

Answers

2

You can use stack and DataFrame.from_records, then reshape with unstack, swap the levels of the MultiIndex columns with swaplevel, and finally sort the columns with sort_index:

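import numpy as np 
import pandas as pd 

stacked = df.stack() 
# The parentheses let the method chain span several lines. 
df1 = (pd.DataFrame.from_records(stacked.tolist(), index=stacked.index) 
         .unstack(1) 
         .swaplevel(0, 1, axis=1) 
         .sort_index(axis=1) 
         .replace({None: np.nan})) 
print(df1) 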

    A  B   C   D  
    0 1 0 1 0 1 0 1 
0 a b c d e f g h 
1 a b c d e f NaN NaN 
2 a b NaN NaN e f NaN NaN 
3 a b NaN NaN NaN NaN NaN NaN 
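To make the reshaping easier to follow, here is a minimal step-by-step sketch of the same chain on the frame from the question. It is an illustration rather than the answer's exact code: it assumes every tuple has two elements, adds an explicit dropna() after stack() so that empty cells are removed whatever the pandas version does by default, and uses the plain DataFrame constructor in place of from_records. The names parts and wide are made up for this example.

```python
import numpy as np
import pandas as pd

# Small frame with the same layout as the one in the question.
df = pd.DataFrame({
    "A": [("a", "b"), ("a", "b"), ("a", "b"), ("a", "b")],
    "B": [("c", "d"), ("c", "d"), np.nan, np.nan],
    "C": [("e", "f"), ("e", "f"), ("e", "f"), np.nan],
    "D": [("g", "h"), np.nan, np.nan, np.nan],
})

# Step 1: one entry per (row label, column label) pair.
# dropna() makes the removal of empty cells explicit (assumption: we do
# not rely on stack() dropping them by itself).
stacked = df.stack().dropna()

# Step 2: split every tuple into two positional columns, 0 and 1.
parts = pd.DataFrame(stacked.tolist(), index=stacked.index)

# Step 3: move the letters back to the columns, put the letter level
# first, and sort so that A0, A1, B0, B1, ... end up side by side.
wide = parts.unstack(1).swaplevel(0, 1, axis=1).sort_index(axis=1)

print(stacked.shape, parts.shape, wide.shape)
```

Printing the three shapes shows how the ten filled cells become ten two-column rows and then a frame with four rows and eight columns.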

Finally, you can optionally drop the MultiIndex from the columns and create new column names:

stacked = df.stack() 
df1 = (pd.DataFrame.from_records(stacked.tolist(), index=stacked.index) 
         .unstack(1) 
         .swaplevel(0, 1, axis=1) 
         .sort_index(axis=1) 
         .replace({None: np.nan})) 
# Join the two column levels into flat names such as 'A0', 'A1'. 
df1.columns = ['{}{}'.format(col[0], col[1]) for col in df1.columns] 
print(df1) 
    A0 A1 B0 B1 C0 C1 D0 D1 
0 a b c d e f g h 
1 a b c d e f NaN NaN 
2 a b NaN NaN e f NaN NaN 
3 a b NaN NaN NaN NaN NaN NaN 

Timings

#len (df)=400 

In [220]: %timeit (pir(df)) 
100 loops, best of 3: 3.45 ms per loop 

In [221]: %timeit (jez(df)) 
100 loops, best of 3: 5.17 ms per loop 

In [222]: %timeit (nick(df)) 
1 loop, best of 3: 231 ms per loop 

In [223]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan})) 
10 loops, best of 3: 152 ms per loop 


#len (df)=4k 

In [216]: %timeit (pir(df)) 
100 loops, best of 3: 16.5 ms per loop 

In [217]: %timeit (jez(df)) 
100 loops, best of 3: 14.8 ms per loop 

In [218]: %timeit (nick(df)) 
1 loop, best of 3: 2.34 s per loop 

In [219]: %timeit (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, 1).sort_index(1).replace({None:np.nan})) 
1 loop, best of 3: 1.53 s per loop 
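The numbers above come from IPython's %timeit. Outside IPython, a comparable measurement could be sketched with the standard timeit module, as below. The function split_tuples is a hypothetical stand-in for whichever function is being timed (it is not one of the functions from the answers), and the repeat counts are arbitrary choices, so the result will not match the figures above.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "A": [("a", "b"), ("a", "b"), ("a", "b"), ("a", "b")],
    "B": [("c", "d"), ("c", "d"), np.nan, np.nan],
    "C": [("e", "f"), ("e", "f"), ("e", "f"), np.nan],
    "D": [("g", "h"), np.nan, np.nan, np.nan],
})
# 4 rows repeated 100 times gives the 400-row case.
df = pd.concat([base] * 100, ignore_index=True)


def split_tuples(frame):
    """Hypothetical stand-in for the stack-based function being timed."""
    stacked = frame.stack().dropna()
    parts = pd.DataFrame(stacked.tolist(), index=stacked.index)
    return parts.unstack(1).swaplevel(0, 1, axis=1).sort_index(axis=1)


# Three runs of ten calls each; report the fastest run per call.
runs = timeit.repeat(lambda: split_tuples(df), number=10, repeat=3)
best_ms = min(runs) / 10 * 1000
print("best of 3: {:.2f} ms per loop".format(best_ms))
```

Taking the minimum over the repeats mirrors the "best of 3" wording that %timeit used in the session above.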

Code for the timings

import numpy as np 
import pandas as pd 

df = pd.DataFrame({"A": [('a','b'),('a','b'),('a','b'),('a','b')], 
        'B': [('c','d'),('c','d'), np.nan,np.nan], 
        'C':[('e','f'),('e','f'),('e','f'),np.nan], 
        'D':[('g','h'),np.nan,np.nan,np.nan]}) 

df = pd.concat([df]*1000).reset_index(drop=True) 
print (df) 

def jez(df): 
    stacked = df.stack() 
    return pd.DataFrame.from_records(stacked.tolist(), index=stacked.index).unstack(1).swaplevel(0, 1, axis=1).sort_index(axis=1).replace({None: np.nan}) 


print (df.stack().apply(pd.Series).unstack().swaplevel(0, 1, axis=1).sort_index(axis=1).replace({None: np.nan})) 

def nick(df): 
    cols = df.columns.values.tolist() 
    return pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols) 

def pir(df): 
    # fillna with (np.nan, np.nan) 
    df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2)) 
    # construct MultiIndex 
    col = pd.MultiIndex.from_product([df.columns, [0, 1]]) 
    # rip off of Nickil's pd.concat but using numpy 
    return pd.DataFrame(np.hstack([np.array(s.values.tolist()) for _, s in df_.items()]), columns=col) 


print (jez(df)) 
print (nick(df)) 
print (pir(df)) 
+0

I'm working on improving it – piRSquared

1

Method 1
stack + apply

df.stack().apply(pd.Series).unstack().swaplevel(0, 1, axis=1).sort_index(axis=1) 


Method 2

# fillna with (np.nan, np.nan) 
df_ = df.stack().unstack(fill_value=tuple([np.nan] * 2)) 
# construct MultiIndex 
col = pd.MultiIndex.from_product([df.columns, [0, 1]]) 
# rip off of Nickil's pd.concat but using numpy 
pd.DataFrame(
    np.hstack(
        [np.array(s.values.tolist()) 
         for _, s in df_.items()] 
    ), columns=col) 
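The same idea can be written as a self-contained sketch. This is a variation rather than the code above: instead of relying on unstack(fill_value=...), it fills empty cells with a plain list comprehension, and it assumes every tuple has the same length. The function name split_with_numpy and its width parameter are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [("a", "b"), ("a", "b"), ("a", "b"), ("a", "b")],
    "B": [("c", "d"), ("c", "d"), np.nan, np.nan],
    "C": [("e", "f"), ("e", "f"), ("e", "f"), np.nan],
    "D": [("g", "h"), np.nan, np.nan, np.nan],
})


def split_with_numpy(frame, width=2):
    """Hypothetical helper: expand tuple cells into `width` sub-columns."""
    filler = (np.nan,) * width
    # One (n_rows, width) object array per original column; cells that
    # are not tuples (the NaN cells here) are replaced by a NaN tuple.
    blocks = [
        np.array(
            [v if isinstance(v, tuple) else filler for v in frame[c]],
            dtype=object,
        )
        for c in frame.columns
    ]
    # Two-level header: original label on top, tuple position below.
    cols = pd.MultiIndex.from_product([frame.columns, range(width)])
    return pd.DataFrame(np.hstack(blocks), index=frame.index, columns=cols)


out = split_with_numpy(df)
print(out)
```

Stacking the blocks side by side with np.hstack keeps the work in NumPy, which is the point of this method.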


+0

Hmm, I think repeating the column names in the second solution is a bad idea, what do you think? – jezrael

+0

@jezrael I decided to change it – piRSquared

+0

OK, I'll add your code to the timings. – jezrael

2

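Split the tuples in each Series into individual elements using apply. Then concatenate all of the resulting columns and use the keys parameter to give them the same headers as the original DF.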

cols = df.columns.values.tolist() 
pd.concat([df[col].apply(pd.Series) for col in cols], axis=1, keys=cols) 
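To show what the keys parameter contributes, here is a small sketch on a two-column frame. It builds each piece with the plain DataFrame constructor instead of apply(pd.Series), so it is a variation on the answer and not the same code. The helper split_column is hypothetical, and it assumes all tuples have the same length.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [("a", "b"), ("a", "b")],
    "B": [("c", "d"), np.nan],
})


def split_column(series, width=2):
    """Hypothetical helper: one column of tuples -> `width` plain columns."""
    filler = (np.nan,) * width
    rows = [v if isinstance(v, tuple) else filler for v in series]
    return pd.DataFrame(rows, index=series.index)


cols = df.columns.tolist()
# keys=cols puts each original label on top of the piece it came from,
# which produces the two-level column header.
result = pd.concat([split_column(df[c]) for c in cols], axis=1, keys=cols)
print(result.columns.tolist())
```

Without keys, the concatenated frame would just have the repeated labels 0 and 1, and the link back to the original columns would be lost.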
