Solution with MultiIndex and dropna, extracting the non-numeric prefix of each column name:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'},
'Address2': {0: np.nan, 1: np.nan},
'Address3': {0: 'def', 1: 'def'},
'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'},
'Address4': {0: np.nan, 1: np.nan},
'Phone1': {0: '9091-XYz', 1: 'Z9091-XYz'},
'Phone3': {0: np.nan, 1: 'aaa'},
'Phone2': {0: np.nan, 1: np.nan}})
print (df)
   Address1  Address2  Address3  Address4     Phone1  Phone2  Phone3   Phone4
0       ABC       NaN       def       NaN   9091-XYz     NaN     NaN  XYZ-ABZ
1       ABC       NaN       def       NaN  Z9091-XYz     NaN     aaa  XYZ-ABZ
#multiindex from columns of df
cols = df.columns.str.extract(r'([A-Za-z]+)(\d+)', expand=True).values.tolist()
mux = pd.MultiIndex.from_tuples(cols)
df.columns = mux
print (df)
   Address                     Phone
         1    2    3    4          1    2    3        4
0      ABC  NaN  def  NaN   9091-XYz  NaN  NaN  XYZ-ABZ
1      ABC  NaN  def  NaN  Z9091-XYz  NaN  aaa  XYZ-ABZ
#unstack, remove NaN rows, convert to df (because cumcount)
df1 = df.unstack().dropna().reset_index(level=1, drop=True).to_frame()
#create new level of index
df1['g'] = (df1.groupby(level=[0,1]).cumcount() + 1).astype(str)
#add column g to multiindex
df1.set_index('g', append=True, inplace=True)
#reshape to original
df1 = df1.unstack(level=[0,2])
#remove first level of multiindex of column (0 from to_frame)
df1.columns = df1.columns.droplevel(0)
#reindex and replace None to NaN
df1 = df1.reindex(columns=mux).replace({None: np.nan})
#'reset' multiindex in columns
df1.columns = [''.join(col) for col in df1.columns]
print (df1)
   Address1  Address2  Address3  Address4     Phone1   Phone2   Phone3  Phone4
0       ABC       def       NaN       NaN   9091-XYz  XYZ-ABZ      NaN     NaN
1       ABC       def       NaN       NaN  Z9091-XYz      aaa  XYZ-ABZ     NaN
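As a cross-check, the same idea (drop the NaN values, renumber what is left within each row and prefix using cumcount, then reshape back) can be sketched in long format with melt and pivot. This is only my sketch and not the code above: the names `long`, `prefix`, `num` and `new_col` are made up for illustration, and it assumes pandas 1.1 or newer (for `ignore_index` in `melt`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Address1': ['ABC', 'ABC'],
    'Address2': [np.nan, np.nan],
    'Address3': ['def', 'def'],
    'Address4': [np.nan, np.nan],
    'Phone1': ['9091-XYz', 'Z9091-XYz'],
    'Phone2': [np.nan, np.nan],
    'Phone3': [np.nan, 'aaa'],
    'Phone4': ['XYZ-ABZ', 'XYZ-ABZ'],
})

# Long format: one record per (original row, column); 'row' keeps the row label.
long = df.melt(ignore_index=False, var_name='col').rename_axis('row').reset_index()
long = long.dropna(subset=['value'])

# Split each column name into its letter prefix and its numeric suffix.
long['prefix'] = long['col'].str.extract(r'([A-Za-z]+)', expand=False)
long['num'] = long['col'].str.extract(r'(\d+)', expand=False).astype(int)

# Keep the original left-to-right order, then renumber 1, 2, ... per row and prefix.
long = long.sort_values(['row', 'prefix', 'num'])
counter = long.groupby(['row', 'prefix']).cumcount() + 1
long['new_col'] = long['prefix'] + counter.astype(str)

# Back to wide format; columns that ended up empty come back as all-NaN.
result = (long.pivot(index='row', columns='new_col', values='value')
              .reindex(columns=df.columns))
print(result)
```

For the sample frame this should give the same layout as the output above: in row 1, `aaa` moves into Phone2 and `XYZ-ABZ` into Phone3.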
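Old solution:
I found another problem that shows up when the DataFrame has more rows, so you can use a double apply instead. The drawback of this solution is that the values in each row come out in the wrong order (they are sorted rather than kept in their original column order):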
df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 'Address2': {0: np.nan, 1: np.nan}, 'Address3': {0: 'def', 1: 'def'}, 'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 'Address4': {0: np.nan, 1: np.nan}, 'Phone1': {0: '9091-XYz', 1: '9091-XYz'}, 'Phone3': {0: np.nan, 1: 'aaa'}, 'Phone2': {0: np.nan, 1: np.nan}})
print (df)
   Address1  Address2  Address3  Address4    Phone1  Phone2  Phone3   Phone4
0       ABC       NaN       def       NaN  9091-XYz     NaN     NaN  XYZ-ABZ
1       ABC       NaN       def       NaN  9091-XYz     NaN     aaa  XYZ-ABZ
cols = df.columns.str.extract(r'([A-Za-z]+)(\d+)', expand=True).values.tolist()
mux = pd.MultiIndex.from_tuples(cols)
df.columns = mux
# note: written for an older pandas; recent releases deprecate groupby(axis=1)
df = (df.groupby(axis=1, level=0)
        .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1)))
df.columns = [''.join(col) for col in df.columns]
print (df)
   Address1  Address2  Address3  Address4    Phone1   Phone2  Phone3  Phone4
0       ABC       def       NaN       NaN  9091-XYz  XYZ-ABZ     NaN     NaN
1       ABC       def       NaN       NaN  9091-XYz  XYZ-ABZ     aaa     NaN
I also tried modifying piRSquared's solution; with it you don't need a MultiIndex:
coltype = df.columns.str.extract(r'([A-Za-z]+)', expand=False)
print (coltype)
Index(['Address', 'Address', 'Address', 'Address', 'Phone', 'Phone', 'Phone',
'Phone'],
dtype='object')
df = (df.groupby(coltype, axis=1)
        .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1)))
print (df)
   Address1  Address2  Address3  Address4    Phone1   Phone2  Phone3  Phone4
0       ABC       def       NaN       NaN  9091-XYz  XYZ-ABZ     NaN     NaN
1       ABC       def       NaN       NaN  9091-XYz  XYZ-ABZ     aaa     NaN
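The ordering problem of the sort_values versions can be avoided by pushing the non-NaN values to the left without sorting them. The following is a rough sketch of that idea, not part of either solution above: `push_left` is a hypothetical helper name, and the plain per-prefix loop is used so that it does not depend on groupby(axis=1).

```python
import numpy as np
import pandas as pd

def push_left(row):
    """Move the non-NaN values of a row to the front, keeping their order."""
    kept = row.dropna().tolist()
    padding = [np.nan] * (len(row) - len(kept))
    return pd.Series(kept + padding, index=row.index)

df = pd.DataFrame({
    'Address1': ['ABC', 'ABC'],
    'Address2': [np.nan, np.nan],
    'Address3': ['def', 'def'],
    'Address4': [np.nan, np.nan],
    'Phone1': ['9091-XYz', '9091-XYz'],
    'Phone2': [np.nan, np.nan],
    'Phone3': [np.nan, 'aaa'],
    'Phone4': ['XYZ-ABZ', 'XYZ-ABZ'],
})

# Letter prefix of every column name, e.g. 'Address' or 'Phone'.
prefix = df.columns.str.extract(r'([A-Za-z]+)', expand=False)

# Handle each prefix block separately, then glue the blocks back together.
parts = [df.loc[:, prefix == p].apply(push_left, axis=1) for p in prefix.unique()]
result = pd.concat(parts, axis=1)
print(result)
```

With this version the Phone values of row 1 stay in their original order (`9091-XYz`, `aaa`, `XYZ-ABZ`), whereas the sorted versions put `XYZ-ABZ` before `aaa`. It assumes the columns of each prefix already appear in numeric order, as they do in this sample.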
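Using the MultiIndex on the sample cuts the output time by about 3x. –
Yes, but maybe there is another problem: are all the NaNs in one column? Or are only some values in a column NaN while others are not? – jezrael
I am thinking of 'df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 'Address2': {0: np.nan, 1: np.nan}, 'Address3': {0: 'def', 1: 'def'}, 'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 'Address4': {0: np.nan, 1: np.nan}, 'Phone1': {0: '9091-XYz', 1: '9091-XYz'}, 'Phone3': {0: np.nan, 1: 'aaa'}, 'Phone2': {0: np.nan, 1: np.nan}})'; look at the second row, which has a value in 'Phone3'. – jezrael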