2016-09-07 40 views
3

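Move non-null cells to the left within grouped columns in pandas

I have a DataFrame that contains several groups of similarly named columns. Within each group, I want empty cells to be filled by shifting in the values from the columns to their right.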

Address1  Address2  Address3  Address4  Phone1  Phone2  Phone3  Phone4 
ABC   nan   def   nan   9091-XYz nan  nan  XYZ-ABZ 

After shifting, the columns should look like this:

Address1  Address2  Address3  Address4  Phone1  Phone2  Phone3  Phone4 
ABC   def   nan   nan   9091-XYz XYZ-ABZ nan  nan 

There is another question that addresses a similar problem. This is what I have so far:

pdf = pd.read_csv('Data.txt',sep='\t') 

# gets a set of columns removing the numerical part 
columns = set(map(lambda x : x.rstrip('0123456789'),pdf.columns)) 

for col_pattern in columns: 
    # get columns with similar names 
    current = [col for col in pdf.columns if col_pattern in col] 
    coldf= pdf[current] 
    # shift columns to the left 

The columns in Data.txt are sorted by name, so columns with similar names sit next to each other.

Any help is appreciated.

I tried adapting the following code from the question linked above, but it ran out of memory:

newdf=pd.read_csv(StringIO(u''+re.sub(',+',',',df.to_csv()).decode('utf-8'))) 
    list_.append(newdf) 
pd.concat(list_,axis=0).to_csv('test.txt') 

Answers

3

A solution using a MultiIndex and dropna, after extracting the non-numeric prefix of each column name:

import pandas as pd 
import numpy as np 

df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 
        'Address2': {0: np.nan, 1: np.nan}, 
        'Address3': {0: 'def', 1: 'def'}, 
        'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 
        'Address4': {0: np.nan, 1: np.nan}, 
        'Phone1': {0: '9091-XYz', 1: 'Z9091-XYz'}, 
        'Phone3': {0: np.nan, 1: 'aaa'}, 
        'Phone2': {0: np.nan, 1: np.nan}}) 

print (df) 
  Address1 Address2 Address3 Address4     Phone1 Phone2 Phone3   Phone4
0      ABC      NaN      def      NaN   9091-XYz    NaN    NaN  XYZ-ABZ
1      ABC      NaN      def      NaN  Z9091-XYz    NaN    aaa  XYZ-ABZ

#multiindex from columns of df 
cols = df.columns.str.extract(r'([A-Za-z]+)(\d+)', expand=True).values.tolist() 

mux = pd.MultiIndex.from_tuples(cols) 
df.columns = mux 
print (df) 
  Address                      Phone
        1    2    3    4          1    2    3        4
0     ABC  NaN  def  NaN   9091-XYz  NaN  NaN  XYZ-ABZ
1     ABC  NaN  def  NaN  Z9091-XYz  NaN  aaa  XYZ-ABZ

#unstack, remove NaN rows, convert to df (because cumcount) 
df1 = df.unstack().dropna().reset_index(level=1, drop=True).to_frame() 
#create new level of index 
df1['g'] = (df1.groupby(level=[0,1]).cumcount() + 1).astype(str) 
#add column g to multiindex 
df1.set_index('g', append=True, inplace=True) 
#reshape to original 
df1 = df1.unstack(level=[0,2]) 
#remove first level of multiindex of column (0 from to_frame) 
df1.columns = df1.columns.droplevel(0) 
#reindex and replace None to NaN 
df1 = df1.reindex(columns=mux).replace({None: np.nan}) 
#'reset' multiindex in columns 
df1.columns = [''.join(col) for col in df1.columns] 
print (df1) 
  Address1 Address2 Address3 Address4     Phone1   Phone2   Phone3 Phone4
0      ABC      def      NaN      NaN   9091-XYz  XYZ-ABZ      NaN    NaN
1      ABC      def      NaN      NaN  Z9091-XYz      aaa  XYZ-ABZ    NaN
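The renumbering idea behind the steps above (drop the NaN cells, then number the surviving values within each row and prefix with cumcount) can also be sketched in long format with melt and pivot instead of unstack. This is only an illustrative variant, not the answer's code. The names `long`, `prefix`, `slot` and `new_col` are made up here, and the sketch assumes every column name is letters followed by digits.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Address1": ["ABC", "ABC"],
    "Address2": [np.nan, np.nan],
    "Address3": ["def", "def"],
    "Address4": [np.nan, np.nan],
    "Phone1": ["9091-XYz", "Z9091-XYz"],
    "Phone2": [np.nan, np.nan],
    "Phone3": [np.nan, "aaa"],
    "Phone4": ["XYZ-ABZ", "XYZ-ABZ"],
})

# Long format: one record per (row, column) cell, with the empty cells dropped.
long = (
    df.rename_axis("row").reset_index()
      .melt(id_vars="row", var_name="col", value_name="value")
      .dropna(subset=["value"])
)

# Non-numeric prefix of each original column name (assumed pattern: letters + digits).
prefix = long["col"].str.extract(r"([A-Za-z]+)\d+", expand=False)

# Position of each surviving value within its (row, prefix) group, starting at 1.
slot = long.groupby(["row", prefix]).cumcount() + 1

# Rebuild the wide layout under the renumbered names, restoring the original shape.
result = (
    long.assign(new_col=prefix + slot.astype(str))
        .pivot(index="row", columns="new_col", values="value")
        .reindex(index=df.index, columns=df.columns)
)
print(result)
```

Because melt walks the columns from left to right, the values keep their original order within each group. This relies on the columns already being ordered by number within each prefix, as the question says they are.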

Old solution:

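I found another issue that shows up when the DataFrame has more rows; for that case you can use a double apply. The problem with this approach, however, is that the values in each row come out in the wrong order, because they are sorted instead of keeping their original left-to-right order: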

df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 'Address2': {0: np.nan, 1: np.nan}, 'Address3': {0: 'def', 1: 'def'}, 'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 'Address4': {0: np.nan, 1: np.nan}, 'Phone1': {0: '9091-XYz', 1: '9091-XYz'}, 'Phone3': {0: np.nan, 1: 'aaa'}, 'Phone2': {0: np.nan, 1: np.nan}}) 

print (df) 
  Address1 Address2 Address3 Address4    Phone1 Phone2 Phone3   Phone4
0      ABC      NaN      def      NaN  9091-XYz    NaN    NaN  XYZ-ABZ
1      ABC      NaN      def      NaN  9091-XYz    NaN    aaa  XYZ-ABZ

cols = df.columns.str.extract(r'([A-Za-z]+)(\d+)', expand=True).values.tolist() 
mux = pd.MultiIndex.from_tuples(cols) 
df.columns = mux 

# parentheses let the chained call span two lines 
df = (df.groupby(axis=1, level=0) 
        .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1))) 

df.columns = [''.join(col) for col in df.columns] 
print (df) 
  Address1 Address2 Address3 Address4    Phone1   Phone2 Phone3 Phone4
0      ABC      def      NaN      NaN  9091-XYz  XYZ-ABZ    NaN    NaN
1      ABC      def      NaN      NaN  9091-XYz  XYZ-ABZ    aaa    NaN

I also tried modifying piRSquared's solution; that way you don't need a MultiIndex:

coltype = df.columns.str.extract(r'([A-Za-z]+)', expand=False) 
print (coltype) 
Index(['Address', 'Address', 'Address', 'Address', 'Phone', 'Phone', 'Phone',
       'Phone'],
      dtype='object')

# parentheses let the chained call span two lines 
df = (df.groupby(coltype, axis=1) 
        .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1))) 
print (df) 
  Address1 Address2 Address3 Address4    Phone1   Phone2 Phone3 Phone4
0      ABC      def      NaN      NaN  9091-XYz  XYZ-ABZ    NaN    NaN
1      ABC      def      NaN      NaN  9091-XYz  XYZ-ABZ    aaa    NaN
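The ordering problem of the sort-based variants can be reproduced on a single row. The sketch below uses the second row's Phone values and compares sort_values with a stable sort keyed only on whether a cell is null. The variable names are illustrative, and the second approach is just one possible way to keep the original order.

```python
import numpy as np
import pandas as pd

row = pd.Series(
    ["9091-XYz", np.nan, "aaa", "XYZ-ABZ"],
    index=["Phone1", "Phone2", "Phone3", "Phone4"],
)

# sort_values orders the strings themselves (uppercase sorts before lowercase),
# so 'XYZ-ABZ' jumps ahead of 'aaa'; NaN goes last by default.
by_value = row.sort_values().tolist()

# Python's sorted() is stable: keying only on "is this cell null?" moves NaN
# to the end and leaves the non-null values in their original order.
keep_order = sorted(row, key=pd.isna)

print(by_value)
print(keep_order)
```

Here `by_value` starts with '9091-XYz', 'XYZ-ABZ', 'aaa', which matches the printed output above, while `keep_order` starts with '9091-XYz', 'aaa', 'XYZ-ABZ'.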
+0

On a sample, using the MultiIndex cuts the time needed to produce the output by about 3x.

+0

Yes, but maybe there is another problem: are all the NaN values in one column, or are some values in a column NaN while others are not? – jezrael

+0

I mean `df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 'Address2': {0: np.nan, 1: np.nan}, 'Address3': {0: 'def', 1: 'def'}, 'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 'Address4': {0: np.nan, 1: np.nan}, 'Phone1': {0: '9091-XYz', 1: '9091-XYz'}, 'Phone3': {0: np.nan, 1: 'aaa'}, 'Phone2': {0: np.nan, 1: np.nan}})`; look at the second row, with `Phone3`. – jezrael

2

pushna
Pushes all null values to the end of the series.

coltype
Uses a regex to extract the non-numeric prefix from each column name.

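def pushna(s): 
    # split the row into non-null and null values, each keeping its order 
    notnull = s[s.notnull()] 
    isnull = s[s.isnull()] 
    # Series.append was removed in pandas 2.0; pd.concat does the same job 
    values = pd.concat([notnull, isnull]).values 
    return pd.Series(values, s.index) 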

coltype = df.columns.to_series().str.extract(r'(\D*)', expand=False) 

df.groupby(coltype, axis=1).apply(lambda df: df.apply(pushna, axis=1)) 
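For larger frames, the row-by-row apply can be replaced with array operations on each column group. The sketch below is one possible variant under stated assumptions, not piRSquared's code: `push_left` and `prefix` are hypothetical names, the columns are split with a boolean mask rather than groupby(..., axis=1) (which recent pandas releases deprecate), and no timing has been measured here.

```python
import numpy as np
import pandas as pd

def push_left(block: pd.DataFrame) -> pd.DataFrame:
    """Move the non-null cells of every row to the left, keeping their order."""
    values = block.to_numpy(dtype=object)
    mask = block.isna().to_numpy()  # True where a cell is empty
    # A stable argsort of the boolean mask lists the non-null positions first
    # and preserves their original left-to-right order.
    order = np.argsort(mask, axis=1, kind="stable")
    shifted = np.take_along_axis(values, order, axis=1)
    return pd.DataFrame(shifted, index=block.index, columns=block.columns)

df = pd.DataFrame({
    "Address1": ["ABC", "ABC"],
    "Address2": [np.nan, np.nan],
    "Address3": ["def", "def"],
    "Address4": [np.nan, np.nan],
    "Phone1": ["9091-XYz", "9091-XYz"],
    "Phone2": [np.nan, np.nan],
    "Phone3": [np.nan, "aaa"],
    "Phone4": ["XYZ-ABZ", "XYZ-ABZ"],
})

# Same prefix extraction as coltype above: everything before the first digit.
prefix = df.columns.str.extract(r"(\D*)", expand=False)

# Shift each column group separately, then restore the original column order.
parts = [push_left(df.loc[:, prefix == p]) for p in prefix.unique()]
result = pd.concat(parts, axis=1)[df.columns]
print(result)
```

As with pushna, the values keep their original order within each row. The result has object dtype, which is acceptable for string data like this but may not be for numeric columns.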


+0

I have a CSV with 250,000 rows and have been running this on it for a while. Hopefully it finishes soon.
