2017-09-13 55 views
4

Formatting a dataframe in pandas

I have a dataframe in a very strange format:

id  Code Week1 Week2 week3 
sunday nan nan nan nan 
id  Code Week1 Week2 week3 
1  100  y  y  n 
2  200  n  y  n 
3  300  n  n  y 
Monday nan nan nan nan 
id  Code Week1 Week2 week3 
1  500  n  y  y 
2  600  y  y  y 
Tuesday nan  nan nan  nan 
id  Code Week1 Week2 week3 
1  800  n  y  y 
2  900  y  n  y  
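
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: it assumes every cell arrives as a string and that the empty cells on the day rows are NaN, which may not match what `read_excel` actually returns for a given file.

```python
import numpy as np
import pandas as pd

header = ["id", "Code", "Week1", "Week2", "week3"]
rows = [
    ["sunday", np.nan, np.nan, np.nan, np.nan],   # day marker row
    header,                                       # header repeated inside the data
    ["1", "100", "y", "y", "n"],
    ["2", "200", "n", "y", "n"],
    ["3", "300", "n", "n", "y"],
    ["Monday", np.nan, np.nan, np.nan, np.nan],
    header,
    ["1", "500", "n", "y", "y"],
    ["2", "600", "y", "y", "y"],
    ["Tuesday", np.nan, np.nan, np.nan, np.nan],
    header,
    ["1", "800", "n", "y", "y"],
    ["2", "900", "y", "n", "y"],
]
# Assumption: all values are strings; a real Excel import may yield integers.
df = pd.DataFrame(rows, columns=header)
print(df.shape)
```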

I want to reformat it as follows:

Code Day Week 
100 Sunday 1 
600 Monday 1 
900 Tuesday 1 
100 Sunday 2 
200 Sunday 2 
500 Monday 2 
600 Monday 2 
800 Tuesday 2 
300 Sunday 3 
500 Monday 3 
600 Monday 3 
800 Tuesday 3 
900 Tuesday 3 

That is, if the value for a week is y, that Code is to be visited in that week.

Is there a way to do this in pandas?

+6

Step one, people! Make sure the creator of this dataframe is never allowed to create any more dataframes. – piRSquared

+0

@piRSquared LoL. I am actually reading an Excel file in Python, and the dataframe looks like this :P. That is why I am stuck. – Shubham

+1

My eyes... they hurt... –

Answers

1

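You can use:

import numpy as np

# Label every row with the most recent day name (day rows are the ones with no Code)
df.index = df['id'].where(df['Code'].isnull()).ffill()
# Drop the repeated header rows and the day-name rows themselves
df = df[(df['Code'] != 'Code') & (df['id'] != df.index)]
df = df.rename_axis('Day').rename_axis('Week', axis=1)
# Turn 'n' into NaN so that only the 'y' cells survive stack/dropna
df = (df.set_index(['id','Code'], append=True)
        .replace({'n':np.nan})
        .stack().dropna().reset_index(name='val'))
df['Week'] = df['Week'].str.extract(r'(\d+)', expand=False).astype(int)

cols = ['Code','Day','Week']
df = df.drop(['val','id'], axis=1)[cols].sort_values(['Week','Code']).reset_index(drop=True)
print (df)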
    Code  Day Week 
0 100 sunday  1 
1 600 Monday  1 
2 900 Tuesday  1 
3 100 sunday  2 
4 200 sunday  2 
5 500 Monday  2 
6 600 Monday  2 
7 800 Tuesday  2 
8 300 sunday  3 
9 500 Monday  3 
10 600 Monday  3 
11 800 Tuesday  3 
12 900 Tuesday  3 
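
The step that does most of the work in the snippet above is `where(...).ffill()`. Here is a minimal sketch of just that step on two toy Series (the names `ids`, `codes` and `day` are made up for illustration): `where` keeps the `id` value only on rows whose `Code` is missing, i.e. the day rows, and `ffill` then carries each day name down to the rows beneath it.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-down versions of the 'id' and 'Code' columns
ids = pd.Series(["sunday", "id", "1", "2", "Monday", "id", "1"])
codes = pd.Series([np.nan, "Code", "100", "200", np.nan, "Code", "500"])

# Keep 'id' only where 'Code' is missing, then forward-fill the gaps
day = ids.where(codes.isnull()).ffill()
print(day.tolist())
# ['sunday', 'sunday', 'sunday', 'sunday', 'Monday', 'Monday', 'Monday']
```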

For a more general output that keeps the id column and all of the y/n values, remove the replace step:

df.index = df['id'].where(df['Code'].isnull()).ffill()
df = df[(df['Code'] != 'Code') & (df['id'] != df.index)]
df = df.rename_axis('Day').rename_axis('Week', axis=1)
df = df.set_index(['id','Code'], append=True).stack().reset_index(name='val')
df['Week'] = df['Week'].str.extract(r'(\d+)', expand=False).astype(int)
print (df)
     Day id Code Week val 
0 sunday 1 100  1 y 
1 sunday 1 100  2 y 
2 sunday 1 100  3 n 
3 sunday 2 200  1 n 
4 sunday 2 200  2 y 
5 sunday 2 200  3 n 
6 sunday 3 300  1 n 
7 sunday 3 300  2 n 
8 sunday 3 300  3 y 
9 Monday 1 500  1 n 
10 Monday 1 500  2 y 
11 Monday 1 500  3 y 
12 Monday 2 600  1 y 
13 Monday 2 600  2 y 
14 Monday 2 600  3 y 
15 Tuesday 1 800  1 n 
16 Tuesday 1 800  2 y 
17 Tuesday 1 800  3 y 
18 Tuesday 2 900  1 y 
19 Tuesday 2 900  2 n 
20 Tuesday 2 900  3 y 
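
If you start from a long table like the one above, getting back to the layout the question asks for should just be a filter on `val`. A small sketch, using a hand-typed subset of those rows (the names `long_df` and `visited` are made up here):

```python
import pandas as pd

# Hypothetical subset of the long table: Code 100 on sunday, Code 600 on Monday
long_df = pd.DataFrame({
    "Day": ["sunday"] * 3 + ["Monday"] * 3,
    "id": ["1"] * 3 + ["2"] * 3,
    "Code": ["100"] * 3 + ["600"] * 3,
    "Week": [1, 2, 3, 1, 2, 3],
    "val": ["y", "y", "n", "y", "y", "y"],
})

# Keep only the visited weeks, then order the columns and rows as in the question
visited = (
    long_df.loc[long_df["val"].eq("y"), ["Code", "Day", "Week"]]
    .sort_values(["Week", "Code"])
    .reset_index(drop=True)
)
print(visited)
```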
3

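Not my best work... but I don't want to try again... it hurt my soul.

import re
import numpy as np

# Drop repeated header rows, blank out the numeric ids, then forward-fill the day names
d = df.query('id != "id"').replace(dict(id={r'\d+': np.nan}), regex=True).ffill()
# After ffill the day rows are the first occurrence of each id, so duplicated() keeps only data rows
s = d[d.duplicated('id')].set_index(['id', 'Code']).replace({'y': 1, 'n': np.nan}).stack().dropna()
s.rename_axis(['Day', 'Code', 'Week']).reset_index('Week').Week.str.replace(
    'week', '', flags=re.IGNORECASE, regex=True
).reset_index()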

     Day Code Week 
0 sunday 100 1 
1 sunday 100 2 
2 sunday 200 2 
3 sunday 300 3 
4 Monday 500 2 
5 Monday 500 3 
6 Monday 600 1 
7 Monday 600 2 
8 Monday 600 3 
9 Tuesday 800 2 
10 Tuesday 800 3 
11 Tuesday 900 1 
12 Tuesday 900 3 
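
The `duplicated('id')` filter above may look odd at first, so here is a small sketch of what it seems to rely on, using a toy frame (the values are made up): once the day names are forward-filled, each day's marker row is the first row carrying that name, and `duplicated` is False only for first occurrences.

```python
import numpy as np
import pandas as pd

# Hypothetical frame after the numeric ids have been blanked out
d = pd.DataFrame({
    "id": ["sunday", np.nan, np.nan, "Monday", np.nan],
    "Code": [np.nan, "100", "200", np.nan, "500"],
}).ffill()

# The first row of each day is the marker row; every later row is real data
data_rows = d[d.duplicated("id")]
print(data_rows)
```

Note that the whole-frame `ffill` also fills `Code` on the marker rows (except the very first one), which is harmless here only because those rows are discarded.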
0

Building on @piRsquared's answer, for anyone who wants a pseudo one-liner:

In [2689]: (df.query('id != "id"').replace(dict(id={r'\d+': np.nan}), regex=True)
       .assign(id=lambda x: x['id'].ffill()).dropna()
       .set_index(['id', 'Code'])
       .replace({'y': 1, 'n': np.nan})
       .rename(columns=lambda x: x.lower().replace('week', ''))
       .stack().dropna()
       .reset_index()
       .rename(columns={'id': 'Day', 'level_2': 'Week'})
       .drop(columns=0))
Out[2689]: 
     Day Code Week 
0 sunday 100 1 
1 sunday 100 2 
2 sunday 200 2 
3 sunday 300 3 
4 Monday 500 2 
5 Monday 500 3 
6 Monday 600 1 
7 Monday 600 2 
8 Monday 600 3 
9 Tuesday 800 2 
10 Tuesday 800 3 
11 Tuesday 900 1 
12 Tuesday 900 3 
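
For comparison, the same reshaping can probably also be written with `melt` instead of `stack`. The following is only a sketch, not what either answer above uses: `tidy_visits` is a made-up helper name, and it assumes string cells, NaN on the day rows, and week columns whose names start with "week" in any letter case.

```python
import numpy as np
import pandas as pd


def tidy_visits(raw):
    """Return one row per (Code, Day, Week) whose week cell is 'y'."""
    # Day rows have no Code; carry each day name down to the rows below it.
    day = raw["id"].where(raw["Code"].isna()).ffill()
    # Keep data rows only: drop the day rows and the repeated header rows.
    keep = raw["Code"].notna() & raw["Code"].ne("Code")
    body = raw.loc[keep].assign(Day=day[keep])
    week_cols = [c for c in raw.columns if c.lower().startswith("week")]
    tall = body.melt(id_vars=["Code", "Day"], value_vars=week_cols,
                     var_name="Week", value_name="val")
    tall = tall[tall["val"].eq("y")].copy()
    # "Week1" / "week3" -> 1 / 3
    tall["Week"] = tall["Week"].str.extract(r"(\d+)", expand=False).astype(int)
    return (tall[["Code", "Day", "Week"]]
            .sort_values(["Week", "Code"])
            .reset_index(drop=True))


# Cut-down version of the frame from the question
header = ["id", "Code", "Week1", "Week2", "week3"]
raw = pd.DataFrame([
    ["sunday", np.nan, np.nan, np.nan, np.nan],
    header,
    ["1", "100", "y", "y", "n"],
    ["2", "200", "n", "y", "n"],
    ["Monday", np.nan, np.nan, np.nan, np.nan],
    header,
    ["1", "500", "n", "y", "y"],
], columns=header)
result = tidy_visits(raw)
print(result)
```

Filtering on `val == 'y'` explicitly avoids depending on whether `stack` drops NaN values, which seems to differ between pandas versions.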
+0

2689 ... time to restart the sesh. –