2017-04-03 59 views
2

I have the following dataset in a pandas DataFrame. How can I merge multiple group ids to form a single consolidated group ID?

 
group_id sub_group_id 
0   0 
0   1 
1   0 
2   0 
2   1 
2   2 
3   0  
3   0 

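But I want to combine those group IDs into a single consolidated group ID, so that every distinct (group_id, sub_group_id) pair gets its own number: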

 
group_id sub_group_id consolidated_group_id 
0   0     0 
0   1     1 
1   0     2 
2   0     3 
2   1     4 
2   2     5 
2   2     5 
3   0     6 
3   0     6 

Is there a generic or mathematical way to do this?

Answers

1

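You need to convert each row's values to a tuple and then use pd.factorize, which labels the distinct tuples in order of first appearance: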

df['consolidated_group_id'] = pd.factorize(df.apply(tuple, axis=1))[0]
print(df)

    group_id sub_group_id consolidated_group_id 
0   0    0      0 
1   0    1      1 
2   1    0      2 
3   2    0      3 
4   2    1      4 
5   2    2      5 
6   3    0      6 
7   3    0      6 
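
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample rows in the question; apart from that it uses the same `factorize` call as above, so treat it as an illustration rather than a tuned implementation.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group_id":     [0, 0, 1, 2, 2, 2, 3, 3],
    "sub_group_id": [0, 1, 0, 0, 1, 2, 0, 0],
})

# Each row becomes a hashable tuple; factorize then assigns one integer
# code per distinct tuple, numbered in order of first appearance.
row_keys = df.apply(tuple, axis=1)
df["consolidated_group_id"] = pd.factorize(row_keys)[0]
print(df)
```

Because the codes follow first appearance, the result does not depend on the rows being sorted.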

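The NumPy solution is a slightly modified version of this answer: the tuple of arrays returned by numpy.unique is reversed with [::-1] and the first element is selected with [0], which yields the inverse indices: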

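import numpy as np

a = df.values

def unique_return_inverse_2D(a):  # a is a 2D array of non-negative integers
    # Collapse each row to one integer with a mixed-radix encoding: the weight
    # of a column is the product of (max + 1) over all columns to its right.
    a1D = a.dot(np.append((a.max(0) + 1)[:0:-1].cumprod()[::-1], 1))
    # np.unique returns (unique_values, inverse); [::-1][0] picks the inverse.
    return np.unique(a1D, return_inverse=True)[::-1][0]


def unique_return_inverse_2D_viewbased(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    # View each row as one opaque (void) element so whole rows are compared.
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    return np.unique(a.view(void_dt).ravel(), return_inverse=True)[::-1][0]

df['consolidated_group_id'] = unique_return_inverse_2D(a)
df['consolidated_group_id1'] = unique_return_inverse_2D_viewbased(a)
print(df)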
    group_id sub_group_id consolidated_group_id consolidated_group_id1 
0   0    0      0      0 
1   0    1      1      1 
2   1    0      2      2 
3   2    0      3      3 
4   2    1      4      4 
5   2    2      5      5 
6   3    0      6      6 
7   3    0      6      6 
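
A shorter variant, not part of the answer above, is to let `np.unique` compare whole rows directly through its `axis` argument (available in NumPy 1.13 and later, as far as I know). The sketch below rebuilds the sample rows from the question and also shows one caveat that applies to every `np.unique`-based approach: the labels follow the sorted order of the rows, not the order of first appearance. The two agree on the question's data only because it is already sorted.

```python
import numpy as np

# Sample data copied from the question, one row per (group_id, sub_group_id) pair.
a = np.array([[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [2, 2], [3, 0], [3, 0]])

# axis=0 treats each row as one item; the inverse maps every row to the
# index of its match among the sorted unique rows.
_, inverse = np.unique(a, axis=0, return_inverse=True)
ids = np.ravel(inverse)  # ravel guards against versions that return a 2D inverse
print(ids)

# On unsorted input the labels follow sorted row order, not first appearance:
# the row [0, 1] sorts first, so it gets label 0 even though it appears second.
b = np.array([[2, 0], [0, 1], [2, 0]])
_, inverse_b = np.unique(b, axis=0, return_inverse=True)
ids_b = np.ravel(inverse_b)
print(ids_b)
```

If the IDs must follow first appearance on unsorted data, `pd.factorize` is the safer choice.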
1
cols = ['group_id', 'sub_group_id']
df.assign(
    consolidated_group_id=pd.factorize(
        pd.Series(list(zip(*df[cols].values.T.tolist())))
    )[0]
)

    group_id sub_group_id consolidated_group_id 
0   0    0      0 
1   0    1      1 
2   1    0      2 
3   2    0      3 
4   2    1      4 
5   2    2      5 
6   3    0      6 
7   3    0      6
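
Another option that neither answer uses is `GroupBy.ngroup`, which numbers the groups of a groupby directly. The sketch below assumes a pandas version that provides `ngroup` and rebuilds the sample data from the question; `sort=False` is passed so that groups are numbered in the order they first appear, mirroring what `factorize` does.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group_id":     [0, 0, 1, 2, 2, 2, 3, 3],
    "sub_group_id": [0, 1, 0, 0, 1, 2, 0, 0],
})

cols = ["group_id", "sub_group_id"]

# ngroup() returns, for every row, the number of the group it belongs to.
# With sort=False the numbering follows first appearance of each key pair.
df["consolidated_group_id"] = df.groupby(cols, sort=False).ngroup()
print(df)
```

This avoids building a tuple for every row, which may matter on large frames, though that has not been measured here.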