2017-04-24 144 views
3

Pandas DataFrame: creating a new column of labels based on other columns

I have an example pandas.DataFrame with 20K+ rows, in the following form:

import pandas as pd 
import numpy as np 

data = {"first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"], 
     "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]} 

df = pd.DataFrame(data) 

>>> df 
    first_column second_column 
0    A    0 
1    B    1 
2    B    1 
3    B    1 
4    C    0 
5    A    0 
6    A    0 
7    A    1 
8    D    1 
9    B    1 
10   A    1 
11   A    0 
.... 

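first_column holds one of A, B, C or D in each row. The second column holds a binary label that marks groups of values: every consecutive run of 1s is a distinct "group", e.g. rows 1-3 form one group and rows 7-10 form another.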
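One way to make those runs explicit is sketched below on the example data: compare each value with the one above it and take a cumulative sum of the changes, so that every consecutive run of equal values gets its own number. The column name `run_id` is an assumption introduced for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
    "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})

# True wherever second_column differs from the row above (the first row always counts).
changed = df["second_column"] != df["second_column"].shift()

# Cumulative sum of the change flags: each consecutive run gets its own integer.
# `run_id` is a hypothetical helper column, not part of the original data.
df["run_id"] = changed.cumsum()

# Keep only the runs of 1s, which are the "groups" described above.
groups = df[df["second_column"] == 1].groupby("run_id")["first_column"].apply(list)
print(groups)
```

On the example data this yields two groups: rows 1-3 and rows 7-10.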

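I would like to label each of these groups as "AB" (the group consists only of A or B), "CD" (the group consists only of C or D), or "mixed" (if there is a mixture, e.g. all B's and one C). It would also be useful to know the share of AB rows in each group, i.e. how many of the group's rows are A or B out of its total rows. So, if a group is only A and B, its identity should be AB. If it is only C and D, its identity should be CD. If it is a mixture of A, B, C and/or D, then it is mixed. The percentage is (number of AB rows) / (total number of rows).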
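The rule can be written down as a small function. This is only a sketch of the rule as stated in the question; the function name `classify` is hypothetical, and it assumes a non-empty group whose letters are exactly `A`, `B`, `C` or `D`.

```python
def classify(letters):
    """Return (identity, proportion of A/B rows) for one non-empty group of letters."""
    present = set(letters)
    ab_rows = sum(1 for letter in letters if letter in ("A", "B"))
    proportion = ab_rows / len(letters)
    if present <= {"A", "B"}:      # only A and/or B
        identity = "AB"
    elif present <= {"C", "D"}:    # only C and/or D
        identity = "CD"
    else:                          # anything that crosses the two sets
        identity = "mixed"
    return identity, proportion


print(classify(["B", "B", "B"]))       # ('AB', 1.0)
print(classify(["A", "D", "B", "A"]))  # ('mixed', 0.75)
```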

Here is what the resulting DataFrame should look like:

>>> df 
    first_column second_column identity percent 
0    A    0   0   0 
1    B    1   AB  1.0 
2    B    1   AB  1.0 
3    B    1   AB  1.0 
4    C    0   0   0 
5    A    0   0   0 
6    A    0   0   0 
7    A    1  mixed  0.75 # 3/4, 3-AB, 4-total 
8    D    1  mixed  0.75 
9    B    1  mixed  0.75 
10   A    1  mixed  0.75 
11   A    0   0   0 
.... 

My initial thought was to try df.loc first:

# Boolean masks are combined with & and |; a Series cannot be tested in a plain `if`.
in_group = df.second_column == 1
df.loc[in_group & df.first_column.isin(["A", "B"]), "identity"] = "AB"
df.loc[in_group & df.first_column.isin(["C", "D"]), "identity"] = "CD"

But this does not account for mixtures, and it does not treat each isolated grouping separately.

+0

I don't understand how mixed is calculated - could you explain it in terms of a mathematical formula? – Edward

+0

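@Edward Sorry. If a group has only A or B, then 'identity' should be 'AB'. If it has only C or D, then 'identity' should be 'CD'. If it is a mixture of A, B, C and/or D, then it is mixed. The percentage is '(number of AB rows) / (total number of rows)' – ShanZhengYang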

Answers

4

Here is one approach.

Code:

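import pandas as pd
from collections import Counter

a_b = set('AB')
c_d = set('CD')

def get_id_percent(group):
    # Count how often each letter occurs in this run of rows.
    present = Counter(group['first_column'])
    present_set = set(present.keys())

    if group['second_column'].iloc[0] == 0:
        # Runs of 0s are not groups: no label and no proportion.
        ret_val = 0, 0
    elif present_set.issubset(a_b) and len(present_set) == 1:
        # Note: len(...) == 1 restricts this branch to single-letter runs.
        ret_val = 'AB', 0
    elif present_set.issubset(c_d) and len(present_set) == 1:
        ret_val = 'CD', 0
    else:
        # Everything else is 'mixed'; report the share of A/B rows.
        ret_val = 'mixed', float(present['A'] + present['B']) / len(group)

    # One (identity, percent) row per input row, so the result lines up with df.
    return pd.DataFrame(
        [ret_val] * len(group), columns=['identity', 'percent'])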

Test code:

data = {"first_column": ["A", "B", "B", "B", "C", "A", "A", 
         "A", "D", "B", "A", "A"], 
     "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]} 

df = pd.DataFrame(data) 

groupby = df.groupby((df.second_column != df.second_column.shift()).cumsum()) 

results = groupby.apply(get_id_percent).reset_index() 
results = results.drop(['second_column', 'level_1'], axis=1) 
df = pd.concat([df, results], axis=1) 
print(df) 

Results:

first_column second_column identity percent 
0    A    0  0  0.00 
1    B    1  AB  0.00 
2    B    1  AB  0.00 
3    B    1  AB  0.00 
4    C    0  0  0.00 
5    A    0  0  0.00 
6    A    0  0  0.00 
7    A    1 mixed  0.75 
8    D    1 mixed  0.75 
9    B    1 mixed  0.75 
10   A    1 mixed  0.75 
11   A    0  0  0.00 
+0

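Thanks! It works quite well, except for some of the 'percent' values. Some of them appear to be off, e.g. '1' when it should be '0.5', or '0.4' when it should be '0.6'. Is there a way to check/debug this? – ShanZhengYang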

+0

To debug, you could return more values than the current two columns, so you can see the values being used in the calculation. –
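A minimal sketch of that kind of check, assuming the same example data as in the question: tabulate the raw letter counts for each run of 1s, so the numerator and denominator of every proportion can be inspected directly. The names `run_id`, `ones` and `counts` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
    "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})

# Number each consecutive run of equal values in second_column.
run_id = (df["second_column"] != df["second_column"].shift()).cumsum().rename("run_id")

# Only the rows flagged with 1 belong to a group.
ones = df[df["second_column"] == 1]

# One row per run of 1s, one column per letter seen, holding raw counts.
counts = pd.crosstab(run_id[ones.index], ones["first_column"])
counts["total"] = counts.sum(axis=1)
print(counts)
```

Any group whose counts do not add up as expected, or that shows an unexpected extra column, points to the rows worth inspecting.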

+0

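Thanks. I ended up returning a column for each count before dividing to get the proportion, e.g. 'float(present["A"]), float(present["B"]), float(present["B"]), ...'. It appears that some 'CD' groups are being labelled as 'mixed'. Perhaps this is due to whitespace in the 'C' or 'D'? – ShanZhengYang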
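Two things could produce the symptoms described in these comments, and the sketch below addresses both; it is a guess, since the real data is not shown. First, the `len(present_set) == 1` check in `get_id_percent` only accepts groups made of a single letter, so a group containing both `C` and `D` falls through to `mixed`, and pure groups report `0` rather than a computed proportion. Second, stray whitespace such as `'B '` would also fail the subset test. The function name `label_group`, the sample data and the `.str.strip()` clean-up are all assumptions added for illustration.

```python
import pandas as pd

A_B = {"A", "B"}
C_D = {"C", "D"}

def label_group(letters):
    """Label one run of rows; `letters` holds the first_column values of that run."""
    present = set(letters)
    proportion = letters.isin(A_B).mean()  # share of A/B rows in the run
    if present <= A_B:
        return "AB", proportion
    if present <= C_D:
        return "CD", proportion
    return "mixed", proportion

# Hypothetical sample data: note the stray space in "B " and the C/D group.
df = pd.DataFrame({
    "first_column": ["A", "B ", "B", "A", "C", "C", "D", "A"],
    "second_column": [0, 1, 1, 1, 0, 1, 1, 0],
})

# Assumption: whitespace around the letters is noise and can be stripped.
df["first_column"] = df["first_column"].str.strip()

run_id = (df["second_column"] != df["second_column"].shift()).cumsum()
ones = df[df["second_column"] == 1]

identity = pd.Series(0, index=df.index, dtype=object)
percent = pd.Series(0.0, index=df.index)
for _, letters in ones.groupby(run_id[ones.index])["first_column"]:
    label, proportion = label_group(letters)
    identity.loc[letters.index] = label
    percent.loc[letters.index] = proportion

df["identity"] = identity
df["percent"] = percent
print(df)
```

Printing `df["first_column"].map(repr).unique()` before stripping is a quick way to confirm whether whitespace is actually present in the real data.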

1

Here's an approach:

import pandas as pd

# generate example data
data = {"first_column": ["A", "B", "B", "B", "C", "A", "A", "A", "D", "B", "A", "A"],
        "second_column": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]}
df = pd.DataFrame(data)

# these are intermediary columns for computation
df['group_type'] = None
df['ct'] = 0

def find_border(x):
    ''' finds and labels lettered groups (assumes the default 0..n-1 index) '''
    ix = x.name
    # rows with second_column == 0 are not part of any group
    if not x.second_column:
        return
    # if it's the start of a group...
    if ix == 0 or not df.loc[ix - 1, 'group_type']:
        df.loc[ix, 'group_type'] = x.first_column
        df.loc[ix, 'ct'] = 1
    # if it's the end of a group (the next row is 0, or there is no next row)
    elif ix == len(df) - 1 or not df.loc[ix + 1, 'second_column']:
        size = df.loc[ix - 1, 'ct'] + 1
        group_type = df.loc[ix - 1, 'group_type'] + x.first_column
        # write the full letter string back to every row of the group
        for i in range(size):
            df.loc[ix - i, 'group_type'] = group_type
        df.loc[ix, 'ct'] = 0
    # if it's the middle of a group
    else:
        df.loc[ix, 'ct'] = df.loc[ix - 1, 'ct'] + 1
        df.loc[ix, 'group_type'] = df.loc[ix - 1, 'group_type'] + x.first_column

# compute group membership (find_border updates df as a side effect)
_ = df.apply(find_border, axis='columns')

def determine_id(x):
    if not x:
        return '0'
    letters = set(x)
    if letters <= {'A', 'B'}:
        return 'AB'
    elif letters <= {'C', 'D'}:
        return 'CD'
    else:
        return 'mixed'

def determine_pct(x):
    if not x:
        return 0
    return sum(1 for letter in x if letter in ('A', 'B')) / float(len(x))

# determine row identity
df['identity'] = df.group_type.apply(determine_id)

# determine % of A or B in group
df['percent'] = df.group_type.apply(determine_pct)

print(df[['first_column', 'second_column', 'identity', 'percent']])

Output:

first_column second_column identity percent 
0    A    0  0  0.00 
1    B    1  AB  1.00 
2    B    1  AB  1.00 
3    B    1  AB  1.00 
4    C    0  0  0.00 
5    A    0  0  0.00 
6    A    0  0  0.00 
7    A    1 mixed  0.75 
8    D    1 mixed  0.75 
9    B    1 mixed  0.75 
10   A    1 mixed  0.75  
11   A    0  0  0.00 
+0

Thanks. Do you have a way to calculate the 'percent' column? – ShanZhengYang

+0

Sure, see my updated solution. –

+0

Note, though, that the 'percent' column actually holds a proportion, not a percentage. –