2016-06-08 71 views

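Generic inner product of a pandas.Series and the columns of a pandas.DataFrame

I am trying to build a function that computes the conditional Shannon entropy of the columns of a DataFrame. I give it the following arguments: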

import random
import pandas as pd

rows = 1000
columns = 3

data = pd.DataFrame([[random.randrange(0, 4, 1) for x in range(columns)] for y in range(rows)], columns=['a', 'b', 'c'])
target = ['a', 'b']
conditional = ['c']

So in this example I would be computing H(a|c) and H(b|c) at the same time. The code is below:

""" Split the data into groups according to 'c', then 
    compute the shannon entropy for each column within each group """ 

entropy = data.groupby(conditional)[target].apply(shannon) 
print("Entropy type", type(entropy), "\n",entropy.head(), "\n") 

""" Then compute a Series with the probability of each value of 'c' """ 
prob_condition = data.groupby(conditional)[target].apply(len)/len(data) 
print("Prob type", type(prob_condition), "\n",prob_condition.head(), "\n") 

""" Different ways to compute the mean entropy, weighted 
    by the probability of each occurrence in 'c' """ 
print("Method 1: \n", entropy.apply(lambda x: sum(x * prob_condition)), "\n")
print("Method 2: \n", entropy.apply(lambda x: prob_condition.dot(x)).head(),"\n")

This produces the output:

Entropy type <class 'pandas.core.frame.DataFrame'> 
      a   b 
c      
0 1.992605 1.984517 
1 1.987800 1.980181 
2 1.979485 1.994622 
3 1.990220 1.982847 

Prob type <class 'pandas.core.series.Series'> 
c 
0 0.251 
1 0.248 
2 0.264 
3 0.237 
dtype: float64 

Method 1: 
a 1.987384 
b 1.985713 
dtype: float64 

Method 2: 
a 1.987384 
b 1.985713 
dtype: float64 

Now, if my target is only 'a', I run into trouble:

target = ['a'] 

The output is:

Entropy type <class 'pandas.core.series.Series'> 
c 
0 1.992605 
1 1.987800 
2 1.979485 
3 1.990220 
dtype: float64 

Prob type <class 'pandas.core.series.Series'> 
c 
0 0.251 
1 0.248 
2 0.264 
3 0.237 
dtype: float64 

Method 1: 
c 
0 1.992605 
1 1.987800 
2 1.979485 
3 1.990220 
dtype: float64 

Traceback (most recent call last): 

    File "<ipython-input-100-d48372bac628>", line 1, in <module> 
    runfile('..../snippet.py', wdir='....') 

    File "..../anaconda3/lib/python3.5/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile 
    execfile(filename, namespace) 

    File "..../anaconda3/lib/python3.5/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile 
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace) 

    File "..../snippet.py", line 21, in <module> 
    print("Method 2: \n", entropy.apply(lambda x: prob_condition.dot(x)).head(),"\n") 

    File "..../anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 2237, in apply 
    mapped = lib.map_infer(values, f, convert=convert_dtype) 

    File "pandas/src/inference.pyx", line 1088, in pandas.lib.map_infer (pandas/lib.c:63043) 

    File "..../snippet.py", line 21, in <lambda> 
    print("Method 2: \n", entropy.apply(lambda x: prob_condition.dot(x)).head(),"\n") 

    File "..../anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 1451, in dot 
    if lvals.shape[0] != rvals.shape[0]: 

IndexError: tuple index out of range 

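The first method does not give me the right answer because, as I understand it, x * prob_condition computes the outer product of the two vectors, and I need the inner product. On the other hand, the .dot function fails miserably, even though I am feeding it two Series...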
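
For reference, a small sketch of how these products differ in pandas and NumPy. The toy vectors `p` and `h` are made up for illustration and are not from the question. As far as I know, `*` between two Series is an index-aligned elementwise product rather than an outer product, and `Series.apply` hands a lambda one scalar at a time, which would explain why `.dot` fails inside it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy vectors sharing the same index (not from the question).
p = pd.Series([0.5, 0.25, 0.25], index=[0, 1, 2])
h = pd.Series([2.0, 4.0, 8.0], index=[0, 1, 2])

elementwise = p * h        # Series of length 3, aligned on the index
inner = p.dot(h)           # scalar: sum of the elementwise products
outer = np.outer(p, h)     # 3x3 ndarray holding every p[i] * h[j]

# Series.apply calls the lambda once per element, so x is a scalar here.
seen = []
h.apply(lambda x: seen.append(np.ndim(x)) or x)

print(elementwise.tolist(), inner, outer.shape, set(seen))
```

A scalar has no `shape[0]`, so calling `prob_condition.dot(x)` on it has nothing to compare lengths against.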

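I am looking for a way to compute the inner product of each column of entropy with the Series prob_condition, one that works whether entropy is a Series (one column) or a DataFrame (many columns).

PS: You may ask why I don't just do H(a|c) = H(a,c) - H(c). The reason is that I want to time it, and I haven't coded the "joint" entropy yet. Besides, I wouldn't learn what you are about to teach me :)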
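
One possible way to get such a shape-agnostic inner product, sketched on made-up numbers: call `Series.dot` on the weights and pass the entropies as the argument, instead of going through `apply`. The variable names `entropy_frame` and `entropy_series` are hypothetical stand-ins for the two cases described above.

```python
import pandas as pd

# Hypothetical per-group entropies and group probabilities, indexed by 'c'.
idx = pd.Index([0, 1], name="c")
prob_condition = pd.Series([0.25, 0.75], index=idx)
entropy_frame = pd.DataFrame({"a": [2.0, 1.0], "b": [1.0, 3.0]}, index=idx)
entropy_series = entropy_frame["a"]

# Series.dot accepts either a Series or a DataFrame with a matching index.
weighted_frame = prob_condition.dot(entropy_frame)    # Series indexed by column
weighted_series = prob_condition.dot(entropy_series)  # plain scalar

# Equivalent spelling: scale each row by its weight, then sum down the columns.
alt_frame = entropy_frame.mul(prob_condition, axis=0).sum()
alt_series = entropy_series.mul(prob_condition).sum()

print(weighted_frame.to_dict(), weighted_series)
```

Either spelling returns one value per column for a DataFrame and a single number for a Series, so no type check is needed at the call site.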

**Edit:** I have added the whole shannon function so that the code is runnable:

import numpy as np
import pandas as pd

def shannon(data, conditional=None, target=None):
    """ if no target is specified, try to guess it """
    target = [target] if type(target)==str else target
    conditional = [conditional] if type(conditional)==str else conditional

    if target==None and type(data)!=pd.core.series.Series:
        target=list(set(data.keys())) if conditional == None else [var for var in list(set(data.keys())) if var not in conditional]

    """ if there are conditions, split data in groups and apply independently """
    if conditional!=None:
        entropy = data.groupby(conditional)[target].apply(shannon)
        print("Entropy type", type(entropy), "\n",entropy.head())
        prob_condition = data.groupby(conditional)[target].apply(len)/len(data)
        print("Prob type", type(prob_condition), "\n",prob_condition.head())
        cond_entropy = entropy.apply((lambda x: (x * prob_condition)))
        print(entropy.apply(lambda x: prob_condition.dot(x)).head())
        print(entropy.apply(lambda x: sum(x * prob_condition)).head())
        return cond_entropy if len(cond_entropy)>1 else cond_entropy[0]


    """ if data is a series compute right away """
    if type(data)==pd.core.series.Series:
        prob=data.value_counts()
        prob=prob/prob.sum()
        entropy= - sum([ (p * np.log(p)/np.log(2.0) if p>0 else 0) for p in prob])
        return entropy

    """ if there are no conditions but several columns, evaluate each column independently """
    entropy = data[target].apply(shannon,axis=0)
    return entropy if len(entropy)>1 else entropy[0]

Can you provide the 'shannon' function (or say where you got it, if it is part of some library)? Without it, your example cannot be reproduced. – BrenBarn

Well, this code is actually part of it... I will add the whole function –

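OK, that may be part of the problem. Your 'shannon' function is checking whether it is operating on a Series or a DataFrame. Have you considered using 'transform' instead of 'apply', so that your function is called on each column (rather than on the whole DataFrame)? I think that would lead to simpler handling, since you should always get back a DataFrame (just a one-column DataFrame if you only have one column in 'target'). – BrenBarn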

Answer

1

OK, I figured it out. Following @BrenBarn's suggestion, I traced where DataFrames and Series were being used.

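The problem I had in the case type(entropy)==Series (when there is only one column, target=['a']) was due to unexpected behavior of the apply function in the line entropy = data.groupby(conditional)[target].apply(shannon). When the Groupby is called with only one column, apply returns a Series, whereas the documentation says it will always return a DataFrame (and, by the way, it is not very explicit about it). That is the problem, because the subsequent call then passes single elements (the rows of a single column) to the inner product, which of course cannot be computed.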
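
A small sketch of what seems to be going on, using made-up data: the shape of the `GroupBy.apply` result depends on what the applied function returns for each group. A Series per group is stacked into a DataFrame, while a scalar per group collapses to a Series. The question's `shannon` unwraps a one-column result to a scalar (`entropy[0]`), which would produce exactly this switch. The helper names `per_column` and `unwrapped` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 1, 0], "b": [1, 1, 0, 0], "c": [0, 0, 1, 1]})

def per_column(group):
    # Returns a Series with one entry per column of the group.
    return group.nunique()

def unwrapped(group):
    # Mimics `entropy if len(entropy) > 1 else entropy[0]` from the question.
    result = group.nunique()
    return result if len(result) > 1 else result.iloc[0]

two_cols = df.groupby("c")[["a", "b"]].apply(unwrapped)   # Series per group
one_col = df.groupby("c")[["a"]].apply(unwrapped)         # scalar per group
one_col_kept = df.groupby("c")[["a"]].apply(per_column)   # Series per group

print(type(two_cols).__name__, type(one_col).__name__, type(one_col_kept).__name__)
```

Under this reading, keeping the per-group return value a Series would be another way to keep the result a DataFrame.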

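I replaced the Groupby.apply call with a Groupby.aggregate call, which behaves the same way and returns a DataFrame regardless of the number of columns. I must say I am a bit upset about the lack of documentation for the latter.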
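
To illustrate the replacement on made-up data: when the columns are selected with a list, `aggregate` with a scalar-returning function yields a DataFrame for one column as well as for several. The function `n_distinct` is a hypothetical stand-in for `shannon`.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 1, 0], "b": [1, 1, 0, 0], "c": [0, 0, 1, 1]})

def n_distinct(column):
    # Called with one column of one group; returns a scalar.
    return column.nunique()

one = df.groupby("c")[["a"]].aggregate(n_distinct)        # one selected column
two = df.groupby("c")[["a", "b"]].aggregate(n_distinct)   # two selected columns

print(type(one).__name__, one.shape, type(two).__name__, two.shape)
```

Because the function only ever sees a single column, it no longer needs to decide between a Series and a DataFrame input.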

For completeness, I am posting the whole function:

import numpy as np
import pandas as pd

def shannon(data, conditional=None, target=None):
    """Shannon entropy in bits of a Series or of DataFrame columns, optionally conditioned on other columns."""
    # Accept a single column name as well as a list of names.
    target = [target] if isinstance(target, str) else target
    conditional = [conditional] if isinstance(conditional, str) else conditional

    # If no target is specified, try to guess it: every column that is not a condition.
    if target is None and not isinstance(data, pd.Series):
        target = [var for var in data.columns if conditional is None or var not in conditional]

    # If there are conditions, split the data into groups and apply independently.
    if conditional is not None:
        # aggregate calls shannon on each column of each group and returns a DataFrame.
        entropy = data.groupby(conditional)[target].aggregate(shannon)
        prob_condition = data.groupby(conditional)[target].apply(len) / len(data)
        cond_entropy = entropy.apply(lambda x: sum(prob_condition * x))
        return cond_entropy if len(cond_entropy) > 1 else cond_entropy.iloc[0]

    # If data is a Series, compute right away.
    if isinstance(data, pd.Series):
        prob = data.value_counts()
        prob = prob / prob.sum()
        return -sum(p * np.log2(p) for p in prob if p > 0)

    # If there are no conditions but several columns, evaluate each column independently.
    entropy = data[target].apply(shannon, axis=0)
    return entropy if len(entropy) > 1 else entropy.iloc[0]
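
As a sanity check on the approach, here is a compact sketch that computes H(a|c) as the probability-weighted mean of per-group entropies and compares it with the chain rule H(a|c) = H(a,c) - H(c) mentioned in the question. The helpers `entropy_bits` and `conditional_entropy` are hypothetical names, not part of the function above, and the data is randomly generated.

```python
import numpy as np
import pandas as pd

def entropy_bits(values):
    # Shannon entropy (base 2) of the empirical distribution of `values`.
    p = pd.Series(values).value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(df, target, given):
    # H(target | given) = sum over groups of p(group) * H(target within group).
    weights = df[given].value_counts(normalize=True)
    per_group = df.groupby(given)[target].agg(entropy_bits)
    return float(per_group.mul(weights).sum())

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 4, size=(1000, 2)), columns=["a", "c"])

direct = conditional_entropy(df, "a", "c")
# Chain rule: the joint entropy is taken over (a, c) value pairs.
joint = entropy_bits(list(zip(df["a"], df["c"])))
via_chain_rule = joint - entropy_bits(df["c"])

print(direct, via_chain_rule)
```

The two numbers should agree up to floating-point rounding, since the chain rule holds exactly for empirical distributions.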