2017-02-23 61 views
1

Finding similar groups based on the intersection of values in a different column

I have a DataFrame that looks like this:

Group   Attribute

Cheese  Dairy
Cheese  Food
Cheese  Curd
Cow     Dairy
Cow     Food
Cow     Animal
Cow     Hair
Cow     Stomachs
Yogurt  Dairy
Yogurt  Food
Yogurt  Curd
Yogurt  Fruity

For each group, I want to find the group it is most similar to, based on the intersection of their attributes. The final form I want is:

Group   TotalCount  LikeGroup  CommonWords  PCT

Cheese  3           Yogurt     3            100.0
Cow     5           Cheese     2            40.0
Yogurt  4           Cheese     3            75.0

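I realize this may be asking a lot in one question. I can do much of it, but I am really lost on counting the attribute intersections, even between just one group and another. If I could find the number of attributes that Cheese and Yogurt have in common, that would point me in the right direction.

Is it possible to do this within a DataFrame? I can see making several lists, taking the intersection between all pairs of lists, and then using the lengths of the resulting lists to get the percentages.

For example, for Yogurt: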

>>> Yogurt = ['Dairy', 'Food', 'Curd', 'Fruity']
>>> Cheese = ['Dairy', 'Food', 'Curd']
>>> Cow = ['Dairy', 'Food', 'Animal', 'Hair', 'Stomachs']

>>> # float() keeps the division from truncating to 0 under Python 2
>>> Yogurt_Cheese = len(set(Yogurt) & set(Cheese)) / float(len(Yogurt))
>>> Yogurt_Cheese
0.75

>>> Yogurt_Cow = len(set(Yogurt) & set(Cow)) / float(len(Yogurt))
>>> Yogurt_Cow
0.5

>>> max(Yogurt_Cheese, Yogurt_Cow)
0.75
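The list-by-list idea above can be applied to the DataFrame directly by collecting each group's attributes into a set. The following is only a minimal sketch of that idea, not a full solution; the names `attr_sets` and `overlap` are made up for illustration.

```python
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({
    'Group': ['Cheese'] * 3 + ['Cow'] * 5 + ['Yogurt'] * 4,
    'Attribute': ['Dairy', 'Food', 'Curd',
                  'Dairy', 'Food', 'Animal', 'Hair', 'Stomachs',
                  'Dairy', 'Food', 'Curd', 'Fruity'],
})

# One set of attributes per group, e.g. attr_sets['Cheese'] == {'Dairy', 'Food', 'Curd'}.
attr_sets = {name: set(attrs) for name, attrs in df.groupby('Group')['Attribute']}


def overlap(group, other):
    """Share of `group`'s attributes that `other` also has (hypothetical helper)."""
    common = attr_sets[group] & attr_sets[other]
    # float() keeps the division exact under Python 2 as well.
    return len(common) / float(len(attr_sets[group]))


print(overlap('Yogurt', 'Cheese'))
print(overlap('Yogurt', 'Cow'))
```

Because the denominator is the size of the first group's set, `overlap` is not symmetric: `overlap('Cow', 'Cheese')` and `overlap('Cheese', 'Cow')` generally differ.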

Answers

3

I created my own, smaller version of your sample data.

import pandas as pd
from itertools import permutations

df = pd.DataFrame(
    data=[['cheese', 'dairy'], ['cheese', 'food'], ['cheese', 'curd'],
          ['cow', 'dairy'], ['cow', 'food'],
          ['yogurt', 'dairy'], ['yogurt', 'food'], ['yogurt', 'curd'], ['yogurt', 'fruity']],
    columns=['Group', 'Attribute'])

# Number of attributes per group (the TotalCount, used later) as a {group: count} dict.
# Selecting the 'Attribute' column first avoids the Python 2-only values()[0] indexing.
count_dct = df.groupby('Group')['Attribute'].count().to_dict()

unique_grp = df['Group'].unique()  # get the unique groups
unique_atr = df['Attribute'].unique()  # get the unique attributes

# All ordered pairs of distinct groups: one row per (Group, LikeGroup).
combos = list(permutations(unique_grp, 2))
comp_df = pd.DataFrame(data=combos, columns=['Group', 'LikeGroup'])  # holds the comparison data
comp_df['CommonWords'] = 0

for atr in unique_atr:
    # Keep only the rows that carry the attribute examined in this iteration.
    temp_df = df[df['Attribute'] == atr]

    # Ordered pairs of groups that have this attribute in common, as tuples.
    myl = list(permutations(temp_df['Group'], 2))
    for comb in myl:
        # Increment CommonWords on the row whose Group equals the first entry
        # of the tuple and whose LikeGroup equals the second entry.
        comp_df.loc[(comp_df['Group'] == comb[0]) & (comp_df['LikeGroup'] == comb[1]), 'CommonWords'] += 1

# Put the previously computed TotalCount into the comparison DataFrame.
for key, val in count_dct.items():  # items() works on both Python 2 and 3
    comp_df.loc[comp_df['Group'] == key, 'TotalCount'] = val

comp_df['PCT'] = (comp_df['CommonWords'] * 100.0 / comp_df['TotalCount']).round()

With my sample data, I got this output:

    Group LikeGroup  CommonWords  TotalCount  PCT
0  cheese       cow            2           3   67
1  cheese    yogurt            3           3  100
2     cow    cheese            2           2  100
3     cow    yogurt            2           2  100
4  yogurt    cheese            3           4   75
5  yogurt       cow            2           4   50

This appears to be correct.
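The table above lists every ordered pair of groups, while the question asks for only the best match per group. One possible way to reduce it, sketched here under the assumption that ties should simply go to the first row encountered, is `groupby(...).idxmax()`. The table is typed in by hand so the snippet runs on its own, and the names `best_rows` and `best` are illustrative.

```python
import pandas as pd

# The comparison table from the output above, entered by hand.
comp_df = pd.DataFrame(
    [['cheese', 'cow', 2, 3, 67.0],
     ['cheese', 'yogurt', 3, 3, 100.0],
     ['cow', 'cheese', 2, 2, 100.0],
     ['cow', 'yogurt', 2, 2, 100.0],
     ['yogurt', 'cheese', 3, 4, 75.0],
     ['yogurt', 'cow', 2, 4, 50.0]],
    columns=['Group', 'LikeGroup', 'CommonWords', 'TotalCount', 'PCT'])

# For each group, idxmax gives the index label of the row with the highest PCT.
# On a tie (as for 'cow' here) it picks the first such row.
best_rows = comp_df.groupby('Group')['PCT'].idxmax()

# Pull those rows out and order the columns as in the question's desired output.
best = comp_df.loc[best_rows, ['Group', 'TotalCount', 'LikeGroup', 'CommonWords', 'PCT']]
best = best.reset_index(drop=True)
print(best)
```

If ties need different handling (for example, keeping all tied groups), filtering on `PCT == groupby('Group')['PCT'].transform('max')` would be an alternative.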

+0

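This shows the percentage of common words for every pair of groups, but I can easily take it from here, and I think this may be even more useful than what I asked for. Thanks a lot. –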

+0

No problem. You should accept the answer in case someone else has a similar problem ;) – Nemo

1

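It seems like you should be able to craft an aggregation strategy to crack this. Try looking at these coding samples and think about how to construct keys and aggregate functions over your DataFrame, rather than trying to tackle it piecemeal as shown in your example.

Try running this in your Python environment (it was created in a Jupyter notebook using Python 2.7) and see if it gives you some ideas for your code: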

import numpy as np
import pandas as pd

np.random.seed(10)  # optional: makes sure you get the same random
                    # numbers used in the original experiment
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df  # in a notebook, this displays the DataFrame
group = df.groupby('key1')  # group on a single key
group2 = df.groupby(['key1', 'key2'])  # group on a pair of keys
group2.agg(['count', 'sum', 'min', 'max', 'mean', 'std'])
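One way to apply the aggregation hint to the question's data, offered as a sketch rather than as this answer's intended solution, is to build a group-by-attribute indicator matrix with `pd.crosstab` and multiply it by its transpose. Each entry of the product counts the attributes two groups share. It assumes every Group/Attribute pair appears at most once; the names `indicator`, `common`, and `summary` are illustrative.

```python
import pandas as pd

# The sample data from the question.
df = pd.DataFrame({
    'Group': ['Cheese'] * 3 + ['Cow'] * 5 + ['Yogurt'] * 4,
    'Attribute': ['Dairy', 'Food', 'Curd',
                  'Dairy', 'Food', 'Animal', 'Hair', 'Stomachs',
                  'Dairy', 'Food', 'Curd', 'Fruity'],
})

# Indicator matrix: one row per group, one column per attribute, 1 where the pair occurs.
indicator = pd.crosstab(df['Group'], df['Attribute'])

# Entry (g, h) is the number of attributes that groups g and h have in common.
common = indicator.dot(indicator.T)

# Number of attributes per group.
total = indicator.sum(axis=1)

rows = []
for group in common.index:
    others = common.loc[group].drop(group)  # shared counts with every other group
    like_group = others.idxmax()            # first group with the largest overlap
    rows.append({
        'Group': group,
        'TotalCount': int(total[group]),
        'LikeGroup': like_group,
        'CommonWords': int(others[like_group]),
        'PCT': 100.0 * others[like_group] / total[group],
    })

summary = pd.DataFrame(rows, columns=['Group', 'TotalCount', 'LikeGroup', 'CommonWords', 'PCT'])
print(summary)
```

The matrix product replaces the explicit loop over attributes; only the final selection of the best match per group is done row by row, which stays cheap as long as the number of groups is modest.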