2017-04-06 69 views
1

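Python: how to get the values that are in one dataframe but not the second

I'm working with two very similar dataframes, and I'm trying to figure out how to get the data that is in one but not the other, and vice versa.

Here is my code so far: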

import pandas as pd 
import numpy as np 


def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

old = pd.read_excel('File 1') 
new = pd.read_excel('File 2') 
old['version'] = 'old' 
new['version'] = 'new' 

full_set = pd.concat([old,new],ignore_index=True) 

changes = full_set.drop_duplicates(subset=['ID','Type', 'Total'], keep='last') 

duplicated = changes.duplicated(subset=['ID', 'Type'], keep=False) 

dupe_accts = changes[duplicated] 

change_new = dupe_accts[(dupe_accts['version'] == 'new')] 

change_old = dupe_accts[(dupe_accts['version'] == 'old')] 

change_new = change_new.drop(['version'], axis=1) 

change_old = change_old.drop(['version'],axis=1) 

change_new.set_index('Employee ID', inplace=True) 

change_old.set_index('Employee ID', inplace=True) 

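# Note: pd.Panel was deprecated and later removed from pandas, so this step needs an older release.
diff_panel = pd.Panel(dict(df1=change_old,df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)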

So the next step would be to get the data that is only in old and the data that is only in new.

My first attempt was:

changes['duplicate']=changes['Employee ID'].isin(dupe_accts) 
removed_accounts = changes[(changes['duplicate'] == False) & (changes['version'] =='old')] 
+0

Fixed the quoting in new = pd.read_excel('File 2'). –

+0

http://stackoverflow.com/questions/20225110/comparing-two-dataframes-and-getting-the-differences – Dadep

Answers

4

My head spun looking at your code!

IIUC:

Use merge with the parameter indicator=True.

Consider the dataframes old and new:

old = pd.DataFrame(dict(
     ID=[1, 2, 3, 4, 5], 
     Type=list('AAABB'), 
     Total=[9 for _ in range(5)], 
     ArbitraryColumn=['blah' for _ in range(5)] 
    )) 

new = old.head(2) 

Then merge and query for left_only:

old.merge(
    new, 'outer', on=['ID', 'Type'], 
    suffixes=['', '_'], indicator=True 
).query('_merge == "left_only"') 

  ArbitraryColumn  ID  Total Type ArbitraryColumn_  Total_     _merge
2            blah   3      9    A              NaN     NaN  left_only
3            blah   4      9    B              NaN     NaN  left_only
4            blah   5      9    B              NaN     NaN  left_only

We can reindex to restrict the result to the original columns:

old.merge(
    new, 'outer', on=['ID', 'Type'], 
    suffixes=['', '_'], indicator=True 
).query('_merge == "left_only"').reindex(columns=old.columns)

  ArbitraryColumn  ID  Total Type
2            blah   3      9    A
3            blah   4      9    B
4            blah   5      9    B
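
Since the question asks for both directions, here is a minimal sketch of the same indicator=True idea wrapped in a helper that can be called either way round. The helper name rows_only_in and the sample row with ID 6 are made up for illustration; the key columns ID and Type are taken from the question.

```python
import pandas as pd


def rows_only_in(left, right, keys):
    """Return the rows of `left` whose key combination never appears in `right`.

    Hypothetical helper, not part of pandas.
    """
    # Merge against the de-duplicated keys of `right` only, so every row of
    # `left` appears exactly once and no suffixed columns are created.
    flagged = left.merge(
        right[keys].drop_duplicates(), how="left", on=keys, indicator=True
    )
    # "left_only" marks rows that found no partner in `right`.
    return flagged.loc[flagged["_merge"] == "left_only", left.columns]


old = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Type": list("AAABB"),
    "Total": [9] * 5,
})
# Assumed sample data: shares IDs 1 and 2 with `old` and adds a new ID 6.
new = pd.DataFrame({
    "ID": [1, 2, 6],
    "Type": list("AAB"),
    "Total": [9, 9, 7],
})

keys = ["ID", "Type"]
only_in_old = rows_only_in(old, new, keys)  # removed accounts
only_in_new = rows_only_in(new, old, keys)  # added accounts
```

Note that merge builds a fresh integer index, so the row labels in the results do not correspond to the original index of old or new.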
+2

I'd upvote twice if I could, just for the first sentence :) – Vaishali

+0

@piRSquared Thank you! Exactly the simple solution I was looking for. –

+0

@A-Za-z btw, the folks on Meta lack a sense of humor https://meta.stackexchange.com/q/293438/326787 – piRSquared
