2017-03-15 60 views
3

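Pandas: merge if left column matches any of right columns

Is there a way to merge two DataFrames if a column of the left DataFrame matches any one of the columns of the right DataFrame: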

SELECT 
    t1.*, t2.* 
FROM 
    t1 
JOIN 
    t2 ON t1.c1 = t2.c1 OR 
     t1.c1 = t2.c2 OR 
     t1.c1 = t2.c3 OR 
     t1.c1 = t2.c4 

Python (something like):

import pandas as pd 

dataA = [(1), (2)] 

pdA = pd.DataFrame(dataA) 
pdA.columns = ['col'] 

dataB = [(1, None), (None, 2), (1, 2)] 

pdB = pd.DataFrame(dataB) 
pdB.columns = ['col1', 'col2'] 

# stack the two equi-joins (pd.concat replaces DataFrame.append, removed in pandas 2.0)
pd.concat([
    pdA.merge(pdB, left_on='col', right_on='col1'),
    pdA.merge(pdB, left_on='col', right_on='col2'),
])


+0

I assume the third DataFrame isn't exactly what you want. Can you mock up a DataFrame that is exactly what you want?

+0

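@PaulH Actually it is, if *ignore_index=True* and *.drop_duplicates()* are applied to eliminate the duplicate rows that occur when the left column value matches more than one right column value.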

+0

So what's the question? It seems you already have your answer.
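The de-duplication described in the comments can be sketched as follows. This is a minimal illustration, not the asker's exact code: the extra right-hand row `(2, 2)` is an assumption added here so that a duplicate actually occurs, and the name `merged` is hypothetical.

```python
import pandas as pd

pdA = pd.DataFrame({"col": [1, 2]})
# The last row (2, 2) is an assumed addition: it matches the left value 2
# in both right-hand columns, so it shows up once per equi-join.
pdB = pd.DataFrame(
    [(1, None), (None, 2), (1, 2), (2, 2)],
    columns=["col1", "col2"],
)

# One equi-join per right-hand column.
parts = [pdA.merge(pdB, left_on="col", right_on=c) for c in pdB.columns]

# Stack the partial results, renumber the rows, and drop exact duplicates.
merged = pd.concat(parts, ignore_index=True).drop_duplicates().reset_index(drop=True)
print(merged)
```

One caveat with this approach: `drop_duplicates()` also collapses rows that were already identical in the right DataFrame, which may or may not be what is wanted.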

Answers

0

It looks like we're doing a row-by-row isin check. I like to use set logic, with NumPy broadcasting to help.

import numpy as np

# one set of non-null values per row
f = lambda x: set(x.dropna())
npB = pdB.apply(f, axis=1).values
npA = pdA.apply(f, axis=1).values

a = npA <= npB[:, None] 
m, n = a.shape 

rA = np.tile(np.arange(n), m) 
rB = np.repeat(np.arange(m), n) 

a_ = a.ravel() 

pd.DataFrame(
    np.hstack([pdA.values[rA[a_]], pdB.values[rB[a_]]]), 
    columns=pdA.columns.tolist() + pdB.columns.tolist() 
) 

   col  col1  col2
0  1.0   1.0   NaN
1  2.0   NaN   2.0
2  1.0   1.0   2.0
3  2.0   1.0   2.0
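To make the broadcasting step easier to follow, here is a small sketch of just the comparison, using the question's data. The helper `row_sets` and the names `setsA`, `setsB`, `mask` and `pairs` are hypothetical, introduced only for this illustration; the sets are built with an explicit loop rather than `apply`, which is an assumption of this sketch and not how the answer does it.

```python
import numpy as np
import pandas as pd

pdA = pd.DataFrame({"col": [1, 2]})
pdB = pd.DataFrame([(1, None), (None, 2), (1, 2)], columns=["col1", "col2"])


def row_sets(df):
    """Return an object array holding one set of non-null values per row."""
    out = np.empty(len(df), dtype=object)
    for i, row in enumerate(df.itertuples(index=False)):
        out[i] = {v for v in row if pd.notna(v)}
    return out


setsA = row_sets(pdA)  # shape (2,)
setsB = row_sets(pdB)  # shape (3,)

# For Python sets, <= means "is a subset of". Adding an axis to setsB makes
# NumPy compare every (right row, left row) pair: mask[i, j] is True when
# the values of left row j are all found in right row i.
mask = (setsA <= setsB[:, None]).astype(bool)  # shape (3, 2)

# Each True entry is one (right row, left row) pair of the joined result.
pairs = np.argwhere(mask)
print(mask)
print(pairs)
```

Because the test is a subset check, it would also cover a left DataFrame with several columns, requiring every non-null left value to appear somewhere in the right row.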
0

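Unfortunately, I don't think there is a built-in way to do this. pandas joins are fairly limited: you can basically only test a left column for equality with a right column, unlike SQL, which is more general.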

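It is doable, though, by forming the cross product and then checking all the relevant conditions. The cross product has one row for every pair of left and right rows, so it costs some memory, but it shouldn't be too inefficient.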

Note that I changed your test cases slightly to make them more general, and renamed the variables to something more intuitive.

import pandas as pd 
from functools import reduce 

dataA = [1, 2] 

dfA = pd.DataFrame(dataA) 
dfA.columns = ['col'] 

dataB = [(1, None, 1), (None, 2, None), (1, 2, None)] 

dfB = pd.DataFrame(dataB) 
dfB.columns = ['col1', 'col2', 'col3'] 

print(dfA) 
print(dfB) 


def cross(left, right): 
    """Returns the cross product of the two dataframes, keeping the index of the left""" 

    # create dummy columns on the dataframes that will always match in the merge 
    left["_"] = 0 
    right["_"] = 0 

    # merge, keeping the left index, and dropping the dummy column 
    result = left.reset_index().merge(right, on="_").set_index("index").drop("_", axis=1) 

    # drop the dummy columns from the mutated dataframes 
    left.drop("_", axis=1, inplace=True) 
    right.drop("_", axis=1, inplace=True) 
    return result 


def merge_left_in_right(left_df, right_df): 
    """Return the join of the two dataframes where the element of the left dataframe's column 
    is in one of the right dataframe's columns""" 

    left_col, right_cols = left_df.columns[0], right_df.columns 

    result = cross(left_df, right_df) # form the cross product with a view to filtering it 

    # a row must satisfy one of the following conditions: 
    tests = (result[left_col] == result[right_col] for right_col in right_cols) 

    # form the disjunction of the conditions 
    left_in_right = reduce(lambda left_bools, right_bools: left_bools | right_bools, tests) 

    # return the appropriate rows 
    return result[left_in_right] 


print(merge_left_in_right(dfA, dfB))
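On pandas 1.2 or later, the same cross-product-then-filter idea can be written more compactly with `merge(how="cross")`. The following is a sketch under that version assumption, not part of the answer above: the function name `merge_left_in_any_right` is hypothetical, the two DataFrames are assumed to have no column names in common, and unlike `cross` above it does not keep the left index.

```python
import pandas as pd

dfA = pd.DataFrame({"col": [1, 2]})
dfB = pd.DataFrame(
    [(1, None, 1), (None, 2, None), (1, 2, None)],
    columns=["col1", "col2", "col3"],
)


def merge_left_in_any_right(left_df, right_df, left_col):
    """Keep the pairs of rows where left_col equals any right-hand column."""
    # Cartesian product of the two frames (requires pandas >= 1.2).
    crossed = left_df.merge(right_df, how="cross")

    # Compare the left column with every right-hand column, row by row.
    # NaN never compares equal, so missing values cannot produce a match.
    matches = crossed[right_df.columns].eq(crossed[left_col], axis=0).any(axis=1)

    return crossed[matches].reset_index(drop=True)


result = merge_left_in_any_right(dfA, dfB, "col")
print(result)
```

This version does not add or drop a dummy column, so the input DataFrames are never mutated.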