2016-09-13 120 views

Conditionally extract pandas rows based on another pandas DataFrame

I have two DataFrames:

df1:

col1  col2
1     2
1     3
2     4

df2:

col1 
2 
3 

I want to extract all rows of df1 where df1's col2 is not in df2's col1. In this case, that would be:

col1  col2
2     4

My first attempt:

df1[df1['col2'] not in df2['col1']] 

But it returns:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Then I tried:

df1[df1['col2'] not in df2['col1'].tolist] 

But that returns:

TypeError: argument of type 'instancemethod' is not iterable

Answers


You can use isin together with ~, which inverts the boolean mask:

print (df1['col2'].isin(df2['col1'])) 
0     True
1     True
2    False
Name: col2, dtype: bool

print (~df1['col2'].isin(df2['col1'])) 
0    False
1    False
2     True
Name: col2, dtype: bool

print (df1[~df1['col2'].isin(df2['col1'])]) 
   col1  col2
2     2     4
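
For reference, here is a minimal self-contained sketch of the approach above. It rebuilds the two sample frames from the question; the names mask and result are only illustrative. As far as I can tell, the original attempts fail because Python's `not in` is not element-wise: it asks the Series a single membership question (which checks the index and tries to hash the left-hand Series), and `.tolist` without parentheses is the method object itself rather than a list.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# isin() tests each element of df1.col2 against the values of df2.col1
# and returns a boolean Series; ~ flips it, so True marks the rows
# whose col2 value does NOT appear in df2.col1.
mask = ~df1["col2"].isin(df2["col1"])

# Boolean indexing keeps only the rows where the mask is True.
result = df1[mask]
print(result)
```

Only the row with index 2 survives, because 4 is the only col2 value missing from df2.col1.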

Timings:

In [8]: %timeit (df1.query('col2 not in @df2.col1'))
1000 loops, best of 3: 1.57 ms per loop

In [9]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1000 loops, best of 3: 466 µs per loop

Alternatively, use the .query() method:

In [9]: df1.query('col2 not in @df2.col1') 
Out[9]: 
   col1  col2
2     2     4
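
A self-contained version of the query() call, again on the sample data from the question. The @ prefix makes query() look up df2 as a Python variable in the calling scope rather than as a column of df1. The comparison against the isin() result is only there as a sanity check and is not part of the original answer.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# "col2" refers to a column of df1; "@df2.col1" refers to the Python
# variable df2 and then takes its col1 column.
via_query = df1.query("col2 not in @df2.col1")

# Sanity check: the boolean-mask approach should select the same rows.
via_isin = df1[~df1["col2"].isin(df2["col1"])]

print(via_query)
print(via_query.equals(via_isin))
```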

Timings for larger DataFrames:

In [44]: df1.shape 
Out[44]: (30000000, 2) 

In [45]: df2.shape 
Out[45]: (20000000, 1) 

In [46]: %timeit (df1[~df1['col2'].isin(df2['col1'])]) 
1 loop, best of 3: 5.56 s per loop 

In [47]: %timeit (df1.query('col2 not in @df2.col1')) 
1 loop, best of 3: 5.96 s per loop
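
If you want to repeat the comparison outside IPython, something along these lines should work. This is only a sketch: the frame sizes, value range, and random data are assumptions (much smaller than the 30M/20M rows above), and the absolute numbers will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data; sizes are placeholders, not the ones timed above.
rng = np.random.default_rng(0)
n1, n2 = 200_000, 100_000
df1 = pd.DataFrame({
    "col1": rng.integers(0, 1_000_000, n1),
    "col2": rng.integers(0, 1_000_000, n1),
})
df2 = pd.DataFrame({"col1": rng.integers(0, 1_000_000, n2)})


def with_isin():
    # Boolean mask built with isin() and inverted with ~.
    return df1[~df1["col2"].isin(df2["col1"])]


def with_query():
    # Bind the lookup values to a local name so @other resolves
    # inside this function's scope.
    other = df2["col1"]
    return df1.query("col2 not in @other")


# Take the best of a few repeats, similar in spirit to %timeit.
for fn in (with_isin, with_query):
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s")
```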