Matching IDs between pandas DataFrames and applying a function

df_A:

ID x  y 
a  0  0 
c  3  2 
b  2  5

df_B:

ID x  y 
a  2  1 
c  3  5 
b  1  2

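I want to add a column to df_B that holds, for each identifier, the Euclidean distance between the x, y coordinates in df_A and those in df_B. The desired result is: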

ID x  y dist 
a  2  1 2.236 
c  3  5 3 
b  1  2 3.162

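The identifiers will not necessarily be in the same order. I know how to do this by looping over the rows of df_A and finding the matching ID in df_B, but I would like to avoid a for loop because this will be used on data with tens of millions of rows. Is there a way to use apply, but have it match on ID?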

2016-12-24 Megan

Did any of the posted solutions work for you? – Divakar

If ID is not already the index, set it first.

df_B.set_index('ID', inplace=True) 
df_A.set_index('ID', inplace=True) 

df_B['dist'] = ((df_A - df_B) ** 2).sum(1) ** .5

Since the index and columns are then aligned, all that is left is the arithmetic. The next answer uses the sklearn.metrics.pairwise.paired_distances method instead.
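
A minimal sketch of the index alignment this answer relies on, assuming the sample data from the question with df_B's rows deliberately shuffled. The `cols` list is an illustrative addition (not part of the answer) so the subtraction only touches the coordinate columns:

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({"ID": ["a", "c", "b"], "x": [0, 3, 2], "y": [0, 2, 5]})
# Same IDs as df_A, but listed in a different order on purpose.
df_B = pd.DataFrame({"ID": ["b", "a", "c"], "x": [1, 2, 3], "y": [2, 1, 5]})

df_A = df_A.set_index("ID")
df_B = df_B.set_index("ID")

# Hypothetical helper list: keep the arithmetic on the coordinate columns only.
cols = ["x", "y"]

# Subtraction matches rows by ID label, not by position.
dist = ((df_A[cols] - df_B[cols]) ** 2).sum(axis=1) ** 0.5

# Column assignment also aligns the Series on df_B's index.
df_B["dist"] = dist
print(df_B)
```

Because both the subtraction and the assignment align on the ID labels, the row order of either frame should not affect which distance ends up next to which ID.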

2016-12-24 22:48:12 piRSquared

Nice solution! – MaxU

Solution:

In [73]: A 
Out[73]: 
    x y 
ID 
a 0 0 
c 3 2 
b 2 5 

In [74]: B 
Out[74]: 
    x y 
ID 
a 2 1 
c 3 5 
b 1 2 

In [75]: from sklearn.metrics.pairwise import paired_distances 

In [76]: B['dist'] = paired_distances(B, A) 

In [77]: B 
Out[77]: 
    x y  dist 
ID 
a 2 1 2.236068 
c 3 5 3.000000 
b 1 2 3.162278
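
One caveat worth sketching: paired_distances compares rows by position, so it gives the right answer here only because A and B list their IDs in the same order. If the order may differ, the rows of A can be reordered to match B first. The sketch below assumes every ID in B also appears in A, and uses np.linalg.norm as a stand-in for the row-wise distance so that it needs only NumPy and pandas. The name `A_aligned` is illustrative:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"x": [0, 3, 2], "y": [0, 2, 5]},
                 index=pd.Index(["a", "c", "b"], name="ID"))
# B lists the same IDs in a different order.
B = pd.DataFrame({"x": [3, 2, 1], "y": [5, 1, 2]},
                 index=pd.Index(["c", "a", "b"], name="ID"))

# Reorder A's rows so that row i of A_aligned has the same ID as row i of B.
A_aligned = A.reindex(B.index)

# Row-wise Euclidean distance on the position-matched arrays.
diff = B[["x", "y"]].to_numpy() - A_aligned[["x", "y"]].to_numpy()
B["dist"] = np.linalg.norm(diff, axis=1)
print(B)
```

The reordered frame could be passed to paired_distances in the same way, in place of A.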

2016-12-24 23:13:54 MaxU

Yay! I can vote again. – piRSquared

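For performance, you may want to work with NumPy arrays. For computing Euclidean distances between corresponding rows, np.einsum is very efficient.

Incorporating the step that puts the rows in matching order, here is an implementation -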

import numpy as np

# Get sorted row indices for dataframe-A 
sidx = df_A.index.argsort() 
idx = sidx[df_A.index.searchsorted(df_B.index,sorter=sidx)] 

# Sort A rows accordingly and get the elementwise differences against B 
s = df_A.values[idx] - df_B.values 

# Use einsum to square and sum each row and finally sqrt for distances 
df_B['dist'] = np.sqrt(np.einsum('ij,ij->i',s,s))

Sample input, output -

In [121]: df_A 
Out[121]: 
    0 1 
a 0 0 
c 3 2 
b 2 5 

In [122]: df_B 
Out[122]: 
    0 1 
c 3 5 
a 2 1 
b 1 2 

In [124]: df_B # After code run 
Out[124]: 
    0 1  dist 
c 3 5 3.000000 
a 2 1 2.236068 
b 1 2 3.162278
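
To make the index lookup in the implementation above easier to follow, here is a step-by-step sketch on the same sample data. It is an illustration under assumptions rather than the answer's exact code: it calls np.searchsorted on plain NumPy arrays of the labels, and it assumes A's IDs are unique and every ID in B exists in A (searchsorted returns an insertion position and does not check for a match). The names `a_ids`, `b_ids` and `pos` are illustrative:

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame([[0, 0], [3, 2], [2, 5]], index=["a", "c", "b"])
df_B = pd.DataFrame([[3, 5], [2, 1], [1, 2]], index=["c", "a", "b"])

# Plain NumPy arrays of the ID labels.
a_ids = np.array(df_A.index.tolist())
b_ids = np.array(df_B.index.tolist())

# Positions that would sort A's IDs.
sidx = a_ids.argsort()
# Where each of B's IDs falls within the sorted view of A's IDs.
pos = np.searchsorted(a_ids, b_ids, sorter=sidx)
# Map back to A's original row positions: idx[i] is the row of A matching row i of B.
idx = sidx[pos]

# Elementwise differences between matched rows.
s = df_A.to_numpy()[idx] - df_B.to_numpy()
# 'ij,ij->i' multiplies s by itself elementwise and sums over each row.
dist = np.sqrt(np.einsum("ij,ij->i", s, s))
print(idx, dist)
```

The einsum call computes the same row-wise sum of squares as `(s ** 2).sum(axis=1)`, without building the intermediate squared array.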

Here is a runtime test comparing einsum against a few other alternatives.

2016-12-25 07:48:18 Divakar
