2017-01-23 40 views
1

I have two dataframes and want to use string values from one to populate the other. The first is main_df:

| header_1 
0 | value_1 
1 | value_2 
2 | value_3 
3 | value_1 

and the second is a lookup dataframe, lookup_df:

| header_1 | header_2 
0 | value_1 | lookup_value_1 
1 | value_2 | lookup_value_2 
2 | value_3 | lookup_value_3 
3 | value_4 | lookup_value_4 

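The values in `main_df` are not unique. The values in `lookup_df` are unique.

I simply want to populate a new column in main_df with the corresponding lookup value from lookup_df.

I have tried various approaches, including .merge, .join, .map and .lookup: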

main_df = pd.merge(main_df, lookup_df, how='inner', on=['header_1']) 

The result I am looking for is:

| header_1 | header_2 
0 | value_1 | lookup_value_1 
1 | value_2 | lookup_value_2 
2 | value_3 | lookup_value_3 
3 | value_1 | lookup_value_1 
+0

I think you need `main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'])` – jezrael

+0

Maybe you want to do a left merge? `main_df = pd.merge(main_df, lookup_df, how='left', on=['header_1'])` – EdChum

+0

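@jezrael I tried that, but I get the error `InvalidIndexError: Reindexing only valid with uniquely valued Index objects`; it turns out my lookup values are not unique. – joshi123

Answers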

1

You can use map with a Series:

main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2']) 
print (main_df) 
    header_1  header_2 
0 value_1 lookup_value_1 
1 value_2 lookup_value_2 
2 value_3 lookup_value_3 
3 value_1 lookup_value_1 

Or, slightly faster, convert the Series with to_dict:

main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'] 
                 .to_dict()) 
print (main_df) 
    header_1  header_2 
0 value_1 lookup_value_1 
1 value_2 lookup_value_2 
2 value_3 lookup_value_3 
3 value_1 lookup_value_1 
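
For anyone who wants to try both variants without the original data, here is a minimal self-contained sketch. The row `value_9` is a hypothetical addition, not part of the question; it is only there to show that a key missing from the lookup ends up as NaN rather than raising.

```python
import pandas as pd

# 'value_9' is a hypothetical extra key with no match in lookup_df
main_df = pd.DataFrame(
    {'header_1': ['value_1', 'value_2', 'value_3', 'value_1', 'value_9']}
)
lookup_df = pd.DataFrame({
    'header_1': ['value_1', 'value_2', 'value_3', 'value_4'],
    'header_2': ['lookup_value_1', 'lookup_value_2',
                 'lookup_value_3', 'lookup_value_4'],
})

# Series indexed by the key column; map() looks each value up in that index
lookup = lookup_df.set_index('header_1')['header_2']

via_series = main_df['header_1'].map(lookup)
via_dict = main_df['header_1'].map(lookup.to_dict())

# Both variants give the same column; the unmatched key becomes NaN
main_df['header_2'] = via_series
print(main_df)
print(via_series.equals(via_dict))
```

If NaN is not acceptable for unmatched keys, chaining `.fillna(...)` after `map` is one option.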

Timings:

#[400000 rows x 1 columns] 
main_df = pd.concat([main_df]*100000).reset_index(drop=True) 

In [139]: %timeit pd.merge(main_df, lookup_df, how='left', on=['header_1']) 
10 loops, best of 3: 73.1 ms per loop 

In [140]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2']) 
10 loops, best of 3: 35.7 ms per loop 

In [141]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'].to_dict()) 
10 loops, best of 3: 35.1 ms per loop 
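
The numbers above come from an IPython session. As a rough sketch of how to repeat the comparison in a plain script, the standard-library `timeit` module can be used; the function names and the row count below are my own choices, and the results will vary with machine and pandas version.

```python
import timeit

import pandas as pd

# Smaller than the 400000 rows above so the script finishes quickly
main_df = pd.DataFrame(
    {'header_1': ['value_1', 'value_2', 'value_3', 'value_1'] * 10_000}
)
lookup_df = pd.DataFrame({
    'header_1': ['value_1', 'value_2', 'value_3', 'value_4'],
    'header_2': ['lookup_value_1', 'lookup_value_2',
                 'lookup_value_3', 'lookup_value_4'],
})


def with_merge():
    return pd.merge(main_df, lookup_df, how='left', on='header_1')


def with_map():
    return main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'])


def with_dict():
    return main_df['header_1'].map(
        lookup_df.set_index('header_1')['header_2'].to_dict()
    )


runs = 5
for fn in (with_merge, with_map, with_dict):
    # best of 3 repeats, each repeat calling the function `runs` times
    best = min(timeit.repeat(fn, number=runs, repeat=3)) / runs
    print(f'{fn.__name__}: {best * 1000:.2f} ms per call')
```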

EDIT:

You need unique values in the header_1 column of lookup_df; one possible solution is drop_duplicates:

print (lookup_df) 
    header_1  header_2 
0 value_1 lookup_value_1 
1 value_2 lookup_value_2 
2 value_3 lookup_value_3 
3 value_1 lookup_value_4 

#keep first value, default parameter keep='first' 
lookup_df_first = lookup_df.drop_duplicates(['header_1']) 
print (lookup_df_first) 
    header_1  header_2 
0 value_1 lookup_value_1 
1 value_2 lookup_value_2 
2 value_3 lookup_value_3 

#keep last value; taken from the original lookup_df so the duplicate is still present 
lookup_df_last = lookup_df.drop_duplicates(['header_1'], keep='last') 
print (lookup_df_last) 
    header_1  header_2 
1 value_2 lookup_value_2 
2 value_3 lookup_value_3 
3 value_1 lookup_value_4 
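
Before dropping rows it can help to look at which keys are actually duplicated, since `drop_duplicates` silently discards one of the conflicting lookup values. A minimal sketch, assuming the same column names as above and keeping the first occurrence:

```python
import pandas as pd

main_df = pd.DataFrame(
    {'header_1': ['value_1', 'value_2', 'value_3', 'value_1']}
)
# lookup table with a duplicated key, as in the EDIT above
lookup_df = pd.DataFrame({
    'header_1': ['value_1', 'value_2', 'value_3', 'value_1'],
    'header_2': ['lookup_value_1', 'lookup_value_2',
                 'lookup_value_3', 'lookup_value_4'],
})

# duplicated(keep=False) flags every row whose key occurs more than once
dupes = lookup_df[lookup_df['header_1'].duplicated(keep=False)]
print(dupes)

# Only deduplicate when needed; keep='first' is an arbitrary choice here
if not lookup_df['header_1'].is_unique:
    lookup_df = lookup_df.drop_duplicates('header_1', keep='first')

main_df['header_2'] = main_df['header_1'].map(
    lookup_df.set_index('header_1')['header_2']
)
print(main_df)
```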
+0

I edited the answer and created a unique `lookup_df`; please check it. – jezrael

+0

Tested the `drop_duplicates` code and it works, thanks very much – joshi123

0

You can do the merge without the `how` keyword (it defaults to `how='inner'`). Like this:

import pandas as pd

main_df = pd.DataFrame([{'header_1': 'value_1'},{'header_1': 'value_2'},{'header_1': 'value_3'},{'header_1': 'value_1'}]) 

lookup_df = pd.DataFrame([{'header_1':'value_1', 'header_2':'lookup_value_1'}, {'header_1':'value_2', 'header_2':'lookup_value_2'}, {'header_1':'value_3', 'header_2':'lookup_value_3'}, {'header_1':'value_4', 'header_2':'lookup_value_4'}]) 

main_df = pd.merge(main_df, lookup_df, on='header_1') 

The output is:

header_1  header_2 
0 value_1 lookup_value_1 
1 value_1 lookup_value_1 
2 value_2 lookup_value_2 
3 value_3 lookup_value_3
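
For comparison, here is a minimal sketch of the left merge suggested in the comments. A left merge keeps every row of main_df in its original order, which matches the layout the question asks for. The `validate='many_to_one'` argument is my addition, not part of the thread; it makes pandas raise a `MergeError` if lookup_df contains duplicate keys instead of silently multiplying rows.

```python
import pandas as pd

main_df = pd.DataFrame(
    {'header_1': ['value_1', 'value_2', 'value_3', 'value_1']}
)
lookup_df = pd.DataFrame({
    'header_1': ['value_1', 'value_2', 'value_3', 'value_4'],
    'header_2': ['lookup_value_1', 'lookup_value_2',
                 'lookup_value_3', 'lookup_value_4'],
})

# how='left' preserves the row order of main_df;
# validate='many_to_one' checks that the keys in lookup_df are unique
result = pd.merge(
    main_df, lookup_df,
    how='left', on='header_1', validate='many_to_one',
)
print(result)
```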