2017-05-18 45 views
1

Pandas DataFrame: returning a datetime column in an np.where statement

My initial DataFrame (df):

 column1  column2 column3 column4 
0 criteria_1 criteria_a 1/5/2017  5 
1 criteria_1 criteria_b 2/3/2017  3 
2 criteria_1 criteria_a 1/10/2017  10 
3 criteria_1 criteria_b 2/7/2017  7 
4 criteria_1 criteria_b 2/11/2017  11 
5 criteria_1 criteria_a 1/13/2017  13  

My code:

import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/Desktop/maxtest.csv")
df['column3'] = pd.to_datetime(df['column3'])
df['max_column3'] = df.groupby(['column1','column2'])['column3'].transform(max)
df['max_column4'] = df.groupby(['column1','column2'])['column4'].transform(max)
df['test'] = np.where(df['column3'] < df['max_column3'],df['column3'],df['max_column4'])

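Problem:

I create a df['test'] column and want np.where to return df['column3'] when the condition is True. When I try this, I get a "TypeError: invalid type promotion" error.

I'm not entirely sure what is causing the error.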

+2

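I think the problem is that you're mixing types in the result of np.where: sometimes it returns a datetime, other times a str or int. A pandas DataFrame column and a NumPy ndarray each need a single dtype. I was able to get rid of the error by using .astype(str) on df.column3.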
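To illustrate the comment above, here is a minimal sketch of the dtype check that fails. The helper name `common_dtype` is made up for this example; it assumes column3 is datetime64[ns] (as produced by pd.to_datetime) and max_column4 is int64.

```python
import numpy as np


def common_dtype(a, b):
    """Return the dtype NumPy would promote a and b to, or None if there is none.

    Hypothetical helper for illustration only.
    """
    try:
        return np.promote_types(a, b)
    except TypeError:
        # NumPy raises TypeError (or a subclass of it) when two dtypes
        # have no common dtype to promote to.
        return None


datetime_dtype = np.dtype("datetime64[ns]")  # assumed dtype of column3
int_dtype = np.dtype("int64")                # assumed dtype of max_column4

# int64 and float64 can share one dtype, so mixing them is fine.
print(common_dtype(int_dtype, np.dtype("float64")))

# datetime64 and int64 cannot, which is why mixing them in one
# np.where result fails.
print(common_dtype(datetime_dtype, int_dtype))
```

Converting one branch to str (or both to object) sidesteps the problem because the result no longer has to be a datetime or integer array.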

Answers

0

See my comment on the question for the explanation.

df['column3'] = pd.to_datetime(df['column3']) 
df['max_column3'] = df.groupby(['column1','column2'])['column3'].transform(max) 
df['max_column4'] = df.groupby(['column1','column2'])['column4'].transform(max) 
df['test'] = np.where((df['column3'] < df['max_column3']),df.column3.astype(str),df.max_column4) 

Output:

 column1  column2 column3 column4 max_column3 max_column4 \ 
0 criteria_1 criteria_a 2017-01-05  5 2017-01-13   13 
1 criteria_1 criteria_b 2017-02-03  3 2017-02-11   11 
2 criteria_1 criteria_a 2017-01-10  10 2017-01-13   13 
3 criteria_1 criteria_b 2017-02-07  7 2017-02-11   11 
4 criteria_1 criteria_b 2017-02-11  11 2017-02-11   11 
5 criteria_1 criteria_a 2017-01-13  13 2017-01-13   13 

     test 
0 2017-01-05 
1 2017-02-03 
2 2017-01-10 
3 2017-02-07 
4   11 
5   13 
0

If you want to keep the datetime format, you can do this:

df['test'] = df.apply(lambda x: x.column3 if x.column3 < x.max_column3 else x.max_column4, axis=1) 

df 
Out[1291]: 
     column1  column2 column3 column4 max_column3 max_column4 \ 
0 criteria_1 criteria_a 2017-01-05  5 2017-01-13   13 
1 criteria_1 criteria_b 2017-02-03  3 2017-02-11   11 
2 criteria_1 criteria_a 2017-01-10  10 2017-01-13   13 
3 criteria_1 criteria_b 2017-02-07  7 2017-02-11   11 
4 criteria_1 criteria_b 2017-02-11  11 2017-02-11   11 
5 criteria_1 criteria_a 2017-01-13  13 2017-01-13   13 

        test 
0 2017-01-05 00:00:00 
1 2017-02-03 00:00:00 
2 2017-01-10 00:00:00 
3 2017-02-07 00:00:00 
4     11 
5     13 
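If the row-wise apply turns out to be slow on a larger frame, one possible variant is to stay with np.where but cast both branches to object first, so NumPy does not have to find a common dtype. This is only a sketch: the frame is rebuilt inline from the table in the question (the M/D/YYYY date format is an assumption), and the `is_earlier` and `keys` names are mine.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the date format is assumed to be M/D/YYYY.
df = pd.DataFrame({
    "column1": ["criteria_1"] * 6,
    "column2": ["criteria_a", "criteria_b", "criteria_a",
                "criteria_b", "criteria_b", "criteria_a"],
    "column3": pd.to_datetime(
        ["1/5/2017", "2/3/2017", "1/10/2017", "2/7/2017", "2/11/2017", "1/13/2017"],
        format="%m/%d/%Y",
    ),
    "column4": [5, 3, 10, 7, 11, 13],
})

keys = ["column1", "column2"]
df["max_column3"] = df.groupby(keys)["column3"].transform("max")
df["max_column4"] = df.groupby(keys)["column4"].transform("max")

# True where the row's date is earlier than the latest date in its group.
is_earlier = df["column3"] < df["max_column3"]

# Both branches are object dtype, so the result is an object array that
# holds Timestamp values in some rows and integers in others.
df["test"] = np.where(
    is_earlier,
    df["column3"].astype(object),
    df["max_column4"].astype(object),
)

print(df["test"])
```

The resulting column is object dtype either way, so datetime accessors such as .dt will not work on it as a whole.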
0

I ended up using a regular function:

import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/andre_000/Desktop/maxtest.csv")
df['column3'] = pd.to_datetime(df['column3'])
df['max_column3'] = df.groupby(['column1','column2'])['column3'].transform(max)
df['max_column4'] = df.groupby(['column1','column2'])['column4'].transform(max)


def func(row):
    # Keep the row's own date if it is earlier than its group's latest date;
    # otherwise fall back to the group's maximum of column4.
    if row['column3'] < row['max_column3']:
        return row['column3']
    else:
        return row['max_column4']


df = df.assign(test=df.apply(func, axis=1))
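For anyone who wants to run this without the CSV file, here is a rough self-contained version. The CSV text is my reconstruction from the table printed in the question (comma-separated, M/D/YYYY dates), so treat the layout as an assumption rather than the real file.

```python
import io

import pandas as pd

# Reconstructed from the table in the question; the real file may differ.
csv_text = """column1,column2,column3,column4
criteria_1,criteria_a,1/5/2017,5
criteria_1,criteria_b,2/3/2017,3
criteria_1,criteria_a,1/10/2017,10
criteria_1,criteria_b,2/7/2017,7
criteria_1,criteria_b,2/11/2017,11
criteria_1,criteria_a,1/13/2017,13
"""

df = pd.read_csv(io.StringIO(csv_text))
df["column3"] = pd.to_datetime(df["column3"], format="%m/%d/%Y")

keys = ["column1", "column2"]
df["max_column3"] = df.groupby(keys)["column3"].transform("max")
df["max_column4"] = df.groupby(keys)["column4"].transform("max")


def func(row):
    # Same logic as above: the date if it is before the group maximum,
    # otherwise the group's maximum of column4.
    if row["column3"] < row["max_column3"]:
        return row["column3"]
    return row["max_column4"]


df["test"] = df.apply(func, axis=1)

# Show which Python type each row of the mixed column ended up with.
print(df["test"].map(type).value_counts())
```

The last line makes it visible that the test column mixes timestamps and integers, which is why it has to be stored as object dtype.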