2016-02-29 320 views
2

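Find the time difference between two columns in a DataFrame

I am trying to find the time difference between two columns of the following frame: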

Test Date | Test Type | First Use Date


I defined the following function to compute the difference:

from datetime import datetime

def days_between(d1, d2):
    # Parse both date strings and return the absolute difference in days
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

It works fine, but it does not accept a Series as input, so I had to write a for loop that iterates over the index:

age_veh = [] 
for i in range(0, len(data_manufacturer)-1): 
    age_veh[i].append(days_between(data_manufacturer.iloc[i,0], data_manufacturer.iloc[i,4])) 

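However, it raises an error: IndexError: list index out of range

I don't know whether this is the right way to do it. Any explanation of what I am doing wrong, or an alternative solution, would be much appreciated. Please keep in mind that I have about 2 million rows.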

+2

Why don't you just convert the columns to datetime and then subtract them? `df['Test Date'] = pd.to_datetime(df['Test Date'])` and so on, then `df['Test Date'] - df['First Use Date']` will return a timedelta – EdChum

+0

That should do it, thanks! –

Answers

0

IIUC you can first convert the columns with to_datetime, subtract them, take abs, and then convert the timedelta to days:

print(df)
    id value  date1  date2 sum 
0 A 150 2014-04-08 2014-03-08 NaN 
1 B 100 2014-05-08 2014-02-08 NaN 
2 B 200 2014-01-08 2014-07-08 100 
3 A 200 2014-04-08 2014-03-08 NaN 
4 A 300 2014-06-08 2014-04-08 350 

df['date1'] = pd.to_datetime(df['date1']) 
df['date2'] = pd.to_datetime(df['date2']) 

df['diff'] = (df['date1'] - df['date2']).abs()/np.timedelta64(1, 'D') 
print(df)
    id value  date1  date2 sum diff 
0 A 150 2014-04-08 2014-03-08 NaN 31 
1 B 100 2014-05-08 2014-02-08 NaN 89 
2 B 200 2014-01-08 2014-07-08 100 181 
3 A 200 2014-04-08 2014-03-08 NaN 31 
4 A 300 2014-06-08 2014-04-08 350 61 
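Since the question mentions about 2 million rows and dates in the "%Y-%m-%d" layout, it may be worth passing that format to to_datetime explicitly so pandas does not have to infer it. The following is only a minimal sketch under that assumption; the two-row frame is a hypothetical stand-in for the real data, and whether it is noticeably faster depends on the pandas version:

```python
import pandas as pd

# Hypothetical two-row stand-in that reuses the question's column names.
df = pd.DataFrame({
    "Test Date": ["2011-02-05", "2012-02-05"],
    "First Use Date": ["2010-01-05", "2010-03-05"],
})

# Assumption: every value follows the "%Y-%m-%d" layout used in the question.
for col in ["Test Date", "First Use Date"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

# Absolute difference between the two columns, as whole days.
age_days = (df["Test Date"] - df["First Use Date"]).abs().dt.days
print(age_days.tolist())
```

If some values do not match the format, to_datetime raises an error by default, which at least makes bad rows visible early.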

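EDIT

I think that for larger DataFrames it is better to convert to days by dividing by np.timedelta64(1, 'D') instead of using dt.days, because it is faster:

I use EdChum's sample, only with len(df) = 4k: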

import io 
import pandas as pd 
import numpy as np 

t=u"""Test Date,Test Type,First Use Date 
2011-02-05,A,2010-01-05 
2012-02-05,A,2010-03-05 
2013-02-05,A,2010-06-05 
2014-02-05,A,2010-08-05""" 

df = pd.read_csv(io.StringIO(t)) 

df = pd.concat([df]*1000).reset_index(drop=True) 

df['Test Date'] = pd.to_datetime(df['Test Date']) 
df['First Use Date'] = pd.to_datetime(df['First Use Date']) 

print((df['Test Date'] - df['First Use Date']).abs().dt.days)

print((df['Test Date'] - df['First Use Date']).abs() / np.timedelta64(1, 'D'))

Timings

In [174]: %timeit (df['Test Date'] - df['First Use Date']).abs().dt.days 
10 loops, best of 3: 38.8 ms per loop 

In [175]: %timeit (df['Test Date'] - df['First Use Date']).abs()/np.timedelta64(1, 'D') 
1000 loops, best of 3: 1.62 ms per loop 
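Note that the two expressions are not completely interchangeable: dt.days returns only the whole-day component as an integer, while dividing by np.timedelta64(1, 'D') returns a float that keeps any fraction of a day. With date-only columns like the ones above the values agree, but with a time-of-day component they can differ. A small sketch, using made-up timestamps purely for illustration:

```python
import numpy as np
import pandas as pd

# Made-up timestamps; the second start value includes a time of day.
start = pd.Series(pd.to_datetime(["2014-03-08 00:00", "2014-03-08 12:00"]))
end = pd.Series(pd.to_datetime(["2014-04-08 00:00", "2014-04-08 00:00"]))

delta = (end - start).abs()

# Whole-day component only (integers).
whole_days = delta.dt.days

# Full duration expressed in days (floats, fraction preserved).
fractional_days = delta / np.timedelta64(1, "D")

print(whole_days.tolist())       # [31, 30]
print(fractional_days.tolist())  # [31.0, 30.5]
```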
2

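Convert the columns using to_datetime; then you can subtract the columns to produce a timedelta, call abs on it, and then call dt.days to get the total number of days, e.g.: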

In [119]: 
import io 
import pandas as pd 
t="""Test Date,Test Type,First Use Date 
2011-02-05,A,2010-01-05 
2012-02-05,A,2010-03-05 
2013-02-05,A,2010-06-05 
2014-02-05,A,2010-08-05""" 
df = pd.read_csv(io.StringIO(t)) 
df 
Out[119]: 
    Test Date Test Type First Use Date 
0 2011-02-05   A  2010-01-05 
1 2012-02-05   A  2010-03-05 
2 2013-02-05   A  2010-06-05 
3 2014-02-05   A  2010-08-05 

In [121]:  
df['Test Date'] = pd.to_datetime(df['Test Date']) 
df['First Use Date'] = pd.to_datetime(df['First Use Date']) 
df.info() 

<class 'pandas.core.frame.DataFrame'> 
Int64Index: 4 entries, 0 to 3 
Data columns (total 3 columns): 
Test Date   4 non-null datetime64[ns] 
Test Type   4 non-null object 
First Use Date 4 non-null datetime64[ns] 
dtypes: datetime64[ns](2), object(1) 
memory usage: 128.0+ bytes 

In [122]: 
df['days'] = (df['Test Date'] - df['First Use Date']).abs().dt.days 
df 

Out[122]: 
    Test Date Test Type First Use Date days 
0 2011-02-05   A  2010-01-05 396 
1 2012-02-05   A  2010-03-05 702 
2 2013-02-05   A  2010-06-05 976 
3 2014-02-05   A  2010-08-05 1280
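As for the IndexError in the question: age_veh starts out as an empty list, so age_veh[i] fails on the first iteration. append should be called on the list itself, and range(0, len(data_manufacturer)-1) would also skip the last row. Here is a rough sketch of the corrected loop next to the vectorized version. The small data_manufacturer frame is a hypothetical stand-in, and its column positions 0 and 2 are an assumption (the question uses positions 0 and 4 on the real data):

```python
from datetime import datetime

import pandas as pd


def days_between(d1, d2):
    # Same helper as in the question: absolute difference in days.
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)


# Hypothetical stand-in for the question's frame.
data_manufacturer = pd.DataFrame({
    "Test Date": ["2011-02-05", "2012-02-05"],
    "Test Type": ["A", "A"],
    "First Use Date": ["2010-01-05", "2010-03-05"],
})

# Corrected loop: append to the list itself and visit every row.
age_veh = []
for i in range(len(data_manufacturer)):
    age_veh.append(
        days_between(data_manufacturer.iloc[i, 0], data_manufacturer.iloc[i, 2])
    )

# Vectorized equivalent, usually preferable for millions of rows.
vectorized = (
    pd.to_datetime(data_manufacturer["Test Date"])
    - pd.to_datetime(data_manufacturer["First Use Date"])
).abs().dt.days

print(age_veh)              # [396, 702]
print(vectorized.tolist())  # [396, 702]
```

The loop is shown only to explain the error; for about 2 million rows the column subtraction avoids the per-row Python overhead.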