2015-11-04

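Python pandas: change a variable number of duplicate timestamps to unique

This is related to an earlier question, "Python pandas change duplicate timestamp to unique", hence the similar title.

The additional requirement is to handle a variable number of duplicates per second and to space them evenly between the second boundaries, i.e.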

.... 
2011/1/4 9:14:00 
2011/1/4 9:14:00 
2011/1/4 9:14:01 
2011/1/4 9:14:01 
2011/1/4 9:14:01 
2011/1/4 9:14:01 
2011/1/4 9:14:01 
2011/1/4 9:15:02 
2011/1/4 9:15:02 
2011/1/4 9:15:02 
2011/1/4 9:15:03 
.... 

should become

.... 
2011/1/4 9:14:00 
2011/1/4 9:14:00.500 
2011/1/4 9:14:01 
2011/1/4 9:14:01.200 
2011/1/4 9:14:01.400 
2011/1/4 9:14:01.600 
2011/1/4 9:14:01.800 
2011/1/4 9:15:02
2011/1/4 9:15:02.333
2011/1/4 9:15:02.666
2011/1/4 9:15:03
.... 

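I am stumped on how to cope with the varying number of duplicates.

I was thinking along the lines of groupby(), but could not get it to work. I would expect this to be a common use case already, so any help is much appreciated.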

Answer

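I converted the datetime column to timedeltas in milliseconds. The problem is that the numbers get too large, so first I shifted the dates back toward the epoch by subtracting 2011 - 1970 years. Then I computed the per-row differences and added them to the base time: df['one'] = df['one'] - df['new'] + df['timedelta']. Finally, the integer millisecond values are converted back to timedeltas, and the 2011 - 1970 years are added back at the end.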
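The per-group arithmetic can be checked in isolation before reading the full code. The sketch below is only an illustration and not part of the original answer; the helper name `offsets_ms` is hypothetical. For a group of n identical timestamps it assumes a step of int(1000 / n) milliseconds and gives the k-th row an offset of k steps, which is what "cumulative sum minus one step" works out to.

```python
def offsets_ms(n):
    """Millisecond offsets for n rows sharing one timestamp (hypothetical helper)."""
    # Truncated step, mirroring the float-to-integer cast used in the answer.
    step = int(1000 / n)
    # Cumulative sum of the step minus one step: 0, step, 2 * step, ...
    return [k * step for k in range(n)]


print(offsets_ms(2))  # [0, 500]
print(offsets_ms(5))  # [0, 200, 400, 600, 800]
print(offsets_ms(3))  # [0, 333, 666]
```

Because the step is truncated to whole milliseconds, a group of three ends at .666 rather than .667.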

#     time 
#0 2011-01-04 09:14:00 
#1 2011-01-04 09:14:00 
#2 2011-01-04 09:14:01 
#3 2011-01-04 09:14:01 
#4 2011-01-04 09:14:01 
#5 2011-01-04 09:14:01 
#6 2011-01-04 09:14:01 
#7 2011-01-04 09:15:02 
#8 2011-01-04 09:15:02 
#9 2011-01-04 09:15:02 
#10 2011-01-04 09:15:03 
#time datetime64[ns] 

import numpy as np
import pandas as pd

#shift the dates back by 2011 - 1970 years to keep the timedeltas small
df['time1'] = df['time'].apply(lambda x: x - pd.DateOffset(years=2011-1970))
#convert time to milliseconds since the epoch (as floats)
df['timedelta'] = (df['time1'] - pd.Timestamp(0))/np.timedelta64(1, 'ms')
df['one'] = 1
#per-group step as a fraction of a second: mean/sum of ones = 1/count
m = lambda x: (x.mean())/x.sum()
df['one'] = df.groupby('time')['one'].transform(m)
#cast the step to integer milliseconds
df['new'] = (df['one']*1000).astype(int)
#cumulative sum of the steps within each group
df['one'] = df.groupby('time')['new'].transform('cumsum')
#cumulative sum minus one step is the offset; add the base time in milliseconds
df['one'] = df['one'] - df['new'] + df['timedelta']
#convert milliseconds to timedelta
df['final'] = pd.to_timedelta(df['one'],unit='ms')
#turn the timedelta back into a datetime and add the removed years
df['final'] = df['final'].apply(lambda x: pd.Timestamp(0) + x + pd.DateOffset(years=2011-1970))
#remove unnecessary columns
df = df.drop(['time1', 'timedelta', 'one', 'new'], axis=1)
print(df)
#     time     final 
#0 2011-01-04 09:14:00 2011-01-04 09:14:00.000 
#1 2011-01-04 09:14:00 2011-01-04 09:14:00.500 
#2 2011-01-04 09:14:01 2011-01-04 09:14:01.000 
#3 2011-01-04 09:14:01 2011-01-04 09:14:01.200 
#4 2011-01-04 09:14:01 2011-01-04 09:14:01.400 
#5 2011-01-04 09:14:01 2011-01-04 09:14:01.600 
#6 2011-01-04 09:14:01 2011-01-04 09:14:01.800 
#7 2011-01-04 09:15:02 2011-01-04 09:15:02.000 
#8 2011-01-04 09:15:02 2011-01-04 09:15:02.333 
#9 2011-01-04 09:15:02 2011-01-04 09:15:02.666 
#10 2011-01-04 09:15:03 2011-01-04 09:15:03.000
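
The same idea can probably be written more compactly. The following is a minimal sketch and not the code from the answer above: it swaps the year shift and the cumulative sum for `groupby(...).cumcount()` and `value_counts()`. It assumes a default integer index, a datetime64 column named `time`, and that truncating the step to whole milliseconds is acceptable; the sample frame is rebuilt from the data in the question.

```python
import pandas as pd

# Sample data from the question (variable number of duplicates per second).
df = pd.DataFrame({"time": pd.to_datetime(
    ["2011-01-04 09:14:00"] * 2
    + ["2011-01-04 09:14:01"] * 5
    + ["2011-01-04 09:15:02"] * 3
    + ["2011-01-04 09:15:03"]
)})

# Position of each row within its group of identical timestamps: 0, 1, 2, ...
position = df.groupby("time").cumcount()
# Number of rows sharing each timestamp, aligned back to the rows.
size = df["time"].map(df["time"].value_counts())
# Integer step in milliseconds, truncated (1000 // 3 == 333).
step_ms = 1000 // size

# Spread the duplicates evenly inside their second.
df["final"] = df["time"] + pd.to_timedelta(position * step_ms, unit="ms")
print(df)
```

Adding the offset directly to the datetime column avoids converting whole timestamps to large millisecond counts, so no year shift is needed in this variant.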