2016-07-07 94 views

Getting ranges from a sparse datetime index

I have a pandas DataFrame like this one for each user in a large database.

[image: screenshot of the DataFrame]

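Each row is a period spanning [start_date, end_date], but sometimes two consecutive rows are actually the same period: the end_date equals the following start_date (underlined in red). Sometimes the periods even overlap by more than one day.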

I would like to get the "real periods" by combining the rows that correspond to the same period.

What I have tried

import pandas as pd

def split_range(name):
    df_user = de_201512_echant[de_201512_echant.name == name]
    # -- Create a daily index covering the whole observation window.
    #    It is kept as a DatetimeIndex (no .date) so that it has the same
    #    type as the index of each per-period frame merged below.
    t_date = pd.DataFrame(index=pd.date_range("2005-01-01", "2015-12-12"))
    for row in range(df_user.shape[0]):
        start_date = df_user.iloc[row].start_date
        end_date = df_user.iloc[row].end_date
        # -- Skip rows where either bound is missing
        if pd.notnull(start_date) and pd.notnull(end_date):
            # -- One indicator column per period: 1 on the days it covers
            t = pd.DataFrame(index=pd.date_range(start_date, end_date))
            t["period_%s" % row] = 1
            t_date = pd.merge(t_date, t, right_index=True, left_index=True, how="left")

    return t_date

This produces a DataFrame in which each column is a period (1 if the date falls within the range, NaN if not):

t_date 
Out[29]: 
      period_0 period_1 period_2 period_3 period_4 period_5 \ 
2005-01-01  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-02  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-03  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-04  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-05  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-06  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-07  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-08  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-09  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-10  NaN  NaN  NaN  NaN  NaN  NaN 
2005-01-11  NaN  NaN  NaN  NaN  NaN  NaN 

Then, if I sum over all the columns (periods), I get almost exactly what I want:

full_spell = t_date.sum(axis=1) 
full_spell.loc[full_spell == 1] 

Out[31]: 
2005-11-14 1.0 
2005-11-15 1.0 
2005-11-16 1.0 
2005-11-17 1.0 
2005-11-18 1.0 
2005-11-19 1.0 
2005-11-20 1.0 
2005-11-21 1.0 
2005-11-22 1.0 
2005-11-23 1.0 
2005-11-24 1.0 
2005-11-25 1.0 
2005-11-26 1.0 
2005-11-27 1.0 
2005-11-28 1.0 
2005-11-29 1.0 
2005-11-30 1.0 
2006-01-16 1.0 
2006-01-17 1.0 
2006-01-18 1.0 
2006-01-19 1.0 
2006-01-20 1.0 
2006-01-21 1.0 
2006-01-22 1.0 
2006-01-23 1.0 
2006-01-24 1.0 
2006-01-25 1.0 
2006-01-26 1.0 
2006-01-27 1.0 
2006-01-28 1.0 

2015-07-06 1.0 
2015-07-07 1.0 
2015-07-08 1.0 
2015-07-09 1.0 
2015-07-10 1.0 
2015-07-11 1.0 
2015-07-12 1.0 
2015-07-13 1.0 
2015-07-14 1.0 
2015-07-15 1.0 
2015-07-16 1.0 
2015-07-17 1.0 
2015-07-18 1.0 
2015-07-19 1.0 
2015-08-02 1.0 
2015-08-03 1.0 
2015-08-04 1.0 
2015-08-05 1.0 
2015-08-06 1.0 
2015-08-07 1.0 
2015-08-08 1.0 
2015-08-09 1.0 
2015-08-10 1.0 
2015-08-11 1.0 
2015-08-12 1.0 
2015-08-13 1.0 
2015-08-14 1.0 
2015-08-15 1.0 
2015-08-16 1.0 
2015-08-17 1.0 
dtype: float64 

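But I cannot find a way to split this sparse datetime index into its separate time ranges, so as to finally obtain my expected output: the original DataFrame containing the "real" time periods.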
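One possible way to do this split, given as a minimal sketch rather than a definitive solution: treat every gap of more than one day between consecutive dates as the start of a new run, number the runs with a cumulative sum, and take the first and last day of each run. The helper name `index_to_ranges` and the toy dates are assumptions made for illustration only.

```python
import pandas as pd

def index_to_ranges(days):
    """Collapse a sparse daily index into contiguous [start_date, end_date] runs.

    Hypothetical helper, not part of the original code.
    """
    s = pd.Series(pd.DatetimeIndex(days).unique().sort_values())
    # A new run starts wherever the gap to the previous day exceeds one day
    run_id = (s.diff() > pd.Timedelta(days=1)).cumsum()
    return (s.groupby(run_id)
             .agg(["min", "max"])
             .rename(columns={"min": "start_date", "max": "end_date"})
             .reset_index(drop=True))

# Toy data (assumed): two blocks of consecutive days separated by a gap
days = pd.date_range("2005-11-14", "2005-11-30").append(
    pd.date_range("2006-01-16", "2006-01-28"))
ranges = index_to_ranges(days)
print(ranges)
```

With the series above, this could be called on `full_spell[full_spell >= 1].index`; filtering with `>= 1` instead of `== 1` keeps the days on which two periods overlap, since those days sum to more than 1.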

This is probably not the most efficient approach, so if you have an alternative, please don't hesitate!

Answer


I found a more efficient way to do this by applying a function to each row:

import pandas as pd

def get_range(row):
    '''Return a DataFrame containing the daily range from "start_date"
    to "end_date".'''
    period = pd.date_range(row["start_date"], row["end_date"], freq="1D")
    return pd.DataFrame({"days_in_period": period})

# -- Rows with a missing bound cannot be expanded into a date range
df_valid = df.dropna(subset=["start_date", "end_date"])
# -- Apply get_range() to each row of the initial df and stack the results
t_all = pd.concat([get_range(row) for _, row in df_valid.iterrows()],
                  ignore_index=True)
# -- Drop overlapping dates
t_all.drop_duplicates(inplace=True)
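The snippet above stops at a list of distinct days, so the days still have to be grouped back into periods. As an alternative that avoids expanding every period into days, here is a minimal sketch of a direct interval merge: sort by start_date, track the latest end_date seen so far, and start a new period whenever a row begins after that date. The function name `merge_periods`, the `max_gap` parameter and the toy dates are assumptions made for illustration; only the `start_date` and `end_date` column names come from the question.

```python
import pandas as pd

def merge_periods(df, max_gap=pd.Timedelta(0)):
    """Merge rows whose [start_date, end_date] intervals touch or overlap.

    Hypothetical helper. max_gap is an assumed knob: pass
    pd.Timedelta(days=1) to also merge periods on adjacent days.
    """
    df = df.dropna(subset=["start_date", "end_date"]).sort_values("start_date")
    # Latest end date among all earlier rows (NaT for the first row)
    prev_end = df["end_date"].cummax().shift()
    # A new real period starts when a row begins after every earlier row ended
    period_id = (df["start_date"] > prev_end + max_gap).cumsum()
    return (df.groupby(period_id)
              .agg(start_date=("start_date", "min"),
                   end_date=("end_date", "max"))
              .reset_index(drop=True))

# Toy data (assumed): rows 0-1 share a boundary date, rows 2-3 overlap
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-07-06", "2015-07-12",
                                  "2015-08-02", "2015-08-10"]),
    "end_date": pd.to_datetime(["2015-07-12", "2015-07-19",
                                "2015-08-12", "2015-08-17"]),
})
merged = merge_periods(df)
print(merged)
```

Using the running maximum of end_date, rather than only the previous row's end_date, matters when one long period fully contains several shorter ones. For one DataFrame holding several users, the same function could presumably be run once per user after grouping on the `name` column.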