
Question: Python performance improvements and coding style

Assume the listing of securities on an index is given by the sparse table below.

identifier  from        thru
AAPL        1964-03-31  --
ABT         1999-01-03  2003-12-31
ABT         2005-12-31  --
AEP         1992-01-15  2017-08-31
KO          2014-12-31  --

ABT, for example, is on the index from 1999-01-03 until 2003-12-31, and then again from 2005-12-31 until today (-- indicates today). In between, it is not listed on the index.

How can I efficiently transform this sparse table into a dense table of the following form?

date        AAPL  ABT  AEP  KO
1964-03-31     1    0    0   0
1964-04-01     1    0    0   0
...          ...  ...  ...  ...
1999-01-03     1    1    1   0
1999-01-04     1    1    1   0
...          ...  ...  ...  ...
2003-12-31     1    1    1   0
2004-01-01     1    0    1   0
...          ...  ...  ...  ...
2017-09-04     1    1    0   1

In the section "My solution" you will find my approach to the problem. Unfortunately, the code seems to perform very badly: it takes roughly 22 seconds to process 1648 entries.

Since I am new to Python, I would like to know how to program such problems efficiently.

I do not expect anyone to solve my problem for me (unless you wish to do so). My main goal is to understand how to tackle this kind of problem efficiently in Python. I used pandas functionality to match the corresponding entries. Should I use numpy and indexing instead? Should I use other toolboxes? How can I improve performance?

If you are interested, please find my approach to the problem in the section below.

Thank you very much for your help.


My solution

I tried to solve the problem by looping over each row of the first table. In every iteration I build a boolean frame for that row's interval, with all elements set to True, and append it to a list. Finally, I pd.concat the list, unstack it, and reindex to obtain the resulting DataFrame.

import pandas as pd 
import numpy as np 

def get_ts_data(data, start_date, end_date, attribute=None, identifier=None, frequency=None):
    """
    Transform sparse table to dense table.

    Parameters
    ----------
    data: pd.DataFrame
        sparse table with minimal column specification ['identifier', 'from', 'thru']
    start_date: pd.Timestamp, str
        start date of the dense matrix
    end_date: pd.Timestamp, str
        end date of the dense matrix
    attribute: str
        column name of the value of the dense matrix
    identifier: str
        column name of the identifier
    frequency: str
        frequency of the dense matrix

    Returns
    -------
    pd.DataFrame
        dense table indexed by the requested date range, with one column
        (per attribute) for each identifier
    """

    if attribute is None:
        attribute = ['on_index']
    elif not isinstance(attribute, list):
        attribute = [attribute]

    if identifier is None:
        identifier = ['identifier']
    elif not isinstance(identifier, list):
        identifier = [identifier]

    if frequency is None:
        frequency = 'B'

    # work on a copy so the input frame is not modified
    data_mod = data.copy()
    data_mod['on_index'] = True

    # specify start date and check type
    if not isinstance(start_date, pd.Timestamp):
        start_date = pd.Timestamp(start_date)

    # specify end date and check type
    if not isinstance(end_date, pd.Timestamp):
        end_date = pd.Timestamp(end_date)

    # specify output date range
    date_range = pd.date_range(start_date, end_date, freq=frequency)

    # fill missing 'thru' dates, which indicate that the listing is valid until today
    missing = data_mod['thru'].isnull()
    data_mod.loc[missing, 'thru'] = data_mod.loc[missing, 'from'].apply(lambda d: max(d, end_date))

    # preallocate list of per-row frames
    frms = []

    # add a DataFrame with the row-specific date range to frms
    for index, row in data_mod.iterrows():
        # date range of this listing interval
        d_range = pd.date_range(row['from'], row['thru'], freq=frequency)

        # MultiIndex with date and identifier
        d_index = pd.MultiIndex.from_product([d_range] + [[x] for x in row[identifier]],
                                             names=['date'] + identifier)

        # add DataFrame with repeated values to the list
        frms.append(pd.DataFrame(data=np.repeat(row[attribute].values, d_index.size),
                                 index=d_index, columns=attribute))

    out_frame = pd.concat(frms)
    out_frame = out_frame.unstack(identifier)
    out_frame = out_frame.reindex(date_range)

    return out_frame

if __name__ == "__main__":
    data = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
                         'from': [pd.Timestamp('1964-03-31'),
                                  pd.Timestamp('1999-01-03'),
                                  pd.Timestamp('2005-12-31'),
                                  pd.Timestamp('1992-01-15'),
                                  pd.Timestamp('2014-12-31')],
                         'thru': [np.nan,
                                  pd.Timestamp('2003-12-31'),
                                  np.nan,
                                  pd.Timestamp('2017-08-31'),
                                  np.nan]})

    transformed_data = get_ts_data(data, start_date='1964-03-31', end_date='2017-09-04',
                                   attribute='on_index', identifier='identifier', frequency='B')
    print(transformed_data)
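
As an aside on the numpy question raised above: the per-row DataFrame construction can be avoided entirely by building the whole boolean matrix with one broadcast comparison. The sketch below is only illustrative, not the approach taken in the question or in the answer that follows; the function name dense_from_sparse and its signature are made up here, and it assumes the same 'identifier'/'from'/'thru' columns and business-day frequency used above.

import numpy as np
import pandas as pd

def dense_from_sparse(data, start_date, end_date, freq='B'):
    """Illustrative sketch: build the 0/1 matrix with broadcast comparisons."""
    dates = pd.date_range(start_date, end_date, freq=freq)

    # Treat a missing 'thru' as "listed until the end of the requested range".
    thru = data['thru'].fillna(pd.Timestamp(end_date))

    # dates[:, None] has shape (n_dates, 1) and the bounds have shape (n_rows,),
    # so the comparisons broadcast to an (n_dates, n_rows) boolean matrix.
    active = ((dates.values[:, None] >= data['from'].values) &
              (dates.values[:, None] <= thru.values))

    # One column per row of the sparse table, labelled by identifier ...
    wide = pd.DataFrame(active.astype(int), index=dates, columns=data['identifier'].values)

    # ... then merge duplicate identifiers (e.g. the two ABT rows) with a max.
    return wide.T.groupby(level=0).max().T

# e.g. dense = dense_from_sparse(data, '1964-03-31', '2017-09-04')

Because this performs a handful of vectorized operations instead of one pd.concat over ~1650 small frames, it should scale much better with the number of rows, though the exact speedup would have to be measured.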

Answer

import numpy as np
import pandas as pd

# `df` is the sparse table from the question; '--' marks an open-ended listing.

# Ensure dates are Pandas timestamps.
df['from'] = pd.DatetimeIndex(df['from'])
df['thru'] = pd.DatetimeIndex(df['thru'].replace('--', np.nan))

# Get sorted list of all unique dates and create index for full range. 
dates = sorted(set(df['from'].tolist() + df['thru'].dropna().tolist())) 
dti = pd.date_range(start=dates[0], end=dates[-1], freq='B')

# Create new target dataframe based on symbols and full date range. Initialize to zero. 
df2 = pd.DataFrame(0, columns=df['identifier'].unique(), index=dti) 

# Find all active symbols and set their symbols' values to one from their respective `from` dates. 
for _, row in df[df['thru'].isnull()].iterrows(): 
    df2.loc[df2.index >= row['from'], row['identifier']] = 1 

# Find all other symbols and set their symbols' values to one between their respective `from` and `thru` dates. 
for _, row in df[df['thru'].notnull()].iterrows(): 
    df2.loc[(df2.index >= row['from']) & (df2.index <= row['thru']), row['identifier']] = 1 

>>> df2.head(3) 
      AAPL ABT AEP KO 
1964-03-31  1 0 0 0 
1964-04-01  1 0 0 0 
1964-04-02  1 0 0 0 

>>> df2.tail(3) 
      AAPL ABT AEP KO 
2017-08-29  1 1 1 1 
2017-08-30  1 1 1 1 
2017-08-31  1 1 1 1 

>>> df2.loc[:'2004-01-02', 'ABT'].tail() 
2003-12-29 1 
2003-12-30 1 
2003-12-31 1 
2004-01-01 0 
2004-01-02 0 
Freq: B, Name: ABT, dtype: int64 

>>> df2.loc['2005-12-30':, 'ABT'].head(3) 
2005-12-30 0 
2006-01-02 1 
2006-01-03 1 
Freq: B, Name: ABT, dtype: int64 

Thanks @Alexander, I appreciate it. This is very neat. On my example, your solution is roughly 76 times faster. – quantguy
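
The comment does not say how the 76x figure was measured. A minimal way to make such a comparison yourself might look like the following; it uses time.perf_counter, the question's get_ts_data, its example `data` frame, and the illustrative dense_from_sparse sketch from above as a stand-in for whichever faster variant is being compared, and it times a single call rather than running a rigorous benchmark.

import time

t0 = time.perf_counter()
slow = get_ts_data(data, start_date='1964-03-31', end_date='2017-09-04')
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
fast = dense_from_sparse(data, '1964-03-31', '2017-09-04')
t_vec = time.perf_counter() - t0

print(f"loop-based: {t_loop:.3f}s, vectorized: {t_vec:.3f}s, "
      f"speedup: {t_loop / t_vec:.0f}x")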