2013-02-23 79 views

pandas - asof() by last value in a DataFrame

Say we build a DataFrame:

import pandas as pd 
import random as randy 
import numpy as np 
df_size = int(1e6) 
# np.repeat coerces each mixed list to a string array, so np.nan ends up as the string 'nan'
df = pd.DataFrame({'first':  randy.sample(list(np.repeat([np.nan,'Cat','Dog','Bear','Fish'],df_size)),df_size), 
       'second': randy.sample(list(np.repeat([np.nan,np.nan,'Cat','Dog'],df_size)),df_size), 
       'value': range(df_size)}, 
       index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',periods=df_size,freq='us')),df_size)).sort_index() 

It looks like this:

      first second value 
2013-02-01 09:00:00   Fish Cat  95409 
2013-02-01 09:00:00.000001 Dog  Dog  323089 
2013-02-01 09:00:00.000002 Fish Cat  785925 
2013-02-01 09:00:00.000003 Dog  Cat  866171 
2013-02-01 09:00:00.000004 nan  nan  665702 
2013-02-01 09:00:00.000005 Cat  nan  104257 
2013-02-01 09:00:00.000006 nan  nan  152926 
2013-02-01 09:00:00.000007 Bear Cat  707747 

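What I'd like is, for each entry in the 'second' column, the most recent 'value' recorded for that same category in the 'first' column: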

      first second value new_value 
2013-02-01 09:00:00   Fish  Cat  95409 NaN 
2013-02-01 09:00:00.000001 Dog  Dog  323089 323089 
2013-02-01 09:00:00.000002 Fish Cat  785925 NaN 
2013-02-01 09:00:00.000003 Dog  Cat  866171 NaN 
2013-02-01 09:00:00.000004 nan  nan  665702 NaN 
2013-02-01 09:00:00.000005 Cat  nan  104257 NaN 
2013-02-01 09:00:00.000006 nan  nan  152926 NaN 
2013-02-01 09:00:00.000007 Bear Cat  707747 104257 

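This may not be the best possible example, but look at the bottom row: when 'second' is 'Cat', I want the most recent value from a row where 'first' was 'Cat'.

The real data set has more than 1000 categories, so looping over the symbols and calling asof() for each one looks too expensive. I've never had any luck passing strings into Cython, but I suppose mapping the symbols to integers and running a brute-force loop would work. I was hoping for something more Pythonic. (The brute-force approach is still reasonably fast.)

For reference, my somewhat brittle Cython hack is: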

%%cython 
import numpy as np 
import sys 
cimport cython 
cimport numpy as np 

ctypedef np.double_t DTYPE_t 

def last_of(np.ndarray[DTYPE_t, ndim=1] some_values, np.ndarray[np.int64_t, ndim=1] first_sym, np.ndarray[np.int64_t, ndim=1] second_sym): 
    # Symbols arrive as integer codes; a negative code means "missing".
    cdef long val_len = some_values.shape[0], sym1_len = first_sym.shape[0], sym2_len = second_sym.shape[0], i = 0 
    assert sym1_len == sym2_len 
    assert val_len == sym1_len 
    cdef long enum_space_size = max(first_sym) + 1 

    # last_values[c] is the latest value seen with code c in first_sym (NaN until seen).
    cdef np.ndarray[DTYPE_t, ndim=1] last_values = np.full(enum_space_size, np.nan, dtype=np.double) 
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.full(val_len, np.nan, dtype=np.double) 
    for i in range(val_len): 
        if first_sym[i] >= 0: 
            last_values[first_sym[i]] = some_values[i] 
        # Missing or out-of-range codes in second_sym leave res[i] as NaN.
        if second_sym[i] >= 0 and second_sym[i] < enum_space_size: 
            res[i] = last_values[second_sym[i]] 
    return res 

Then some dictionary-replacement nonsense:

syms = np.unique(df['first'].values) 
enum_dict = dict(zip(syms, range(len(syms)))) 
enum_dict['nan'] = -1  # missing entries are the string 'nan' in this DataFrame
df['enum_first'] = df['first'].replace(enum_dict) 
df['enum_second'] = df['second'].replace(enum_dict) 
df['last_value'] = last_of(df.value.values*1.0, df.enum_first.values.astype(np.int64), df.enum_second.values.astype(np.int64)) 

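The problem with this is that it breaks if the 'second' column contains any value that never appears in 'first'. (I don't know a fast way to fix that... say, if you added 'Donkey' to 'second'.)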
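One possible way around the 'Donkey' case is to build the integer codes from both columns at once, so every label gets a code whether or not it appears in 'first'. The sketch below is only an illustration under assumptions: `encode_shared` is a hypothetical helper name, and missing entries are assumed to be either real NaN/None or the string 'nan'. It relies on `pd.factorize`, which assigns -1 to missing values.

```python
import pandas as pd

def encode_shared(first, second, missing='nan'):
    """Encode two label columns with one shared set of integer codes.

    Hypothetical helper: missing entries (NaN/None or the `missing`
    string) get code -1, every other label gets a non-negative code.
    """
    both = pd.concat([first, second], ignore_index=True)
    both = both.where(both != missing)      # turn 'nan' strings into real NaN
    codes, uniques = pd.factorize(both)     # NaN is encoded as -1
    n = len(first)
    return codes[:n], codes[n:], uniques

# Small check: 'Donkey' only occurs in the second column.
first = pd.Series(['Fish', 'Dog', 'nan', 'Cat'])
second = pd.Series(['Cat', 'Donkey', 'nan', 'Dog'])
first_codes, second_codes, labels = encode_shared(first, second)
print(first_codes, second_codes, list(labels))
```

With codes built this way, a label that never occurs in 'first' simply has no recorded last value, so a lookup loop would leave its result as NaN instead of failing.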

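For 10 million rows, the dumb Cython-ish version takes about 21 seconds for the whole mess, but only ~2 seconds of that is the Cython part. (It could probably be made a decent amount faster.)

@HYRY - I think this is a very solid solution; on a 10-million-row DataFrame on my laptop it takes about 30 seconds.

Since I don't know of an easy way to handle the case where the 'second' column has entries that are not in 'first', other than a fairly expensive isin, I think HYRY's Python version is quite good.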

Answers


How about using a dict to keep the last value of each category while iterating over all the rows of the DataFrame:

import pandas as pd 
import random as randy 
import numpy as np 
randy.seed(1)  # seed the random module, which is what randy.sample actually uses
df_size = int(1e2) 
df = pd.DataFrame({'first':  randy.sample(list(np.repeat([None,'Cat','Dog','Bear','Fish'],df_size)),df_size), 
       'second': randy.sample(list(np.repeat([None,None,'Cat','Dog'],df_size)),df_size), 
       'value': range(df_size)}, 
       index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',periods=int(1e6),freq='us')),df_size)).sort_index() 

last_values = {} 
new_values = [] 
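for row in df.itertuples(): 
    t, f, s, v = row  # index (timestamp), first, second, value
    last_values[f] = v  # remember the latest value seen for this 'first' category
    if s is None: 
        new_values.append(None) 
    else: 
        # the update above runs first, so a row can match its own value
        new_values.append(last_values.get(s, None)) 
df["new_value"] = new_values 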

The result is:

      first second value new_value 
2013-02-01 09:00:00.010373 Cat None  87  None 
2013-02-01 09:00:00.013015 Cat Dog  69  None 
2013-02-01 09:00:00.024910 Fish Cat  1  69 
2013-02-01 09:00:00.025943 Cat None  98  None 
2013-02-01 09:00:00.041318 Fish Dog  66  None 
2013-02-01 09:00:00.057894 None None  36  None 
2013-02-01 09:00:00.059678 None None  50  None 
2013-02-01 09:00:00.067228 Bear None  38  None 
2013-02-01 09:00:00.095867 Bear Cat  84  98 
2013-02-01 09:00:00.096867 Dog Cat  97  98 
2013-02-01 09:00:00.101540 Dog Dog  76  76 
2013-02-01 09:00:00.106753 Dog None  22  None 
2013-02-01 09:00:00.138936 None None  8  None 
2013-02-01 09:00:00.139273 Bear Cat  2  98 
2013-02-01 09:00:00.143180 Fish None  94  None 
2013-02-01 09:00:00.184757 None Cat  73  98 
2013-02-01 09:00:00.193063 None None  5  None 
2013-02-01 09:00:00.231056 Fish Cat  62  98 
2013-02-01 09:00:00.237658 None None  64  None 
2013-02-01 09:00:00.240178 Bear Dog  80  22 
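As a sanity check, the dictionary loop above can be wrapped in a function and run against the eight-row example from the question. This is a minimal sketch, not a tuned implementation: `last_value_by_category` and its `missing_label` parameter are hypothetical names, and the function treats both real NaN/None and the string 'nan' as missing.

```python
import pandas as pd

def last_value_by_category(df, missing_label='nan'):
    """For each row, return the latest 'value' whose 'first' equals this row's 'second'."""
    last_values = {}
    new_values = []
    for f, s, v in zip(df['first'], df['second'], df['value']):
        # Record the value only when 'first' holds a real label.
        if not (pd.isna(f) or f == missing_label):
            last_values[f] = v
        # Look up the latest value recorded for this row's 'second' label.
        if pd.isna(s) or s == missing_label:
            new_values.append(None)
        else:
            new_values.append(last_values.get(s))
    return pd.Series(new_values, index=df.index, dtype='float64')

# The eight rows shown in the question ('nan' is a string there).
index = pd.date_range('2013-02-01 09:00:00', periods=8, freq='us')
example = pd.DataFrame({
    'first':  ['Fish', 'Dog', 'Fish', 'Dog', 'nan', 'Cat', 'nan', 'Bear'],
    'second': ['Cat', 'Dog', 'Cat', 'Cat', 'nan', 'nan', 'nan', 'Cat'],
    'value':  [95409, 323089, 785925, 866171, 665702, 104257, 152926, 707747],
}, index=index)
example['new_value'] = last_value_by_category(example)
print(example)
```

Only the second row (Dog matches its own row) and the last row (Cat, last seen three rows earlier) receive a value; every other row stays NaN, as in the desired output in the question.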

This is quite good - it really is a lot faster than I expected. Thanks! – radikalus 2013-02-24 16:57:32


Old question, but I know of a solution that avoids any Python loop. The first step is to get a time series of 'value' for each category. You can do this by unstacking:

first_values = df.dropna(subset=['first']).set_index('first', append=True).value.unstack()  
second_values = df.dropna(subset=['second']).set_index('second', append=True).value.unstack() 

Note that this only works if the columns contain real NaN values rather than 'nan' strings (do df = df.replace('nan', np.nan) first if necessary).
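To make the unstacking step concrete, here is a small sketch on the eight-row example from the question, assuming missing entries are real missing values (None here). The name `filled` is introduced only for illustration. Each category of 'first' becomes a column, and forward-filling carries each category's latest value down to later timestamps.

```python
import pandas as pd

index = pd.date_range('2013-02-01 09:00:00', periods=8, freq='us')
df = pd.DataFrame({
    'first': ['Fish', 'Dog', 'Fish', 'Dog', None, 'Cat', None, 'Bear'],
    'value': [95409, 323089, 785925, 866171, 665702, 104257, 152926, 707747],
}, index=index)

# One column per category; rows are the timestamps where 'first' is present.
first_values = (df.dropna(subset=['first'])
                  .set_index('first', append=True)['value']
                  .unstack())

# Forward-fill so each row shows the latest value seen so far per category.
filled = first_values.ffill()
print(filled)
```

In `filled`, the 'Cat' column stays NaN until the row where 'first' is 'Cat' and holds 104257 from then on, which is the value the last row of the question's example should pick up.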

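Then you can get the last 'first' value by forward-filling first_values, reindexing it like second_values, stacking again, and indexing into the result with the original ('time', 'second') pairs: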

ix = pd.MultiIndex.from_arrays([df.index, df.second]) 
new_value = first_values.ffill().reindex_like(second_values).stack().reindex(ix) 
df['new_value'] = new_value.values 

In [1649]: df 
Out[1649]: 
          first second value new_value 
2013-02-01 09:00:00.000000 Fish Cat  95409 NaN 
2013-02-01 09:00:00.000001 Dog  Dog  323089 323089 
2013-02-01 09:00:00.000002 Fish Cat  785925 NaN 
2013-02-01 09:00:00.000003 Dog  Cat  866171 NaN 
2013-02-01 09:00:00.000004 NaN  NaN  665702 NaN 
2013-02-01 09:00:00.000005 Cat  NaN  104257 NaN 
2013-02-01 09:00:00.000006 NaN  NaN  152926 NaN 
2013-02-01 09:00:00.000007 Bear Cat  707747 104257
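For readers on a pandas version that provides `pd.merge_asof` (it did not exist when this question was asked), the same lookup can probably be written as a grouped as-of join. The sketch below is only an illustration under assumptions: the index is a sorted DatetimeIndex with unique timestamps, missing entries are real missing values, and `last_first_value_asof` and the temporary column name `time` are hypothetical names.

```python
import pandas as pd

def last_first_value_asof(df):
    """For each row, look up the latest 'value' whose 'first' equals this row's 'second'."""
    # Rows that ask: a non-missing 'second' label at a given time.
    left = df[['second']].dropna().rename_axis('time').reset_index()
    # Rows that can answer: a non-missing 'first' label with its value.
    right = (df[['first', 'value']].dropna(subset=['first'])
             .rename_axis('time').reset_index())
    # Backward as-of join on time, matching the 'second' label to 'first'.
    merged = pd.merge_asof(left, right, on='time',
                           left_by='second', right_by='first')
    # Align back to the original rows; rows without a match become NaN.
    return merged.set_index('time')['value'].reindex(df.index)

index = pd.date_range('2013-02-01 09:00:00', periods=8, freq='us')
df = pd.DataFrame({
    'first':  ['Fish', 'Dog', 'Fish', 'Dog', None, 'Cat', None, 'Bear'],
    'second': ['Cat', 'Dog', 'Cat', 'Cat', None, None, None, 'Cat'],
    'value':  [95409, 323089, 785925, 866171, 665702, 104257, 152926, 707747],
}, index=index)
df['new_value'] = last_first_value_asof(df)
print(df)
```

Exact-time matches are allowed by default in `pd.merge_asof`, so a row whose 'first' and 'second' are the same label picks up its own value, as in the Dog/Dog row above.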