Speeding up conversion of timestamps to datetime in Python

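I am building a replay system for futures market tick data using Python and PyTables, with a fairly large dataset (200+ GB).

As far as I can tell, PyTables can only store my timestamps as numpy datetime64 objects. This is a problem because I need to convert them to datetime objects or pandas Timestamps so that the trading module can call methods such as time, weekday, or month on the incoming data. Trying to convert billions of rows at runtime makes the system essentially unusable.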

pd.to_datetime(my_datetime64) 
datetime.datetime(my_datetime64)

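are both too slow.

Here is how I import my thousands of raw CSV files into the PyTables store. Note that the index is in pandas datetime format, which lets me access timestamp information such as time, month, year, and so on.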

import pandas as pd
from pandas import HDFStore

store = HDFStore(store_dir)

for file in files:
    df = pd.read_csv("/TickData/" + file)
    # Build a datetime index from the separate date and time columns
    df.index = pd.to_datetime(df['date'].apply(str) + " " + df['time'], format='%Y%m%d %H:%M:%S.%f')
    df.drop(['date', 'time'], axis=1, inplace=True)
    store.append('ticks', df, complevel=9, complib='blosc')

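Here is what the data looks like when I read a chunk back with the PyTables Table.read method. You can see that the pandas Timestamps have all been converted to datetime64: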

array([(1220441851000000000, [b'ESU09'], [1281.0], [1]), 
     (1226937439000000000, [b'ESU09'], [855.75], [2]), 
     (1230045292000000000, [b'ESU09'], [860.0], [1]), ..., 
     (1244721917000000000, [b'ESU09'], [943.75], [1]), 
     (1244721918000000000, [b'ESU09'], [943.75], [2]), 
     (1244721920000000000, [b'ESU09'], [944.0], [15])], 
     dtype=[('index', '<i8'), ('values_block_0', 'S5', (1,)), ('values_block_1', '<f8', (1,)), ('values_block_2', '<i8', (1,))])
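For reference, the `index` field in the dtype above is `'<i8'`, so the values read back are plain 64-bit integers, which appear to be nanoseconds since the Unix epoch. A minimal sketch, assuming that interpretation, of reinterpreting such integers as datetime64 without any per-row conversion (the array `raw` is a hypothetical stand-in built from two of the values shown above):

```python
import numpy as np

# Two sample values copied from the 'index' column shown above
# (assumed to be int64 nanoseconds since the Unix epoch).
raw = np.array([1220441851000000000, 1226937439000000000], dtype='int64')

# Reinterpret the same 8-byte values as datetime64[ns]; no copy, no loop.
stamps = raw.view('datetime64[ns]')

print(stamps[0])  # 2008-09-03T11:37:31.000000000
```

Because `view` only changes how the bytes are interpreted, its cost does not grow with the number of rows.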

Here is how I read the table back in chunks:

import tables

chunksize = 100000
nrows = 1000000000
n_chunks = nrows // chunksize + 1
h5f = tables.open_file(store_directory, 'r')
t = h5f.get_node('/', 'ticks')

for i in range(n_chunks):
    chunk = t.table.read(i * chunksize, (i + 1) * chunksize)
    for c in chunk:
        # This is where we would convert c[0], which is the timestamp:
        # pd.to_datetime(c[0]) or datetime.datetime(c[0]); both are too slow.
        pass

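My question ultimately is:

1: Is there a faster way to convert the datetime64 values into datetimes or pandas Timestamps, perhaps something involving Cython?

OR 2: Is there a way to store pandas Timestamps in the HDF file so that they do not need to be converted on read?

Thanks

Source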

2016-06-10 Brom Quinn

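How unique are your timestamps, and what is their resolution? – rrauenza

@rrauenza, they are probably 85% unique, with millisecond resolution –

OK. If the values had repeated a lot, I was going to suggest LRU memoization. – rrauenza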
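For readers unfamiliar with the memoization idea mentioned in the comment above, here is a minimal sketch using `functools.lru_cache`. The helper name `to_timestamp` and the `maxsize` value are illustrative assumptions, not part of the thread. With roughly 85% unique timestamps the cache would rarely be hit, so this would probably help only on data with many repeats.

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=1_000_000)  # maxsize is an arbitrary illustrative choice
def to_timestamp(ns):
    """Convert integer nanoseconds since the epoch to a pandas Timestamp."""
    return pd.Timestamp(ns)


# A repeated value is served from the cache instead of being converted again.
for ns in (1244721917000000000, 1244721917000000000, 1244721918000000000):
    ts = to_timestamp(ns)

print(to_timestamp.cache_info())
```

The cache is keyed on the integer value, so it trades memory for the conversions it skips.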

Try this:

import numpy 
from datetime import datetime 

npdt = numpy.datetime64(datetime.utcnow()) 
dt = npdt.astype(datetime)

I found it to be an order of magnitude faster:

from datetime import datetime 
import numpy 
import pandas 
import timeit 

foo = numpy.datetime64(datetime.utcnow()) 
print(foo.astype(datetime)) 
print(pandas.to_datetime(foo)) 

print(timeit.timeit('foo.astype(datetime)', setup='import numpy; import pandas; from datetime import datetime; foo = numpy.datetime64(datetime.utcnow())')) 
print(timeit.timeit('pandas.to_datetime(foo)', setup='import numpy; import pandas; from datetime import datetime; foo = numpy.datetime64(datetime.utcnow())'))

Output:

2016-06-10 20:51:11.745616 
2016-06-10 20:51:11.745616 
1.916042190976441 
37.38387820869684
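The timing above converts one scalar at a time. As a possible alternative for chunked reads, the sketch below converts a whole column in one vectorized call. It assumes the stored index holds int64 nanoseconds since the epoch, and `index_ns` is a hypothetical stand-in for a chunk's `index` column. It also illustrates a caveat: `astype(datetime)` on nanosecond-resolution datetime64 values yields integers rather than datetime objects, so the values are cast to microseconds first.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Stand-in for one chunk's 'index' column (assumed int64 nanoseconds since epoch).
index_ns = np.array([1244721917000000000,
                     1244721918000000000,
                     1244721920000000000], dtype='int64')

# One vectorized call converts the whole column to a DatetimeIndex.
idx = pd.to_datetime(index_ns, unit='ns')
hours = idx.hour        # one integer per row
weekdays = idx.weekday  # Monday=0 ... Sunday=6
months = idx.month

# Caveat: nanosecond datetime64 values turn into plain ints under astype(datetime),
# so cast to microsecond resolution first to get datetime objects.
as_datetimes = (index_ns.view('datetime64[ns]')
                .astype('datetime64[us]')
                .astype(datetime))
```

Whether this fits the replay loop depends on whether the trading module can accept per-chunk arrays of hours, weekdays, and months rather than one Timestamp per row.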

Source

2016-06-10 20:49:07 rrauenza

I'll give this a shot when I get home, thanks! –

I used this question as a reference: http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64 – rrauenza
