SSD implementation in Keras: training stops after a few iterations without any output or error

A few iterations into the first epoch, the training process stops without any output or error message. I am using the Keras SSD implementation from https://github.com/rykov8/ssd_keras

base_lr = 3e-4 
#optim = keras.optimizers.Adam(lr=base_lr) 
optim = keras.optimizers.RMSprop(lr=base_lr) 
#optim = keras.optimizers.SGD(lr=base_lr, momentum=0.9, decay=decay, nesterov=True) 
model.compile(optimizer=optim, 
       loss=MultiboxLoss(NUM_CLASSES+1, neg_pos_ratio=2.0).compute_loss) 



nb_epoch = 10 
history = model.fit_generator(gen.generate(True), gen.train_batches, 
           nb_epoch, verbose=1, 
           callbacks=None, 
           validation_data=gen.generate(False), 
           nb_val_samples=gen.val_batches, 
           nb_worker=1 
           )

The output of the program is as follows:

Epoch 1/10 
/home/deepesh/Documents/ssd_traffic/ssd_utils.py:119: RuntimeWarning: divide by zero encountered in log 
    assigned_priors_wh) 
2017-10-15 18:00:53.763886: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:02.602807: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:03.831092: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:03.831138: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.10GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:04.774444: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:05.897872: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.46GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:05.897923: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.94GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:09.133494: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:09.133541: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:11.266114: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
13/14 [==========================>...] - ETA: 9s - loss: 2.9617

After this point there is no further output and no error message.

Source

2017-10-15 Deepesh Lekhak

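You do not have enough GPU memory. Things you can do to work around this:

Reduce the batch size
Reduce the size of the training data
Train your model in the cloud (AWS, Google Cloud, etc.)
Use another GPU card with more memory
Or try training on the CPU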
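
A minimal sketch of the first suggestion. The helper names `batches_per_epoch` and `halving_schedule` and the dataset size are hypothetical; they are not part of Keras or ssd_keras. The sketch only shows which batch sizes you might try in turn, and how the number of batches per epoch changes with each one, so that the counts passed to `fit_generator` can be kept in sync with the generator.

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Batches needed to see every sample once (the last batch may be partial)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_samples / batch_size)


def halving_schedule(start_batch_size, min_batch_size=1):
    """Batch sizes to try in order, halving until min_batch_size is reached."""
    sizes = []
    size = start_batch_size
    while size >= min_batch_size:
        sizes.append(size)
        if size == min_batch_size:
            break
        size = max(size // 2, min_batch_size)
    return sizes


if __name__ == "__main__":
    num_train_samples = 200  # hypothetical dataset size, for illustration only
    for bs in halving_schedule(16, min_batch_size=2):
        print(bs, batches_per_epoch(num_train_samples, bs))
```

Try each size in turn, rebuilding the data generator with it, and stop at the first one for which training no longer stalls or floods the log with allocator warnings.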

Source

2017-10-15 14:35:22 Paddy

I had already trained the model on an AWS g2.8xlarge instance, but the problem was not solved. When I reduced the batch size to 2, the problem went away.

Good to hear :) – Paddy
