2017-04-22 203 views
0

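TensorFlow Nvidia 1070 GPU memory allocation errors: how to troubleshoot?

Long shot: Ubuntu 16.04, Nvidia 1070 with 8 GB on board. The machine has 64 GB of RAM, the dataset is 1 million records, and I am running the current CUDA and cuDNN libraries, TensorFlow 1.0 and Python 3.6.

Not sure how to troubleshoot this?

I have been trying to get some models up and running in TensorFlow and have run into this phenomenon a number of times. I don't know of anything other than TensorFlow that is using GPU memory?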

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.56GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cu

Then I get the following, which suggests that some kind of memory allocation is going on? But it still fails.

`I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 899200000 totalling 4.19GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1649756928 totalling 1.54GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.40GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:     8499298304 
InUse:     6875780608 
MaxInUse:    6878976000 
NumAllocs:      338 
MaxAllocSize:   1649756928 

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ******************************************************************************************xxxxxxxxxx 
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 6.10MiB. See logs for memory state. 
W tensorflow/core/framework/op_kernel.cc:993] Internal: Dst tensor is not initialized. 
    [[Node: linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice/_1055 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1643_linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]] 

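`

Update: I reduced the record count from millions to 40,000 and got a basic model to run to completion. I still get an error message, but not continuously. I also get a bunch of text in the model output suggesting that I restructure the model, and I suspect the data structure is a big part of the problem. I could still use some better hints on how to debug the whole process. Below is the rest of the console output: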

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1070 
major: 6 minor: 1 memoryClockRate (GHz) 1.645 
pciBusID 0000:01:00.0 
Total memory: 7.92GiB 
Free memory: 7.52GiB 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0) 
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY 
[I 09:13:09.297 NotebookApp] Saving file at /Documents/InfluenceH/Working_copies/Cond_fcast_wkg/TensorFlow+DNNLinearCombinedClassifier+for+Influence.ipynb 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0) 
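
For anyone reading the allocator statistics in the first log, the byte counts are easier to compare once converted to GiB. The following is only a minimal sketch: the helper name `parse_bfc_stats` and the embedded `LOG` string are illustrative assumptions (the numbers are copied from the Stats block above), and nothing here is part of TensorFlow itself.

```python
import re

# Stats block copied from the bfc_allocator output in the question.
LOG = """\
Limit:     8499298304
InUse:     6875780608
MaxInUse:    6878976000
NumAllocs:      338
MaxAllocSize:   1649756928
"""

GIB = 1024 ** 3  # bytes per GiB


def parse_bfc_stats(text):
    """Return {name: int} for every 'Name: value' line in the text.

    Hypothetical helper for illustration; it only pattern-matches log text.
    """
    pattern = r"^(\w+):\s+(\d+)\s*$"
    return {m.group(1): int(m.group(2))
            for m in re.finditer(pattern, text, flags=re.MULTILINE)}


stats = parse_bfc_stats(LOG)
limit_gib = stats["Limit"] / GIB
in_use_gib = stats["InUse"] / GIB

print(f"limit  {limit_gib:.2f} GiB")
print(f"in use {in_use_gib:.2f} GiB ({stats['InUse'] / stats['Limit']:.0%} of limit)")
```

With the values from the log this gives a limit of 7.92 GiB and 6.40 GiB in use, which agrees with the "Sum Total of in-use chunks" line. Note that the limit equals the full 7.92G that the CUDA driver had already failed to allocate.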
+0

This unanswered question is very similar: http://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu?rq=1 – dartdog

Answers

1

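I think the problem is that TensorFlow tries to allocate 7.92GB of GPU memory while only 7.56GB is actually free. I cannot tell you why the rest of the GPU memory is occupied, but you may be able to avoid the problem by limiting the fraction of GPU memory that the program is allowed to allocate: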

sess_config = tf.ConfigProto() 
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 
with tf.Session(config=sess_config, ...) as ...: 

With this, the program will allocate only 90% of the GPU memory, i.e. about 7.13GB.
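
As a quick check of that arithmetic, here is a small sketch that computes the memory budget for a given fraction and the largest fraction that would still fit into the memory reported as free. The function names and the `headroom_gib` parameter are my own illustrative assumptions, not TensorFlow API; only the 7.92 and 7.56 figures come from the log in the question.

```python
def fraction_budget_gib(total_gib, fraction):
    """Memory (in GiB) a process may claim for a given fraction of total GPU memory."""
    return total_gib * fraction


def max_safe_fraction(total_gib, free_gib, headroom_gib=0.0):
    """Largest fraction of total memory that fits into free memory minus headroom.

    headroom_gib is a hypothetical safety margin; the result is capped at 1.0.
    """
    usable = max(free_gib - headroom_gib, 0.0)
    return min(usable / total_gib, 1.0)


# Values reported by TensorFlow in the question's log.
total_gib, free_gib = 7.92, 7.56

print(round(fraction_budget_gib(total_gib, 0.9), 2))     # 7.13
print(round(max_safe_fraction(total_gib, free_gib), 3))  # 0.955
```

So with these numbers any fraction below roughly 0.95 stays within the free memory, and 0.9 leaves some margin. The free amount can differ from run to run (the second log in the question shows 7.52GiB), so it is worth recomputing from the current log.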

+0

I don't get what should go in place of the ... in the last line? Also see my update... – dartdog

+1

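The dots between the parentheses can be replaced with any other options used to initialize tf.Session(). These would be options you may already have specified, if any. If you have no further arguments, remove the comma and the dots. Before the ":" you define the name you will use for the tf.Session(), e.g. `with tf.Session(config=sess_config) as sess_name:` – ml4294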

+0

Big help! I think I still need to restructure the model, but I am past the initial error – dartdog