2017-08-17 149 views
2

When I run my TensorFlow application, it only prints "Killed". How can I debug this? Why does TensorFlow just print "Killed"?

source code

[email protected]:~/tensorflow# python sample_cnn.py 
INFO:tensorflow:Using default config. 
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_model_dir': 'data/convnet_model', '_save_summary_steps': 100} 
INFO:tensorflow:Create CheckpointSaverHook. 
2017-08-17 12:56:53.160481: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160536: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160545: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160550: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 
2017-08-17 12:56:53.160555: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 
Killed 

Thanks! You guys are awesome! After tuning my parameters, I can run it on a laptop with 16 GB of RAM. – reachlin

Answers

4

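When I run your code I get the same behavior. If you type dmesg afterwards, you will see a trace like the following, which confirms what gdelab was hinting at: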

[38607.234089] python3 invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0 
[38607.234090] python3 cpuset=/ mems_allowed=0 
[38607.234094] CPU: 3 PID: 1420 Comm: python3 Tainted: G   O 4.9.0-3-amd64 #1 Debian 4.9.30-2+deb9u2 
[38607.234094] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.2.4 03/29/2017 
[38607.234096] 0000000000000000 ffffffffa9f28414 ffffa50090317cf8 ffff940effa5f040 
[38607.234097] ffffffffa9dfe050 0000000000000000 0000000000000000 0101ffffa9d82dd0 
[38607.234098] e09c7db7f06d0ac2 00000000ffffffff 0000000000000000 0000000000000000 
[38607.234100] Call Trace: 
[38607.234104] [<ffffffffa9f28414>] ? dump_stack+0x5c/0x78 
[38607.234106] [<ffffffffa9dfe050>] ? dump_header+0x78/0x1fd 
[38607.234108] [<ffffffffa9d8047a>] ? oom_kill_process+0x21a/0x3e0 
[38607.234109] [<ffffffffa9d800fd>] ? oom_badness+0xed/0x170 
[38607.234110] [<ffffffffa9d80911>] ? out_of_memory+0x111/0x470 
[38607.234111] [<ffffffffa9d85b4f>] ? __alloc_pages_slowpath+0xb7f/0xbc0 
[38607.234112] [<ffffffffa9d85d8e>] ? __alloc_pages_nodemask+0x1fe/0x260 
[38607.234113] [<ffffffffa9dd7c3e>] ? alloc_pages_vma+0xae/0x260 
[38607.234115] [<ffffffffa9db39ba>] ? handle_mm_fault+0x111a/0x1350 
[38607.234117] [<ffffffffa9c5fd84>] ? __do_page_fault+0x2a4/0x510 
[38607.234118] [<ffffffffaa207658>] ? page_fault+0x28/0x30 
... 
[38607.234158] [ pid ] uid tgid total_vm  rss nr_ptes nr_pmds swapents oom_score_adj name 
... 
[38607.234332] [ 1396] 1000 1396 4810969 3464995 6959  21  0    0 python3 
[38607.234332] Out of memory: Kill process 1396 (python3) score 568 or sacrifice child 
[38607.234357] Killed process 1396 (python3) total-vm:19243876kB, anon-rss:13859980kB, file-rss:0kB, shmem-rss:0kB 
[38607.720757] oom_reaper: reaped process 1396 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 

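This basically means that Python started consuming too much memory and the kernel decided to kill the process. If you add some prints to your code, you will see that mnist_classifier.train() is the function that is active when this happens. However, a few naive experiments (such as removing the logging and lowering the number of steps) did not seem to help.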
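The "add some prints" idea can be extended to print memory usage as well. The following is a minimal sketch, not part of the original script: `peak_rss_mib` and `log_memory` are hypothetical helper names, and the code relies on the standard-library `resource` module, which is only available on Unix-like systems. Output is flushed explicitly because a process killed by the OOM killer gets no chance to flush its buffers.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KiB, macOS reports it in bytes.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(label):
    """Print the peak memory seen so far, flushing immediately."""
    print(f"[mem] {label}: peak RSS {peak_rss_mib():.1f} MiB", flush=True)


log_memory("start")
buffer = bytearray(50 * 1024 * 1024)  # stand-in for a large allocation
log_memory("after allocating a 50 MiB buffer")
```

Calling `log_memory(...)` before and after building the model and before the call to `mnist_classifier.train()` would show which stage was running when memory last grew before the kill.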

3

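Your program is being killed by your operating system. TensorFlow has no idea why, which is why it does not output anything. It is probably due to an out-of-memory error.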

Check whether your syslog contains a line like this:

<date> <computer> kernel: [...] Out of memory: Kill process <id> (python) score <...> or sacrifice child 

If so, you need to increase the memory available to Python and/or reduce the memory your program uses.
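The log check can be automated. The sketch below searches a block of log text for the "Out of memory: Kill process" line and extracts the PID, process name and score. `find_oom_kills` is a hypothetical helper, the sample lines are copied from the dmesg output in this thread, and the regular expression is an assumption tied to that wording; other kernel versions may phrase the message differently.

```python
import re

# Pattern matching the kernel message format shown in this thread.
OOM_PATTERN = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"score (?P<score>\d+)"
)


def find_oom_kills(log_text):
    """Return a list of (pid, process name, score) for each OOM kill found."""
    return [
        (int(m["pid"]), m["name"], int(m["score"]))
        for m in OOM_PATTERN.finditer(log_text)
    ]


SAMPLE_LOG = """\
[38607.234332] Out of memory: Kill process 1396 (python3) score 568 or sacrifice child
[38607.234357] Killed process 1396 (python3) total-vm:19243876kB, anon-rss:13859980kB, file-rss:0kB, shmem-rss:0kB
"""

print(find_oom_kills(SAMPLE_LOG))
```

To use it on a real machine, pass in the text of your syslog or the captured output of dmesg instead of `SAMPLE_LOG`.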

3

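As the other commenters have said, your operating system is killing your process because it runs out of memory. You are trying to build a huge network. Look at your last dense layer: it has 65536 inputs and 65536 units. Each unit has a weight for every input, which makes 65536 * 65536 = 4294967296 weights. The dtype of the weights follows the dtype of your input, which I assume is float64, so multiply by 64 bits and you get 32 GB of weights (65536 * 65536 * 64/1024/1024/1024/8 = 32). All of these weights live in a single tensor that has to be handled as a whole, so it must fit entirely in RAM. Does your system have 32 GB of RAM?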
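The arithmetic above is easy to reproduce. The following is a small sketch with a hypothetical helper, `dense_weight_bytes`, that counts only the weight matrix of a fully connected layer (no biases, gradients or optimizer state, all of which add to the total). The float64 case is the assumption made in this answer; the float32 figure is included only for comparison.

```python
GIB = 1024 ** 3


def dense_weight_bytes(n_inputs, n_units, bytes_per_value):
    """Bytes needed for the weight matrix of a dense layer (biases ignored)."""
    return n_inputs * n_units * bytes_per_value


n = 65536
float64_gib = dense_weight_bytes(n, n, 8) / GIB  # 8 bytes per float64
float32_gib = dense_weight_bytes(n, n, 4) / GIB  # 4 bytes per float32

print(f"weights: {n * n}")
print(f"float64: {float64_gib} GiB")
print(f"float32: {float32_gib} GiB")
```

Even at float32 the single weight tensor would need 16 GiB, so shrinking the layer itself is probably the more effective fix.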
