2017-08-11
0

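Theano segfault with CUDA 9.0

I am trying to install and use Theano with CUDA 9.0 on a P100 node. The installation itself went smoothly, but I get a segmentation fault (see below).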

I tried Theano 0.9.0 and Theano 0.10.0beta1, each combined with libgpuarray/pygpu 0.6.8 and 0.6.9. Every combination results in a segfault.

Here is my setup:
* RHEL 7
* GCC: 4.8.5
* CUDA: 9.0
* cuDNN: 5.1.5
* Python: 2.7.13
* cmake: 3.7.2

$ python
Python 2.7.13 (default, Aug 10 2017, 07:33:11) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import theano 
-------------------------------------------------------------------------- 
A process has executed an operation involving a call to the 
"fork()" system call to create a child process. Open MPI is currently 
operating in a condition that could result in memory corruption or 
other system errors; your job may hang, crash, or produce silent 
data corruption. The use of fork() (or system() or other calls that 
create child processes) is strongly discouraged. 

The process that invoked fork was: 

    Local host:   [[52508,1],0] (PID 3946) 

If you are *absolutely sure* that your application will successfully 
and correctly survive a call to fork(), you may disable this warning 
by setting the mpi_warn_on_fork MCA parameter to 0. 
-------------------------------------------------------------------------- 
[c460:03946] *** Process received signal *** 
[c460:03946] Signal: Segmentation fault (11) 
[c460:03946] Signal code: Invalid permissions (2) 
[c460:03946] Failing at address: 0x3fff8d48f5b0 
[c460:03946] [ 0] [0x3fff9cdf0478] 
[c460:03946] [ 1] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(load_libcuda+0x60)[0x3fff8631b5e0] 
[c460:03946] [ 2] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x3f384)[0x3fff862df384] 
[c460:03946] [ 3] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x41118)[0x3fff862e1118] 
[c460:03946] [ 4] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(gpucontext_init+0x90)[0x3fff862c7930] 
[c460:03946] [ 5] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x2c974)[0x3fff8638c974] 
[c460:03946] [ 6] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x101050)[0x3fff9cc61050] 
[c460:03946] [ 7] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x54318)[0x3fff863b4318] 
[c460:03946] [ 8] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x56530)[0x3fff863b6530] 
[c460:03946] [ 9] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554] 
[c460:03946] [10] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8e64)[0x3fff9ccc9484] 
[c460:03946] [11] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [12] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524] 
[c460:03946] [13] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [14] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524] 
[c460:03946] [15] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [16] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484] 
[c460:03946] [17] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960] 
[c460:03946] [18] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x188e50)[0x3fff9cce8e50] 
[c460:03946] [19] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18ad54)[0x3fff9ccead54] 
[c460:03946] [20] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18a540)[0x3fff9ccea540] 
[c460:03946] [21] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x2f4)[0x3fff9cceb7b4] 
[c460:03946] [22] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x15d038)[0x3fff9ccbd038] 
[c460:03946] [23] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554] 
[c460:03946] [24] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyObject_Call+0x74)[0x3fff9cbc1ab4] 
[c460:03946] [25] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x68)[0x3fff9ccbfc68] 
[c460:03946] [26] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3214)[0x3fff9ccc3834] 
[c460:03946] [27] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [28] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484] 
[c460:03946] [29] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960] 
[c460:03946] *** End of error message *** 
Segmentation fault 

Any help would be greatly appreciated. Thanks.

Answers

0

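Grab a demo MPI program in C or C++ from the web and compile it with mpicc/mpic++. Check that the compiler works, that the resulting executable runs, and that it can handle point-to-point communication between different nodes of the cluster.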

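You may have compiled Theano with the wrong mpicc, one whose underlying compiler is not binary compatible with the InfiniBand libraries (or the libraries for whatever hardware connects the machines in the cluster).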

For example, if the InfiniBand libraries were compiled with gcc and Theano was compiled with an mpicc that wraps the Intel compiler, it will not work.

You can set an environment variable to ask Open MPI's mpicc to use a different underlying compiler.
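For Open MPI specifically, the wrapper compilers read the `OMPI_CC` and `OMPI_CXX` environment variables, and `mpicc --showme` prints the command line the wrapper would run without compiling anything. The following is a minimal sketch, assuming Open MPI's wrappers are on `PATH` and a Python 3 interpreter runs the helper. The function names and the choice of `gcc`/`g++` are illustrative only, and other MPI implementations use different variables and flags.

```python
import os
import shutil
import subprocess


def env_with_backend(cc, cxx=None, base_env=None):
    """Return a copy of the environment asking Open MPI's wrappers to use cc/cxx."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_CC"] = cc          # read by Open MPI's mpicc
    if cxx is not None:
        env["OMPI_CXX"] = cxx    # read by Open MPI's mpic++ / mpicxx
    return env


def show_wrapped_command(wrapper="mpicc", env=None):
    """Return the command line the wrapper would run, or None if it is not installed."""
    path = shutil.which(wrapper)
    if path is None:
        return None
    result = subprocess.run([path, "--showme"], env=env,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            universal_newlines=True)
    return result.stdout.strip()
```

Calling `show_wrapped_command("mpicc", env_with_backend("gcc", "g++"))` should then report a command line that starts with the requested compiler, which makes it easy to confirm which backend a given wrapper really uses.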

If that machine has multiple MPI implementations built with different compilers, try using ldd to find out which shared objects (the .so files) depend on which.
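The ldd check can be scripted so that several libraries are compared at once. This is a rough sketch, assuming a Linux system with `ldd` available and Python 3 for the helper. `parse_ldd` and `shared_deps` are hypothetical helper names, and the parser only handles the common `name => path (address)` line format.

```python
import re
import shutil
import subprocess

# Matches lines such as "libmpi.so.40 => /usr/lib64/libmpi.so.40 (0x00007f...)"
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S.*?)(?:\s+\(0x[0-9a-fA-F]+\))?\s*$")


def parse_ldd(output):
    """Map each library name in ldd output to its resolved path ('not found' if missing)."""
    deps = {}
    for line in output.splitlines():
        match = _LDD_LINE.match(line)
        # Skip entries that carry only a load address and no path.
        if match and not match.group(2).startswith("("):
            deps[match.group(1)] = match.group(2)
    return deps


def shared_deps(path):
    """Run ldd on a shared object or executable and return its resolved dependencies."""
    if shutil.which("ldd") is None:
        raise RuntimeError("ldd is not available on this system")
    out = subprocess.run(["ldd", path], stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, universal_newlines=True).stdout
    return parse_ldd(out)
```

Running `shared_deps` on `libgpuarray.so.2` and on the pygpu extension module, then comparing the paths reported for the MPI, libstdc++, and libgcc entries, shows whether two pieces were linked against different runtimes.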

The best case, of course, is to compile everything with the same compiler and the same MPI wrapper, and to package the results as a few modules.

0

The answer turned out to be the gcc version and libgpuarray. For some reason, gcc-4.8.5 has a problem with libgpuarray, and that is what caused the segmentation fault.

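I installed gcc-5.4.0 in my user space and recompiled cmake and libgpuarray, along with everything else including Theano and numpy (just to be sure). After that, the segmentation fault no longer occurred.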

Another change was that the cluster admins updated CUDA to 9.0.151 with the new 384.66 driver.
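When rebuilding the stack piece by piece like this, it helps to test each attempt in a child process so that a crash does not take down the calling shell or script. The sketch below is an illustration and not part of the original fix. It assumes a POSIX system and Python 3.5+ for the helper itself (the interpreter under test can be a different one, passed as `python`), and `fatal_signal` is a made-up name.

```python
import signal
import subprocess
import sys


def fatal_signal(code, python=sys.executable):
    """Run `code` in a child interpreter; return the name of the signal that killed it, or None.

    None only means the child was not killed by a signal; it may still have
    exited with an ordinary error such as ImportError.
    """
    proc = subprocess.run([python, "-c", code],
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from that signal.
        return signal.Signals(-proc.returncode).name
    return None
```

For the case in the question, `fatal_signal("import theano", python="/path/to/python2.7")` (path hypothetical) would be expected to return `"SIGSEGV"` with the broken build and `None` once the rebuilt libgpuarray is in place.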