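CPU load runs away on an SGI machine under SGE

We are running OGE 2011.11 on an SGI UV 2000 (SMP) with 256 hyperthreaded cores (128 physical). When we run OpenMP jobs directly on the system, they run fine. Here is the job: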
#include <iostream>
#include <cstring>
#include <cstdlib>
#include <cmath>
#include <new>
#include <omp.h>
using namespace std;

int main(int argc, char* argv[]) {
#ifdef _OPENMP
    // Show how many threads we have available
    int max_t = omp_get_max_threads();
    cout << "OpenMP using up to " << max_t << " threads" << endl;
#else
    cout << "!!!ERROR!!! Program not compiled for OpenMP" << endl;
    return -1;
#endif
    const long N = 115166;
    const long bytesRequested = N * N * sizeof(double);
    cout << "Allocating " << bytesRequested << " bytes for matrix" << endl;
    // Plain new throws std::bad_alloc on failure; nothrow makes the NULL check meaningful
    double* S = new (nothrow) double[N * N];
    if (NULL == S) {
        cout << "!!!ERROR!!! Failed to allocate " << bytesRequested << " bytes" << endl;
        return -1;
    }
    cout << "Entering main loop" << endl;
    // Fill the strict upper triangle; rows are divided statically among the threads
#pragma omp parallel for schedule(static)
    for (long i = 0; i < N - 1; i++) {
        for (long j = i + 1; j < N; j++) {
#ifdef _OPENMP
            // Report the team size once, from the thread that handles the first element
            if (0 == i && 1 == j) {
                int nThreads = omp_get_num_threads();
                cout << "OpenMP loop using " << nThreads << " threads" << endl;
            }
#endif
            S[i * N + j] = sqrt(static_cast<double>(i + j));
        }
    }
    cout << "Loop completed" << endl;
    delete[] S;  // array form to match new[]
    return 0;
}
And here is its execution:
[C++] $ ./OMPtest
OpenMP using up to 256 threads
Allocating 106105660448 bytes for matrix
Entering main loop
OpenMP loop using 256 threads
Loop completed
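However, when I submit it to the queue using the following parallel environment (or any other I have tried so far), the CPU load shoots through the roof (over 256), and the system becomes completely unresponsive and has to be rebooted. Here is my PE:
[C++] $ qconf -sp threaded
pe_name            threaded
slots              10000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
I have changed control_slaves, job_is_first_task, and slots (reducing it to 140; with anything at 140 I get the runaway load condition described above). I have even used different parallel environments that I created. I also reduced the number of slots in the queue to 140, but the load still runs away and locks up the machine. I have tried countless iterations, but here is my qsub script: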
#!/bin/sh
#$ -cwd
#$ -q sgi-test
## email on a - abort, b - begin, e - end
#$ -m abe
#$ -M <email address>
#source ~/.bash_profile
## for this job, specifying the threaded environment w a "-" ensures the max number of processors is used
#$ -pe threaded -
echo "slots = $NSLOTS"
export OMP_NUM_THREADS=$NSLOTS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
echo "Running on host=$HOSTNAME"
## memory resource request per thread, max 24 for 32 threads
#$ -l h_vmem=4G
##$ -V
##this environment variable setting is needed only for OpenMP-parallelized applications
## finally! -- run your process
<path>/OMPtest
Finally, since requesting unlimited processors/slots kept crashing the machine, I specified:
#$ -pe threaded 139
Anything above 139 crashes the machine, yet there is no output in mcelog or /var/log/messages. Any insight into what might be happening would be greatly appreciated!
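Wow, no responses? I get it, this is a tough one lol – steelah1
Solved it myself. I added the "-V" option to the script to export my environment variables to OGE/SGE, since the job ran fine in my environment outside the scheduler. It has not crashed on any run since. I could track down the variable causing the problem by elimination/trial and error, but I have a lot of variables. In short, "-V" fixes a lot of problems, especially if your job runs fine outside OGE/SGE. – steelah1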