I have been writing some code using the PETSc library, and now I am going to change a part of it to run in parallel. Most of the things I want to parallelize are the matrix initialization and the parts where I generate and calculate a large number of values. Anyway, my problem is that if I run the code with more than one core, for some reason every part of the code runs as many times as there are cores. How do I write parts of the code to run in parallel with C++ and MPI?
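From what I understand so far, with MPI every process executes the whole program (SPMD), so work that should happen only once, or only on part of the data, has to be guarded explicitly by rank. Here is a minimal sketch of what I mean; this is not my real code, just plain MPI to illustrate the idea, and the way I split the index range is only an assumption about how the work could be divided:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Only one process executes this block, so it prints once.
        std::cout << "Solving began..." << std::endl;
    }

    // Each process handles its own slice of the global index range [0, N).
    const int N = 120000;
    int chunk = N / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? N : start + chunk;
    std::cout << "Rank " << rank << " would fill rows " << start
              << " - " << end << std::endl;

    MPI_Finalize();
    return 0;
}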
Here is the PETSc and MPI test code I am working with:
#include <petscmat.h>
#include <iostream>
#include <ctime>
#include <string>
using namespace std;

int main(int argc, char** argv)
{
  time_t rawtime;
  time (&rawtime);
  string sta = ctime (&rawtime);
  cout << "Solving began..." << endl;
  PetscInitialize(&argc, &argv, 0, 0);
  Mat            A;        /* linear system matrix */
  PetscInt       i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
  PetscErrorCode ierr;
  PetscBool      flg = PETSC_FALSE;
  PetscScalar    v;
#if defined(PETSC_USE_LOG)
  PetscLogStage  stage;
#endif
  /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
         Compute the matrix and right-hand-side vector that define
         the linear system, Ax = b.
     - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
  /*
     Create parallel matrix, specifying only its global dimensions.
     When using MatCreate(), the matrix format can be specified at
     runtime. Also, the parallel partitioning of the matrix is
     determined by PETSc at runtime.
     Performance tuning note: For problems of substantial size,
     preallocation of matrix memory is crucial for attaining good
     performance. See the matrix chapter of the users manual for details.
  */
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  /*
     Currently, all PETSc parallel matrix formats are partitioned by
     contiguous chunks of rows across the processors. Determine which
     rows of the matrix are locally owned.
  */
  ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
  /*
     Set matrix elements for the 2-D, five-point stencil in parallel.
      - Each processor needs to insert only elements that it owns
        locally (but any non-local elements will be sent to the
        appropriate processor during matrix assembly).
      - Always specify global rows and columns of matrix entries.
     Note: this uses the less common natural ordering that orders first
     all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
     instead of J = I +- m as you might expect. The more standard ordering
     would first do all variables for y = h, then y = 2h etc.
  */
  PetscMPIInt rank; // processor rank
  PetscMPIInt size; // size of communicator
  MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
  MPI_Comm_size(PETSC_COMM_WORLD,&size);
  cout << "Rank = " << rank << endl;
  cout << "Size = " << size << endl;
  cout << "Generating 2D-Array" << endl;
  double temp2D[120000][3];
  for (Ii=Istart; Ii<Iend; Ii++) {
    for (J=0; J<n; J++) {
      temp2D[Ii][J] = 1;
    }
  }
  cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;
  v = -1.0;
  for (Ii=Istart; Ii<Iend; Ii++) {
    for (J=0; J<n; J++) {
      ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
    }
  }
  cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;
  /*
     Assemble matrix, using the 2-step process:
       MatAssemblyBegin(), MatAssemblyEnd()
     Computations can be done while messages are in transition
     by placing code between these two statements.
  */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr); /* free the matrix before shutting down */
  ierr = PetscFinalize();              /* PetscInitialize() was used, so finalize through PETSc rather than MPI_Finalize() */
  cout << "No more MPI" << endl;
  return 0;
}
My real program has several different .cpp files; the above is just simple example code. I initialize MPI in the main program and call a function in another .cpp file, where I implement the same kind of matrix filling, but every cout the program does before filling the matrix is printed as many times as I have cores.
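As far as I can tell, those couts run once per rank because every MPI process executes the whole program, so the only fix I can think of is to restrict the printing to a single rank. A rough sketch of what I mean (the helper name is made up for illustration, it is not from my real code):

#include <petscsys.h>
#include <iostream>

// Hypothetical helper: emit a status line exactly once per communicator.
void report_once(MPI_Comm comm, const char* msg)
{
    PetscMPIInt rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {                 // only the first rank writes to stdout
        std::cout << msg << std::endl;
    }
    // Equivalent PETSc way: PetscPrintf(comm, "%s\n", msg);
    // PetscPrintf prints only from the first process of the communicator.
}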
I can run my test program with mpiexec -n 4 test and it runs successfully, but for some reason when I run my real program the same way, mpiexec -n 4 ./myprog, the output is the following:
Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
no more MPI
no more MPI
no more MPI
no more MPI
Edit after the two comments: My goal is to run this on a small cluster with 20 nodes, each node having 2 cores. Later it should run on a supercomputer, so MPI is definitely the way I need to go. I am currently testing it on two different machines, one with 1 processor / 4 cores and the other with 4 processors / 16 cores.
So when I run my program with mpiexec or mpirun, I should do that on our small cluster, and in that case different copies of the program get assigned to different nodes. After that the nodes just do their own work and send the parts of the matrix I want to form back to the server, which then builds the matrix? I edited my original post so you can see why I need to use MPI. – Mare2 2012-07-13 20:15:43