Panwang Zhou
pwzhou@gmail.com
By setting the MPISNC option to .T., the jobs terminated normally with Open MPI 1.8.8. Thanks.
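For reference, a minimal sketch of where the option goes in the input file, assuming a GAMESS-style input deck; the $CONTRL line and its values are placeholders for your actual job settings, and only MPISNC=.T. is the setting discussed in this thread:

```
 $CONTRL SCFTYP=RHF RUNTYP=ENERGY $END
 $SYSTEM MPISNC=.T. $END
```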
On Wed Jun 14 '17 0:02am, Alex Granovsky wrote
----------------------------------------------
>Dear Panwang Zhou,
>all known versions of Open MPI have buggy implementation of collective
>operations. Their use may result in program hangs. This problem is
>caused by design flaw of implementation of collective operations in
>Open MPI.
>Some versions of Open MPI included a "bugfix" for this flaw.
>This "bugfix" periodically synchronizes processes by calling
>MPI_Barrier after a certain number of calls to collective operations.
>As far as I know, this "bugfix" was removed in the recent versions of
>Open MPI.
>Independently on the existence of this "bugfix", Firefly has several
>specific keywords that can be used to solve problems with collective
>operations. These keywords were introduced about twenty years ago.
>They belong to the $SYSTEM group and are as follows:
>
>MXBCST (integer) - the maximum size (in DP words) of the message used
>in the broadcast operation. Default is 32768. You can change it to see
>whether this helps.
>
>MPISNC (logical) - activates the strategy in which the broadcast
>operation periodically synchronizes all MPI processes. Default is
>false. Setting it to true should resolve most buffer-overflow problems
>at the cost of somewhat reduced performance.
>
>MXBNUM (integer) - the maximum number of broadcast operations that can
>be performed before the global synchronization call is made. Relevant
>if MPISNC=.true. Default is 100.
>
>LENSNC (integer) - the maximum total length (in DP words) of all
>messages that can be broadcast before the global synchronization call
>is made. Relevant if MPISNC=.true. The default depends on the number of
>processes used (meaningful values vary from 20000 to, say, 262144 or
>even more).
>I'd suggest you try the MPISNC option first, i.e., run Firefly's job with MPISNC=.t.
>Hope this helps.
>Kind regards,
>Alex Granovsky
>
>
>
>On Fri May 26 '17 6:05am, Panwang Zhou wrote
>--------------------------------------------
>>Dear all,
>>Recently I upgraded the OS of our cluster to CentOS 7.3, and then I tried to install Firefly 8.2 for Linux/Open MPI (the dynamically linked builds for v. 1.8.x and 2.0.x).
>>I compiled Open MPI with the following commands:
>>../configure --prefix=/apps/mpi/openmpi/1.8.7/gnu_m32 CC=gcc CXX=g++ FC=gfortran CFLAGS=-m32 CXXFLAGS=-m32 FCFLAGS=-m32
>>make all install
>>Then I tried to run test jobs, and they hang after some normal calculations: Firefly is still running, but the output file stops updating. The same happens with version 2.0.2.
>>When I switch to Open MPI 1.6.5, all the calculations terminate normally.
>>So what is the problem with Open MPI 1.8.x and 2.0.x? Are any special compiler parameters needed?
>>Best Regards!
>>