Alex Granovsky
gran@classic.chem.msu.su
All known versions of Open MPI have a buggy implementation of collective
operations, and using them may cause a program to hang. The problem is
caused by a design flaw in how Open MPI implements collective
operations.
Some versions of Open MPI included a "bugfix" for this flaw.
This "bugfix" periodically synchronizes processes by calling
MPI_Barrier after a certain number of calls to collective operations.
As far as I know, this "bugfix" was removed in recent versions of
Open MPI.
Regardless of whether this "bugfix" is present, Firefly has several
specific keywords that can be used to solve problems with collective
operations. These keywords were introduced about twenty years ago.
They belong to the $SYSTEM group and are as follows:
MXBCST (integer) - the maximum size (in DP words) of a message used in a broadcast operation. Default is 32768. You can change it to see whether this helps.
MPISNC (logical) - activates a strategy in which calls to the broadcast operation periodically synchronize all MPI processes. Default is false. Setting it to true should resolve most buffer-overflow problems at the cost of somewhat reduced performance.
MXBNUM (integer) - the maximum number of broadcast operations that can be performed before a global synchronization call is made. Relevant if MPISNC=.true. Default is 100.
LENSNC (integer) - the maximum total length (in DP words) of all messages that can be broadcast before a global synchronization call is made. Relevant if MPISNC=.true. Default depends on the number of processes used (meaningful values vary from 20000 to, say, 262144 or even more).
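To illustrate the idea behind MPISNC, here is a toy model in Python. It is only a sketch and not Firefly's actual code: the class name is hypothetical, and the trigger rule (synchronize as soon as either the MXBNUM or the LENSNC limit is reached) is an assumption made for illustration.

```python
class BroadcastThrottle:
    """Toy model of periodic synchronization between broadcasts.

    Hypothetical helper: the attribute names mirror the $SYSTEM keywords,
    but the exact trigger rule is an assumption, not Firefly's real logic.
    """

    def __init__(self, mxbnum=100, lensnc=262144, barrier=None):
        self.mxbnum = mxbnum                      # max broadcasts between syncs
        self.lensnc = lensnc                      # max total DP words between syncs
        self.barrier = barrier or (lambda: None)  # stand-in for a global barrier
        self.count = 0                            # broadcasts since last sync
        self.words = 0                            # DP words sent since last sync
        self.syncs = 0                            # number of syncs performed

    def after_broadcast(self, nwords):
        """Record one broadcast of nwords DP words; sync if a limit is hit."""
        self.count += 1
        self.words += nwords
        if self.count >= self.mxbnum or self.words >= self.lensnc:
            self.barrier()  # in a real MPI program: a call such as MPI_Barrier
            self.syncs += 1
            self.count = 0
            self.words = 0
            return True
        return False


# 250 broadcasts of 1000 DP words each, with the limits 100 and 262144.
throttle = BroadcastThrottle(mxbnum=100, lensnc=262144)
for _ in range(250):
    throttle.after_broadcast(1000)
print(throttle.syncs)  # prints 2: the count limit is hit at broadcasts 100 and 200
```

In this model, smaller limits mean more frequent synchronization, which is the trade-off between robustness and performance mentioned above.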
I'd suggest trying the MPISNC option first, i.e. run your Firefly job with MPISNC=.t.
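In the input file this would look something like the fragment below. The first line is the simple case. The second is a variant with explicit limits, where the MXBNUM and LENSNC values are only illustrative picks and not recommendations. Use one $SYSTEM line or the other, and merge the keywords into your existing $SYSTEM group if your input already has one.

```
 $SYSTEM MPISNC=.t. $END

 $SYSTEM MPISNC=.t. MXBNUM=50 LENSNC=65536 $END
```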
Hope this helps.
Kind regards,
Alex Granovsky
On Fri May 26 '17 6:05am, Panwang Zhou wrote
--------------------------------------------
>Dear all,
>Recently I upgraded the OS of our cluster to CentOS 7.3, and then I tried to install the Firefly 8.2 Linux/OpenMPI v. 1.8.x, dynamically linked version and 2.0.x.
>I compiled Open MPI with the following commands:
>../configure --prefix=/apps/mpi/openmpi/1.8.7/gnu_m32 CC=gcc CXX=g++ FC=gfortran CFLAGS=-m32 CXXFLAGS=-m32 FCFLAGS=-m32
>make all install
>Then I tried to run test jobs, and the jobs hang after some normal calculations: Firefly keeps running, but the output file is no longer updated. The same happens with version 2.0.2.
>When I switched to Open MPI 1.6.5, all the calculations terminated normally.
>So what is the problem with Open MPI 1.8.x and 2.0.x? Are some special compiler parameters needed?
>Best Regards!
>