Firefly and PC GAMESS-related discussion club


Re: Problem running Firefly 8.2 with Open MPI 1.8.x and 2.0.x

Alex Granovsky

Dear Panwang Zhou,

All known versions of Open MPI have a buggy implementation of collective
operations, and using them may cause programs to hang. The problem stems
from a design flaw in how Open MPI implements collective operations.

Some versions of Open MPI included a "bugfix" for this flaw. This
"bugfix" periodically synchronizes processes by calling MPI_Barrier
after a certain number of calls to collective operations. As far as I
know, this "bugfix" has been removed in recent versions of Open MPI.
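The idea behind that "bugfix" can be illustrated with a small Python sketch. It is not Open MPI's code: threads stand in for MPI processes, `threading.Barrier` stands in for MPI_Barrier, and the names PeriodicBarrier, run and n are made up for illustration. Because every process issues the same sequence of collective calls, all of them reach the barrier on the same call number, so the barrier cannot deadlock and it limits how far a fast process can run ahead of the others.

```python
import threading


class PeriodicBarrier:
    """Hypothetical wrapper: synchronize after every n collective calls.

    Threads stand in for MPI ranks and threading.Barrier for MPI_Barrier;
    this illustrates the idea only, it is not Open MPI's implementation.
    """

    def __init__(self, barrier: threading.Barrier, n: int) -> None:
        self.barrier = barrier
        self.n = n
        self.calls = 0   # collective calls made by this rank
        self.syncs = 0   # barriers this rank has passed

    def collective(self, op, *args):
        result = op(*args)
        self.calls += 1
        # All ranks issue the same sequence of collectives, so they all
        # reach this barrier on the same call number.
        if self.calls % self.n == 0:
            self.barrier.wait()
            self.syncs += 1
        return result


def run(nranks: int = 4, ncalls: int = 10, n: int = 3) -> list[int]:
    barrier = threading.Barrier(nranks)
    wrappers = [PeriodicBarrier(barrier, n) for _ in range(nranks)]

    def rank_body(w: PeriodicBarrier) -> None:
        for i in range(ncalls):
            w.collective(lambda x: x, i)  # placeholder for a real broadcast

    threads = [threading.Thread(target=rank_body, args=(w,)) for w in wrappers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [w.syncs for w in wrappers]


print(run())  # [3, 3, 3, 3]: each rank syncs after calls 3, 6 and 9
```

The price of this approach is visible in the sketch: every n-th collective call now waits for the slowest process, which is why such synchronization costs some performance.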

Regardless of whether this "bugfix" is present, Firefly has several
specific keywords that can be used to solve problems with collective
operations. These keywords were introduced about twenty years ago.
They belong to the $SYSTEM group and are as follows:

MXBCST (integer) - the maximum size (in DP words) of a message
                   used in a broadcast operation. Default is 32768.
                   You can change it to see whether this helps.

MPISNC (logical) - activates a strategy in which broadcast
                   operations periodically synchronize all MPI
                   processes.

                   Default is false. Setting it to true should
                   resolve most buffer-overflow problems at the
                   cost of somewhat reduced performance.

MXBNUM (integer) - the maximum number of broadcast operations
                   which can be performed before the global
                   synchronization call is done.
                   Relevant if MPISNC=.true. Default is 100.

LENSNC (integer) - the maximum total length (in DP words) of all
                   messages that can be broadcast before the
                   global synchronization call is done.
                   Relevant if MPISNC=.true. The default depends
                   on the number of processes used (meaningful values
                   range from 20000 to, say, 262144 or even more).
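To make the interplay of these keywords more concrete, here is a hypothetical Python model. It is not Firefly's implementation: it assumes that a large broadcast is split into messages of at most MXBCST words, that each such message counts toward MXBNUM and LENSNC, and that a synchronization is done before LENSNC would be exceeded or right after MXBNUM broadcasts. The class name BcastSyncModel is made up, and the LENSNC value below is an arbitrary choice, since the real default depends on the number of processes.

```python
from dataclasses import dataclass, field


@dataclass
class BcastSyncModel:
    """Hypothetical model of MXBCST / MPISNC / MXBNUM / LENSNC (not Firefly code)."""

    mxbcst: int = 32768      # max DP words per broadcast message
    mpisnc: bool = False     # periodic synchronization on/off
    mxbnum: int = 100        # max broadcasts between synchronizations
    lensnc: int = 262144     # arbitrary here; the real default depends on process count
    nbcast: int = 0          # broadcasts since the last synchronization
    nwords: int = 0          # DP words broadcast since the last synchronization
    log: list[str] = field(default_factory=list)

    def _sync(self) -> None:
        self.log.append("barrier")  # stands in for the global synchronization call
        self.nbcast = 0
        self.nwords = 0

    def broadcast(self, total_words: int) -> None:
        remaining = total_words
        while remaining > 0:
            chunk = min(remaining, self.mxbcst)  # split according to MXBCST
            if self.mpisnc and self.nwords > 0 and self.nwords + chunk > self.lensnc:
                self._sync()  # sending this chunk would exceed LENSNC
            self.log.append(f"bcast {chunk}")
            remaining -= chunk
            if self.mpisnc:
                self.nbcast += 1
                self.nwords += chunk
                if self.nbcast >= self.mxbnum:
                    self._sync()  # MXBNUM broadcasts since the last synchronization


model = BcastSyncModel(mxbcst=4, mpisnc=True, lensnc=6)
model.broadcast(10)
print(model.log)  # ['bcast 4', 'barrier', 'bcast 4', 'bcast 2']
```

With mpisnc=False the model only splits messages according to MXBCST and never synchronizes, which mirrors the default MPISNC=.false. behaviour described above.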

I'd suggest trying the MPISNC option first, i.e., running the Firefly job with MPISNC=.t.
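In the input file this means adding MPISNC to the $SYSTEM group. The small Python helper below writes such a line, which may be handy for generating input variants when also experimenting with MXBCST, MXBNUM or LENSNC. It assumes the usual GAMESS-style free-format syntax, with the group name starting in column 2 and the group closed by $END; the function name system_group is made up. If the input already has a $SYSTEM group (for example with MWORDS), add the keyword to that group rather than writing a second one.

```python
def system_group(mpisnc=None, mxbnum=None, lensnc=None, mxbcst=None) -> str:
    """Return a one-line $SYSTEM group containing only the keywords that are set.

    Hypothetical helper; assumes GAMESS-style input (group name in column 2).
    """
    parts = [" $SYSTEM"]  # leading blank puts the '$' in column 2
    if mpisnc is not None:
        parts.append("MPISNC=.T." if mpisnc else "MPISNC=.F.")
    for name, value in (("MXBNUM", mxbnum), ("LENSNC", lensnc), ("MXBCST", mxbcst)):
        if value is not None:
            parts.append(f"{name}={int(value)}")
    parts.append("$END")
    return " ".join(parts)


print(system_group(mpisnc=True))          # " $SYSTEM MPISNC=.T. $END"
for words in (32768, 16384, 8192):        # variants with smaller broadcast messages
    print(system_group(mpisnc=True, mxbcst=words))
```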

Hope this helps.

Kind regards,
Alex Granovsky

On Fri May 26 '17 6:05am, Panwang Zhou wrote
>Dear all,

>Recently I upgraded the OS of our cluster to CentOS 7.3, and then I tried to install the Firefly 8.2 Linux/OpenMPI v. 1.8.x (dynamically linked) and 2.0.x versions.

>I compiled Open MPI with the following commands:
>../configure --prefix=/apps/mpi/openmpi/1.8.7/gnu_m32 CC=gcc CXX=g++ FC=gfortran CFLAGS=-m32 CXXFLAGS=-m32 FCFLAGS=-m32
>make all install

>Then I ran some test jobs, and they hang after some normal calculations: Firefly keeps running, but the output file is no longer updated. The same happens with version 2.0.2.

>When I switched to Open MPI 1.6.5, all the calculations terminated normally.

>So what is the problem with Open MPI 1.8.x and 2.0.x? Are some special compiler parameters needed?

>Best Regards!


Wed Jun 14 '17 0:02am