Actually, this is a little bit tricky.
The SCF part of MP4 runs in parallel in both direct and
conventional modes. Conventional SCF can also use several
compute threads. However, threading is not implemented for
direct SCF runs, and threaded conventional SCF does not scale
very well because I/O becomes the bottleneck.
The integral transformation stage is duplicated across all
instances of any parallel MP3 or MP4 run. Every process
performs the same work and requires its own copy of the
file containing the transformed integrals. This will be changed
in future releases.
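To make the disk-space consequence concrete, here is a minimal Python sketch. It only assumes what is said above: every process keeps its own copy of the transformed-integrals file, so scratch usage grows linearly with the number of processes. The function names and the sizes in the example are hypothetical and are not taken from Firefly or from any real run.

```python
def total_scratch_gb(integral_file_gb: float, n_processes: int) -> float:
    """Scratch space needed if each process writes its own copy of the file."""
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    return integral_file_gb * n_processes


def max_processes_for_disk(disk_gb: float, integral_file_gb: float) -> int:
    """Largest number of processes whose duplicated files still fit on disk."""
    if integral_file_gb <= 0:
        raise ValueError("integral_file_gb must be positive")
    return int(disk_gb // integral_file_gb)


if __name__ == "__main__":
    # Hypothetical sizes: a 50 GB integral file and a 120 GB scratch volume.
    print(total_scratch_gb(50.0, 8))            # 8 copies of the file
    print(max_processes_for_disk(120.0, 50.0))  # copies that fit on the volume
```

This is only an estimate of the duplicated file; other scratch files are not counted.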
The MP4(SDQ) part of MP4 calculations is completely multithreaded.
However, different parts of this code scale differently with the
number of working threads. The code is good, but it was written
many years ago and was not designed to run in parallel at that
time. Most likely, this will be changed in the future; however,
the current code just duplicates the SDQ part of the
computation in every process when running in parallel.
The MP4(T) part of the code is the most advanced (and time-consuming) one.
It runs in parallel, with a variety of execution submodes available.
It is also completely threaded, and one can use both levels of
parallelism at once. Moreover, there is a special option of the MP4
code ($mp4 tonly=.t. $end) to perform just the MP4(T) calculation,
skipping the SDQ part and some unneeded steps of the integral transformation.
The optimal way to perform large-scale calculations is thus
a bit tedious.
First, one needs to perform a parallel SCF calculation to get
the converged SCF vectors. The SCF step can be either direct or
conventional, depending on which is faster on your hardware.
Second, one needs to perform the MP4(SDQ) calculation on a standalone
SMP system using the thread-level parallel code. One needs to provide
the converged SCF vectors ($vec group) as the initial guess. Again,
this can be either a direct or a conventional calculation.
Finally, one needs to run the MP4(T) part in the mixed parallel+threaded
model using as many nodes as possible. One may provide the converged
SCF vectors ($vec group) as the initial guess. Once again, this can
be either a direct or a conventional calculation. The optimal number of
threads per process depends on the particular computer architecture.
In most cases, it is equal to the number of physical cores
sharing the same memory domain on a cc-NUMA system. E.g., assuming
you have a 2-way quad-core Xeon 5500 system, the best way is to use 2
processes per box with four working threads each. However, this can
be adjusted to minimize I/O or scratch storage, and the code is
flexible here. Firefly version 7.1.G supports dynamic load
balancing for the MP4(T) part, both between threads and between
processes, so it is wise to activate the p2p and dlb options.
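The rule of thumb for choosing processes and threads can be sketched in a few lines of Python. This is only an illustration under stated assumptions: one process per memory domain by default, and, when scratch space limits the number of processes, whole memory domains are merged so that all cores stay busy. The function name and the max_processes parameter are hypothetical helpers, not Firefly options.

```python
from typing import Optional, Tuple


def hybrid_layout(numa_domains: int, cores_per_domain: int,
                  max_processes: Optional[int] = None) -> Tuple[int, int]:
    """Return (processes per node, threads per process).

    Default: one process per memory domain, one thread per physical core
    in that domain. If max_processes is given (e.g. a scratch-space limit),
    use the largest divisor of numa_domains that does not exceed it, so
    each process covers a whole number of memory domains.
    """
    if numa_domains < 1 or cores_per_domain < 1:
        raise ValueError("domain and core counts must be at least 1")
    processes = numa_domains
    if max_processes is not None:
        if max_processes < 1:
            raise ValueError("max_processes must be at least 1")
        processes = max(d for d in range(1, numa_domains + 1)
                        if numa_domains % d == 0 and d <= max_processes)
    threads = (numa_domains // processes) * cores_per_domain
    return processes, threads


if __name__ == "__main__":
    # The 2-way quad-core example from the text above.
    print(hybrid_layout(2, 4))                   # (2, 4)
    # An 8-socket quad-core box limited to 2 processes by scratch space.
    print(hybrid_layout(8, 4, max_processes=2))  # (2, 16)
```

In the capped case each process has threads on several memory domains, which may cost some efficiency, so it is worth timing a short run before committing to a layout.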
Hope this helps.
On Wed Mar 3 '10 0:59am, Jiri Wiesner wrote
>Dear Firefly users and developers,
>I would like to run Firefly on a machine with 8 processors, each with 4 cores under 64bit/32bit emul. Debian GNU/Linux 5.0.3, kernel 2.6.30, libc 2.7. I'm using the Open MPI version of Firefly, as Open MPI is installed on our cluster in version 1.2.9. Our 32 CPU machine has only one RAID device, where I can store integrals during MP4 integral transformation.
>And here is the obstacle: When I use 8 parallel processes through OMPI (mpirun -np 8), I'm running out of the disk space soon. When I use MKLNP=8 flag for SMP, the first HF computation and some other part are running on 1 CPU (slow on larger systems with large basis sets). I chose to run 2 OMPI processes and use MKLNP=4. In this case the space for the DASORT and MOINTS files is twice as large, but the machine can manage that.
>My question is: Is there a better way to run on one node on as many CPUs as possible and use as little disk space as possible? Is it possible to force MPI processes to use a single set of integral files?
>Thank you for your answers.