>Actually, this is a little bit tricky.
>The SCF part of MP4 runs in parallel in both direct and
>conventional modes. Conventional SCF can also use several
>compute threads. However, threading is not implemented for
>direct SCF runs, and threaded conventional SCF does not
>scale very well because I/O becomes the bottleneck.
>The integral transformation stage is duplicated across all
>the instances of any parallel MP3 or MP4 run. Every process
>performs the same work and requires its own copy of the
>file containing the transformed integrals. This will be changed
>in future releases.
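
To make the disk-space consequence concrete, here is a minimal sketch of the arithmetic. It assumes only what is stated above, namely that each parallel process keeps its own copy of the transformed-integral files. The function name and the 10 GB figure are hypothetical, chosen purely for illustration.

```python
def scratch_needed_gb(n_processes, size_per_copy_gb):
    """Estimate total scratch space when every process keeps its own copy.

    Assumption: the files are fully duplicated per process, so the total
    grows linearly with the number of processes and is independent of
    the number of threads inside each process.
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    return n_processes * size_per_copy_gb


# Hypothetical 10 GB per copy: 8 processes versus 2 processes with 4 threads each.
print(scratch_needed_gb(8, 10.0))
print(scratch_needed_gb(2, 10.0))
```

Under this assumption, trading processes for threads is what reduces the disk footprint, which matches the observation in the question quoted below.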
>The MP4(SDQ) part of MP4 calculations is completely multithreaded.
>However, different parts of this code scale differently with the
>number of working threads. The code is good, but it was written
>many years ago and was not designed to run in parallel at that
>time. Most likely, this will be changed in the future; however,
>the current code just duplicates the SDQ part of the computation
>in every process when running in parallel.
>The MP4(T) part of the code is the most advanced (and time-consuming) one.
>It runs in parallel with a variety of available submodes of execution.
>It is also completely threaded, and one can use both levels of
>parallelism at once. Moreover, there is a special option of the MP4
>code ($mp4 tonly=.t. $end) to perform just the MP4(T) calculation,
>skipping the SDQ part and some unneeded steps of the integral transformation.
>The optimal way to perform large-scale calculations is thus
>a bit tedious.
>First, one needs to perform a parallel SCF calculation to get
>the converged SCF vectors. The SCF step can be either direct or
>conventional, depending on which is faster on your hardware.
>Second, one needs to perform the MP4(SDQ) calculation on a standalone
>SMP system using the thread-level parallelized code. One needs to provide
>the converged SCF vectors ($vec group) as the initial guess. Again,
>this can be either a direct or a conventional calculation.
>Finally, one needs to run the MP4(T) part in the mixed parallel+threaded
>model using as many nodes as possible. One may provide the converged
>SCF vectors ($vec group) as the initial guess. Once again, this can
>be either a direct or a conventional calculation. The optimal number of
>threads per process depends on the particular computer architecture.
>In most cases, it is equal to the number of physical cores
>sharing the same memory domain on a cc-NUMA system. E.g., assuming
>you have a 2-way, four-core Xeon 5500 system, the best way is to use 2
>processes per box with four working threads each. However, this can
>be adjusted to minimize I/O or scratch storage, and the code is
>flexible here. Firefly version 7.1.G supports dynamic load
>balancing for the MP4(T) part, both between threads and between
>processes, so it is wise to activate the p2p and dlb options.
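
The rule of thumb above (one process per memory domain, one thread per physical core in that domain) can be sketched as follows. This is an illustration only: it assumes that each socket forms exactly one memory domain, and the function name and the optional `max_processes` cap are hypothetical, added to show the trade-off against scratch storage mentioned above.

```python
def mp4t_layout(sockets, cores_per_socket, max_processes=None):
    """Suggest (processes_per_box, threads_per_process) for one box.

    Default: one process per socket, one thread per physical core of
    that socket (assumes one cc-NUMA memory domain per socket).
    If max_processes is given, fewer processes are used and each gets
    more threads, which lowers the number of per-process file copies
    at the cost of threads spanning several memory domains.
    """
    if sockets < 1 or cores_per_socket < 1:
        raise ValueError("sockets and cores_per_socket must be positive")
    processes = sockets
    if max_processes is not None:
        if max_processes < 1:
            raise ValueError("max_processes must be positive")
        processes = min(sockets, max_processes)
    total_cores = sockets * cores_per_socket
    threads = total_cores // processes
    return processes, threads


# The 2-way, four-core Xeon 5500 example from the text.
print(mp4t_layout(2, 4))
# An 8-socket, four-core box limited to 2 processes to save scratch space.
print(mp4t_layout(8, 4, max_processes=2))
```

Whether the capped layout is actually faster depends on the hardware, so it is worth timing a short run with each setting.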
>Hope this helps.
>On Wed Mar 3 '10 0:59am, Jiri Wiesner wrote
>>Dear Firefly users and developers,
>>I would like to run Firefly on a machine with 8 processors, each with 4 cores, under 64bit/32bit emul. Debian GNU/Linux 5.0.3, kernel 2.6.30, libc 2.7. I'm using the Open MPI version of Firefly, as Open MPI version 1.2.9 is installed on our cluster. Our 32 CPU machine has only one RAID device where I can store integrals during the MP4 integral transformation.
>>And here is the obstacle: When I use 8 parallel processes through OMPI (mpirun -np 8), I soon run out of disk space. When I use the MKLNP=8 flag for SMP, the first HF computation and some other parts run on 1 CPU (slow on larger systems with large basis sets). I chose to run 2 OMPI processes and use MKLNP=4. In this case the space for the DASORT and MOINTS files is twice as large, but the machine can manage that.
>>My question is: Is there any better way to run on one node on as many CPUs as possible while using as little disk space as possible? Is it possible to force MPI processes to use a single set of integral files?
>>Thank you for your answers.