Ceilidh ... Re: CUDA on SMP running in parallel

Firefly and PC GAMESS-related discussion club

Hi,

As to memory allocation, it seems to be a peculiarity of your
Windows 7 installation. I have just checked under W7 Professional
with all the latest updates installed, and the memory limit is ca.
268 MWords, as can be expected for Firefly running under 64-bit
Windows (there are several threads on the forum discussing various
aspects of memory allocation under Windows and Linux).

As to CUDA functionality, it has not been completely documented so
far, so I should definitely provide some basic information here.

First, each allotted CUDA device uses at least one dedicated CPU
thread (one thread with Firefly 7.1.G and optionally up to 32 threads
with the current beta of Firefly v. 8.0.0). The number of threads
used by MKL is either just one or the number specified by mklnp,
with dynamic switching between these two modes as needed for
better performance. The number of compute CPU threads is just
one for most of the code, but it is equal to np*x during
execution of threaded code. The default for np is the value of
mklnp; however, np can be set separately:

 $system mklnp=... np=... $end

Note that the np variable is not related to the number of
processes created using the -np command line argument to Firefly
or to mpirun/mpiexec. Moreover, it can be set individually for each
Firefly instance of the entire parallel process.
The value of x is equal to either one or two. The latter
value is used on systems with Hyper-Threading enabled, and only in
the sections of code that benefit from the use of Hyper-Threading.
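
The counting rule above can be written down as a small sketch. This is
only an illustration of the arithmetic described in this post, not
Firefly code: the function name and its arguments are hypothetical, and
it assumes x is 2 whenever Hyper-Threading is enabled (in reality that
holds only in the code sections that benefit from it).

```python
def compute_threads(mklnp, np_threads=None, hyper_threading=False):
    """Return (serial, threaded) compute-thread counts for one instance.

    Hypothetical helper mirroring the rule in the post:
      - most of the code runs on a single compute thread;
      - threaded code runs on np*x compute threads;
      - np defaults to mklnp unless set separately;
      - x is 1, or 2 with Hyper-Threading (assumed to apply here).
    """
    if np_threads is None:
        np_threads = mklnp          # default: np follows mklnp
    x = 2 if hyper_threading else 1
    return 1, np_threads * x


# mklnp=4 with np left at its default, no Hyper-Threading
print(compute_threads(4))                        # (1, 4)
# mklnp=4 with np set separately to 3
print(compute_threads(4, np_threads=3))          # (1, 3)
# mklnp=2 on a Hyper-Threading system
print(compute_threads(2, hyper_threading=True))  # (1, 4)
```

Dedicated CUDA threads and I/O threads are deliberately not included
in these numbers, since they are not compute CPU threads.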

In addition, there are typically some extra threads used for
communications over p2p and for asynchronous disk I/O. Note that
these threads (as well as dedicated CUDA threads) are not CPU
compute threads! Normally, I/O threads consume only a minor part of
the available CPU resources and are mainly in the blocked state,
waiting for some event. However, each CUDA thread consumes almost
all resources of a single CPU core while executing CUDA-enabled code.

Hence, when running your particular job as a single process, for
optimal performance you need to set mklnp to 4 (as our standard BLAS
operations do not use CUDA at all), and np to 3 (the fourth thread
is the dedicated CUDA thread, assuming a single CUDA device).

At the same time, when running it in parallel, you should use no more
than 4 MKL threads and no more than 4 (CPU plus CUDA) threads overall.
With Firefly 7.1.G, there is no way to enable the use of CUDA for only
some of the processes, hence you are effectively using two
CUDA-dedicated threads sharing the same CUDA device, one per Firefly
instance. This limitation was removed in Firefly 8.0.0, so you may
consider applying for the beta version of Firefly 8.0.0.
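
To make the thread budget concrete, here is a rough sketch of the
bookkeeping for a quad-core system without Hyper-Threading. It is only
an illustration under stated assumptions: the helper names are
hypothetical, each Firefly instance is assumed to keep one busy thread
per CUDA device (as with Firefly 7.1.G), and the mostly blocked I/O
and p2p threads are ignored.

```python
def busy_threads(instances, np_threads, cuda_devices, x=1):
    """Total CPU-hungry threads across all Firefly instances.

    Assumptions (illustrative only):
      - each instance runs np*x compute threads in threaded code;
      - each instance adds one dedicated thread per CUDA device;
      - I/O and p2p threads are mostly blocked and not counted.
    """
    per_instance = np_threads * x + cuda_devices
    return instances * per_instance


def oversubscribed(total_threads, cores):
    """True if there are more busy threads than physical cores."""
    return total_threads > cores


CORES = 4  # quad-core CPU, Hyper-Threading disabled

# Single process, np=3, one CUDA device: 3 compute + 1 CUDA thread
single = busy_threads(instances=1, np_threads=3, cuda_devices=1)
print(single, oversubscribed(single, CORES))      # 4 False

# Two processes, np=2 each, both sharing the one CUDA device
parallel = busy_threads(instances=2, np_threads=2, cuda_devices=1)
print(parallel, oversubscribed(parallel, CORES))  # 6 True
```

Under these assumptions, the two-process setup from the question ends
up with six busy threads on four cores, while the single-process
setup fits exactly.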

Indeed, by default two Firefly instances running on the same
system share the same set of CUDA devices. This can be fine-tuned
or disabled, esp. with Firefly 8.0.0. Sharing of pre-Fermi CUDA
devices (such as your GTX 285) is not efficient and results in
severe performance degradation due to architectural limitations
of these GPUs.

Sharing of Fermi-based boards can result in some increase of
the overall performance; however, this strongly depends on the
particular version of cuBLAS. More precisely, version 3.1 performs
better in shared mode, while the use of version 3.2 in shared mode
causes a severe performance penalty.

Finally, W7 does not perform well on simultaneous disk I/O,
esp. if the same disk is used by all Firefly instances.
It also does not perform well enough in the case of core
oversubscription, i.e., if the number of active threads is larger
than the number of cores.

Hope this helps.
Alex Granovsky

On Sun Dec 5 '10 8:48pm, Alexander wrote
----------------------------------------
>Hi,

>I'm running large MP4(SDTQ) calculations in parallel on an SMP system (quad-core i7 with HTT disabled, 6GB RAM, Windows 7 x64) using two processes (2 threads each, i.e. -NP 2 and MKLNP=2) and a single GTX 285 card. Here are a couple of questions:

>1) For some reason I cannot use more than 200MW RAM. When I set the memory limit above 200MW (which I can afford on my system), Firefly run terminates reporting that this amount of memory is not available. Note that this error appears irrespective of the way I run Firefly (i.e. parallel or serial) and of the number of cores used. Is this an inherent limitation of 32-bit Firefly, or am I doing something wrong here?

>2) When running MP4(SDTQ) in parallel using a single CUDA device, the system behaves rather strangely. The serial SMP version works well and provides a substantial acceleration due to the use of CUDA. In these circumstances the CPU load is ca. 100% and GPU load is ca. 60-70% for the triples part. When I try to run the same job in parallel (-NP 2 and MKLNP=2) to speed up the HF part of the calculation, the system is barely responding. Despite the lower CPU load compared to the serial mode, the system is almost impossible to use, and there is a substantial performance degradation for the triples part of the calculation. So my question is: how is a single CUDA device handled during the parallel execution on an SMP system? Is it possible that the problems arise due to both processes trying to use a single CUDA device simultaneously?

Mon Dec 6 '10 7:33pm
