Firefly and PC GAMESS-related discussion club

Re: MP4(SDTQ) with CUDA

Alex Granovsky
gran@classic.chem.msu.su

Dear Ivan,

Sorry for the late reply. There were some problems with my home
computers that prevented me from replying earlier.

For CPU-only tests on a standalone computer, the best strategy
is most likely to run a single Firefly process with several
worker threads.
This minimizes the overall amount of concurrent I/O,
and the (T) part is threaded very efficiently.

> be done in fewer passes. Is this true? If I use mklnp=9 np=6 and
> run in serial mode, CPU usage is actually 600% but I haven't got
> energy output after waiting for 4 hours.

This is because of errors in your input file. Normally, you cannot
set mklnp to 9 on a six-core system. You can find more information
on how Firefly treats logical cores here:

http://classic.chem.msu.su/gran/gamess/smp.html

and by searching this forum.

You'll get the optimal performance by setting mklnp=6 and np=6
while leaving other options at their default settings.
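
As a rough illustration of the point above, here is a minimal sanity check one could run before starting a job. It is only a sketch: the helper name check_thread_settings is hypothetical, and the rule it applies (neither mklnp nor np should exceed the number of physical cores) is a simplifying assumption drawn from the advice in this post, not Firefly's actual validation logic. The physical core count is passed in explicitly because the operating system usually reports logical (HT) cores.

```python
def check_thread_settings(mklnp, np, physical_cores):
    """Return warnings for thread counts above the physical core count.

    Assumption (not Firefly's own logic): mklnp and np should each be
    no larger than the number of physical cores on the machine.
    """
    warnings = []
    if mklnp > physical_cores:
        warnings.append(
            f"mklnp={mklnp} exceeds {physical_cores} physical cores"
        )
    if np > physical_cores:
        warnings.append(
            f"np={np} exceeds {physical_cores} physical cores"
        )
    return warnings


# Settings from the original question on a six-core machine.
print(check_thread_settings(mklnp=9, np=6, physical_cores=6))
# Settings suggested in this reply.
print(check_thread_settings(mklnp=6, np=6, physical_cores=6))
```

With the settings from the original question the check reports one warning for mklnp; with mklnp=6 and np=6 it reports none.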

For GPU tests, there are multiple possible scenarios, and
the best one has to be found experimentally. Most likely,
the best is to use two processes, each with three
GPU-controlling threads and three CPU-compute threads. Please let
me know if you are still interested and I'll provide details
on the input file options.

Kind regards,
Alex Granovsky

On Mon Apr 22 '13 4:31pm, Ivan Fedyanin wrote
---------------------------------------------
>Dear all,

>I'd like to test Firefly/CUDA performance in an MP4(SDTQ) calculation on a Linux Core i7 workstation with 6 cores (+6 via HT?), 24 GiB RAM and three GTX 480 cards. I've already done a CPU-only calculation with memory=450000000 and 6 processes via MPI; it took approx. 2 hours with the latest 8.x beta version of Firefly.

>Now I'm in doubt about what I should use for mklnp, np and/or other keys, even undocumented ones, to get better performance and to use the GPU cards in shared mode(?). As far as I understand, some parts of the input should look like

>$cuda cumask=7 $end
>$smp httnp=1(2?) cuda=.t. $end

>Also, I believe that the usage of parallel execution here is crucial, because more memory may be allocated and integral transformations may be done in fewer passes. Is this true? If I use mklnp=9 np=6 and run in serial mode, CPU usage is actually 600% but I haven't got energy output after waiting for 4 hours.

>The system has NCORE= 10 NOCC= 38 NAOS= 366 if it matters.
>

Sat Apr 27 '13 6:56pm
