Firefly and PC GAMESS-related discussion club

Re^5: Firefly 7.1G with CUDA 2.3

Alex Granovsky

Hi Ruslan,

> Creating thread pool to serve up to             64 threads.
>What does «64 threads» mean?

The thread pool is for CPU threads and is not specifically related to CUDA.
Thread pooling is a technique that reuses idle threads from a
pool instead of creating new ones, which allows
faster processing in some cases. Firefly typically uses many threads, especially
when running under Windows. Firefly version 7.1.G adds the capability
to pool CPU threads for future reuse. For example, CUDA-enabled threads
must be pooled, with each thread corresponding to a separate CUDA
device (note that you have just a single device). Other threads (I/O,
compute, etc.) are now pooled by default as well.
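
To illustrate the general idea of thread pooling (not Firefly's actual
implementation, which is not shown here), below is a minimal Python sketch
using the standard concurrent.futures module. The function name run_tasks
and the constant MAX_POOLED_THREADS are hypothetical, chosen only to mirror
the tpool=64 setting: many small jobs are served by a bounded set of reused
threads rather than by one new thread per job.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical constant mirroring tpool=64; it is not read from Firefly.
MAX_POOLED_THREADS = 64


def run_tasks(n_tasks, max_workers=MAX_POOLED_THREADS):
    """Run n_tasks trivial jobs on a pool; return the set of thread ids used."""

    def task(_):
        # Each job only reports which pooled thread executed it.
        return threading.get_ident()

    # The executor creates at most max_workers threads and reuses them.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return set(pool.map(task, range(n_tasks)))


if __name__ == "__main__":
    used = run_tasks(1000, max_workers=4)
    # 1000 jobs were served by no more than 4 distinct threads.
    print(len(used) <= 4)
```

The upper bound plays the same role as the pool size described above: it
limits how many threads are kept for reuse, not how much work can be done.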

Pooling requires some additional resources for its internal tables,
and tpool=64 defines the maximum number of threads that can be
pooled (stored in the pool). This feature can be completely disabled
by setting $smp tpool=0. However, this will also disable CUDA support.

Note that at present CUDA is used only for MP4 calculations, and
only when running under Windows.

Hope this helps.

Alex Granovsky

On Tue Aug 10 '10 1:01pm, Ruslan wrote
>Thank you for the help with using CUDA on the GTX 260.
>But does it use only 64 cores, or have I misunderstood the output file?

>Here is part of my output:

>Loaded CUDA/Firefly interface version 0x00010004

> Dev #    Name          DP capable  Async  Timeout  Emulated  Useable
>  0   GeForce GTX 260      Yes      Yes       No       No      Yes
> Initializing   1 CUDA-enabled threads (devices: 0x00000001) in the background

> CUDA driver version : 0x00000BC2, runtime version : 0x00000BC2

> CUDA init completed.

> DGEMM will use                                   4 threads.

> Matrix diagonalization and inversion will use    4 threads.

> SMP/multicore aware parts of program will use    4 threads.

> CUDA aware parts of program will use             1 devices.

> Creating thread pool to serve up to             64 threads.

>What does «64 threads» mean?

>Also, does CUDA boost calculation speed only when mp=4? And why didn't the GTX 260 get hotter (compared with other CUDA-based applications)?
>Ruslan Muhamadejev.

>On Tue Feb 16 '10 7:03pm, Alex Granovsky wrote
>>Dear Veinardi,

>>Sorry for not documenting CUDA-related options for such a long time.
>>The systems we used have multiple CUDA devices and thus to get
>>the optimal performance we used lots of Firefly-CUDA specific options.

>>They are not needed in your case. The proper input file would be simply:


 $contrl scftyp=rhf mplevl=4 runtyp=energy icharg=-1 $end                       
 $system mwords=140 $end 
! to use four cores for dgemm (mklnp) while only three CPU working threads otherwise (np)
 $system mklnp=4 np=3 $end
! to allow CUDA support and CUDA working threads 
 $smp cuda=.t. $end            
! cumask is the bitmask of available CUDA devices to use 
! (default is -1 i.e. to use the first available CUDA device)
 $cuda cumask=0x1 $end
 $basis gbasis=n311 ngauss=6 npfunc=2 ndfunc=2 diffsp=.t. $end                  
 $scf dirscf=1 $end                                                             
 $mp4 sdtq=1 $end                                               

 CL         17.0  -0.3520333657  -0.1650980028  -0.0471638329                   
 H           1.0   0.7972862956  -1.4281273262  -1.2294694043                   
 H           1.0  -1.5889984100   1.0741294712  -1.1576600606                   
 H           1.0  -1.5392863051  -1.2871081468   1.2265482694                   
 H           1.0   0.7428387888   0.9385578775   1.0969088438                   
 F           9.0   1.3109647391  -2.0051605917  -1.7641455975                   
 F           9.0  -2.1531089472   1.6362757807  -1.6569055973                   
 F           9.0  -2.0787314212  -1.7943015141   1.8058474018                   
 F           9.0   1.2619562758   1.4885572294   1.6876526453                   
 H           1.0   2.8342698106   1.9875432625   1.3734980540                   
 F           9.0   3.7058425393   2.3127319602   1.2848892783                   
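
As a rough illustration of how a device bitmask such as cumask=0x1 (or the
"devices: 0x00000001" value in the log) can be read, here is a minimal Python
sketch. It assumes that bit i of the mask selects CUDA device #i; that mapping
is an assumption, and the function name devices_from_mask is hypothetical and
not part of Firefly. The special default value -1 mentioned in the input
comments is not modeled and is rejected explicitly.

```python
def devices_from_mask(mask, max_devices=32):
    """Return the device indices whose bits are set in a non-negative mask.

    Assumption: bit i of the mask corresponds to CUDA device #i.
    """
    if mask < 0:
        # -1 is a special default in the input comments; not handled here.
        raise ValueError("negative masks are special values, not bit patterns")
    return [i for i in range(max_devices) if (mask >> i) & 1]


if __name__ == "__main__":
    print(devices_from_mask(0x1))  # [0]
    print(devices_from_mask(0x5))  # [0, 2]
```

Under this reading, cumask=0x1 selects only device 0, which matches the
single GeForce GTX 260 listed as device 0 in the log.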

>>Note that you should not run it in parallel on a standalone
>>computer system, as this run uses multithreading rather than
>>MPI or P2P level parallelism.

>>Finally, you need to disable TDR (the text below is taken from CUDA_Release_Notes_2.2.txt):


o Individual kernels are limited to a 2-second runtime by Windows
  Vista. Kernels that run for longer than 2 seconds will trigger
  the Timeout Detection and Recovery (TDR) mechanism. For more
  information, see

  GPUs without a display attached are not subject to the 2 second
  runtime restriction. For this reason it is recommended that
  CUDA be run on a GPU that is NOT attached to a display and
  does not have the Windows desktop extended onto it. In this
  case, the system must contain at least one NVIDIA GPU that
  serves as the primary graphics adapter. Thus, for devices like S1070
  that do not have an attached display, users may disable the Windows TDR
  timeout. Disabling the TDR timeout will allow kernels to run for
  extended periods of time without triggering an error.

  The following is an example .reg script:
    Windows Registry Editor Version 5.00


>>Hope this helps.


>>P.S. The trap address is outside of Firefly's code and is most
>>likely inside CUDA or other NVIDIA DLLs. Indeed, it's our
>>experience that CUDA libs/drivers (at least their Windows
>>implementation) do not like it when multiple processes initialize
>>a CUDA device at the same time, e.g. sometimes this results in a BSOD...

[ This message was edited on Sun Aug 15 '10 at 7:18pm by the author ]
