Firefly and PC GAMESS-related discussion club


Re^5: Firefly 7.1G with CUDA 2.3

Alex Granovsky

Hi Ruslan,

> Creating thread pool to serve up to             64 threads.
>What does «64 threads» mean?

The thread pool is for CPU threads and is not, in general, related to CUDA.
Thread pooling is a technique that allows old threads to be reused
from the pool instead of creating new ones, which speeds up
processing in some cases. Firefly typically uses many threads, especially
when running under Windows. Firefly version 7.1.G adds the capability
to pool CPU threads for future reuse. For example, CUDA-enabled threads
must be pooled, with each thread corresponding to a separate CUDA
device (note that you have just a single device). Other threads (I/O,
compute, etc.) are now pooled by default as well.
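
As an illustration of the idea (a minimal Python sketch, not Firefly's actual implementation), pooling means the same worker thread services successive tasks instead of a new thread being created and torn down each time:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# A pool with one worker: both tasks below run on the same OS thread,
# demonstrating thread reuse (the point of pooling).
with ThreadPoolExecutor(max_workers=1) as pool:
    first = pool.submit(threading.get_ident).result()
    second = pool.submit(threading.get_ident).result()

print(first == second)  # the pooled thread is reused -> True
```

Avoiding repeated thread creation/destruction is what makes pooling pay off when a program, like Firefly, spawns many short-lived workers.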

Pooling requires some additional resources for its internal tables;
tpool=64 defines the maximum number of threads that can be
stored in the pool. The feature can be disabled completely
by setting $smp tpool=0. Note, however, that this also disables CUDA support.
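
In input-file form (using the $smp group syntax shown in the example below; tpool=64 is presumably the default, since your run used it without being asked), the two settings mentioned here would be:

```
 $smp tpool=64 $end   ! pool up to 64 threads (the value seen in your output)
 $smp tpool=0 $end    ! disable pooling entirely (also disables CUDA support)
```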

Note that, at present, CUDA is used only for MP4 calculations, and
only when running under Windows.

Hope this helps.

Alex Granovsky

On Tue Aug 10 '10 1:01pm, Ruslan wrote
>Thank you for the help with using CUDA on the GTX 260.
>But does it use only 64 cores, or did I misunderstand the output file?

>Here is part from my output:

>Loaded CUDA/Firefly interface version 0x00010004

> Dev #    Name          DP capable  Async  Timeout  Emulated  Useable
>  0   GeForce GTX 260      Yes      Yes       No       No      Yes
> Initializing   1 CUDA-enabled threads (devices: 0x00000001) in the background

> CUDA driver version : 0x00000BC2, runtime version : 0x00000BC2

> CUDA init completed.

> DGEMM will use                                   4 threads.

> Matrix diagonalization and inversion will use    4 threads.

> SMP/multicore aware parts of program will use    4 threads.

> CUDA aware parts of program will use             1 devices.

> Creating thread pool to serve up to             64 threads.

>What does «64 threads» mean?

>Also, does CUDA boost calculation speed only when mp=4? And why didn't the GTX 260 get hotter (compared with other CUDA-based applications)?
>Ruslan Muhamadejev.

>On Tue Feb 16 '10 7:03pm, Alex Granovsky wrote
>>Dear Veinardi,

>>Sorry for not documenting the CUDA-related options for so long.
>>The systems we used have multiple CUDA devices, and thus to get
>>optimal performance we used many Firefly-CUDA-specific options.

>>They are not needed in your case. The proper input file would simply be:


 $contrl scftyp=rhf mplevl=4 runtyp=energy icharg=-1 $end                       
 $system mwords=140 $end 
! use four cores for DGEMM (mklnp) but only three CPU working threads otherwise (np)
 $system mklnp=4 np=3 $end
! to allow CUDA support and CUDA working threads 
 $smp cuda=.t. $end            
! cumask is the bitmask of available CUDA devices to use 
! (default is -1 i.e. to use the first available CUDA device)
 $cuda cumask=0x1 $end
 $basis gbasis=n311 ngauss=6 npfunc=2 ndfunc=2 diffsp=.t. $end                  
 $scf dirscf=1 $end                                                             
 $mp4 sdtq=1 $end                                               

 CL         17.0  -0.3520333657  -0.1650980028  -0.0471638329                   
 H           1.0   0.7972862956  -1.4281273262  -1.2294694043                   
 H           1.0  -1.5889984100   1.0741294712  -1.1576600606                   
 H           1.0  -1.5392863051  -1.2871081468   1.2265482694                   
 H           1.0   0.7428387888   0.9385578775   1.0969088438                   
 F           9.0   1.3109647391  -2.0051605917  -1.7641455975                   
 F           9.0  -2.1531089472   1.6362757807  -1.6569055973                   
 F           9.0  -2.0787314212  -1.7943015141   1.8058474018                   
 F           9.0   1.2619562758   1.4885572294   1.6876526453                   
 H           1.0   2.8342698106   1.9875432625   1.3734980540                   
 F           9.0   3.7058425393   2.3127319602   1.2848892783                   
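
To illustrate the cumask semantics noted in the comments above (a hedged Python sketch of ordinary bitmask decoding, not Firefly code): bit i of the mask selects CUDA device i, so cumask=0x1 selects device 0, matching the "devices: 0x00000001" line in the output.

```python
def devices_from_mask(mask: int, max_devices: int = 32):
    """Return the list of device indices whose bit is set in the mask."""
    return [i for i in range(max_devices) if (mask >> i) & 1]

print(devices_from_mask(0x1))  # [0]      - device 0 only, as in cumask=0x1
print(devices_from_mask(0x5))  # [0, 2]   - a hypothetical two-device mask
```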

>>Note that you should not run it in parallel on a standalone
>>computer system, as this run uses multithreading rather than
>>MPI or P2P-level parallelism.

>>Finally, you need to disable TDR (the text below is taken from CUDA_Release_Notes_2.2.txt):


o Individual kernels are limited to a 2-second runtime by Windows
  Vista. Kernels that run for longer than 2 seconds will trigger
  the Timeout Detection and Recovery (TDR) mechanism. For more
  information, see

  GPUs without a display attached are not subject to the 2 second
  runtime restriction. For this reason it is recommended that
  CUDA be run on a GPU that is NOT attached to a display and
  does not have the Windows desktop extended onto it. In this
  case, the system must contain at least one NVIDIA GPU that
  serves as the primary graphics adapter. Thus, for devices like S1070
  that do not have an attached display, users may disable the Windows TDR
  timeout. Disabling the TDR timeout will allow kernels to run for
  extended periods of time without triggering an error.

  The following is an example .reg script:
    Windows Registry Editor Version 5.00


>>Hope this helps.


>>P.S. The trap address is outside of Firefly's code and is most
>>likely inside CUDA or other NVidia DLLs; indeed, in our
>>experience the CUDA libs/drivers (at least their Windows
>>implementations) do not like it when multiple processes initialize
>>a CUDA device at the same time - sometimes this even results in a BSOD...

[ This message was edited on Sun Aug 15 '10 at 7:18pm by the author ]
