Firefly and PC GAMESS-related discussion club





Re^4: Firefly 7.1G with CUDA 2.3

Ruslan
muhamadejev@gmail.com


Thank you for the help with using CUDA on the GTX 260.
But does it use only 64 cores, or have I misread the output file?

Here is part of my output:

Loaded CUDA/Firefly interface version 0x00010004

Dev #    Name          DP capable  Async  Timeout  Emulated  Useable
 0   GeForce GTX 260      Yes      Yes       No       No      Yes


Initializing   1 CUDA-enabled threads (devices: 0x00000001) in the background

CUDA driver version : 0x00000BC2, runtime version : 0x00000BC2

CUDA init completed.

DGEMM will use                                   4 threads.

Matrix diagonalization and inversion will use    4 threads.

SMP/multicore aware parts of program will use    4 threads.

CUDA aware parts of program will use             1 devices.

Creating thread pool to serve up to             64 threads.
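
The hexadecimal version fields in the output above can be read with a short sketch like the one below. It assumes, and this is only an assumption, that Firefly prints the raw integer returned by the CUDA version queries, which the CUDA runtime documents as 1000*major + 10*minor. The function name `decode_cuda_version` is hypothetical and is not part of Firefly or CUDA.

```python
def decode_cuda_version(raw):
    """Split a CUDA-style version integer into (major, minor).

    Assumes the CUDA runtime convention: raw = 1000 * major + 10 * minor.
    Whether Firefly prints exactly this integer is an assumption.
    """
    major = raw // 1000
    minor = (raw % 1000) // 10
    return major, minor


# Value copied from the "CUDA driver version" line of the output above.
major, minor = decode_cuda_version(0x00000BC2)
print("CUDA %d.%d" % (major, minor))
```

Under that assumption, 0x00000BC2 is 3010 in decimal.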

What does «64 threads» mean?

Also, does CUDA speed up the calculation only when mp=4? And why didn't the GTX 260 get hotter (compared with other CUDA-based applications)?


Regards,
Ruslan Muhamadejev.

On Tue Feb 16 '10 7:03pm, Alex Granovsky wrote
----------------------------------------------
>Dear Veinardi,

>Sorry for not documenting CUDA-related options for such a long time.
>The systems we used have multiple CUDA devices and thus to get
>the optimal performance we used lots of Firefly-CUDA specific options.

>They are not needed in your case. The proper input file would be simply:

>

 $contrl scftyp=rhf mplevl=4 runtyp=energy icharg=-1 $end                       
 $system mwords=140 $end 
! to use four cores for dgemm (mklnp) while only three CPU working threads otherwise (np)
 $system mklnp=4 np=3 $end
! to allow CUDA support and CUDA working threads 
 $smp cuda=.t. $end            
! cumask is the bitmask of available CUDA devices to use 
! (default is -1 i.e. to use the first available CUDA device)
 $cuda cumask=0x1 $end
 $basis gbasis=n311 ngauss=6 npfunc=2 ndfunc=2 diffsp=.t. $end                  
 $scf dirscf=1 $end                                                             
 $mp4 sdtq=1 $end                                               
 $data

 C1                                                                             
 CL         17.0  -0.3520333657  -0.1650980028  -0.0471638329                   
 H           1.0   0.7972862956  -1.4281273262  -1.2294694043                   
 H           1.0  -1.5889984100   1.0741294712  -1.1576600606                   
 H           1.0  -1.5392863051  -1.2871081468   1.2265482694                   
 H           1.0   0.7428387888   0.9385578775   1.0969088438                   
 F           9.0   1.3109647391  -2.0051605917  -1.7641455975                   
 F           9.0  -2.1531089472   1.6362757807  -1.6569055973                   
 F           9.0  -2.0787314212  -1.7943015141   1.8058474018                   
 F           9.0   1.2619562758   1.4885572294   1.6876526453                   
 H           1.0   2.8342698106   1.9875432625   1.3734980540                   
 F           9.0   3.7058425393   2.3127319602   1.2848892783                   
 $end                                                                           
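
As a rough illustration of how a device bitmask such as cumask=0x1 (or the "devices: 0x00000001" field in the Firefly output) maps to device numbers, here is a minimal sketch. It assumes that bit i of the mask selects CUDA device #i, which matches the single-device case shown in this thread but is not confirmed for Firefly in general. The helper names are hypothetical, and the special default value -1 is deliberately not modeled.

```python
def devices_from_mask(mask, max_devices=32):
    """Return the device indices whose bits are set in a non-negative mask.

    Assumption: bit i of the mask corresponds to CUDA device #i.
    """
    if mask < 0:
        raise ValueError("negative masks (e.g. the default -1) are not modeled")
    return [i for i in range(max_devices) if (mask >> i) & 1]


def mask_from_devices(devices):
    """Build a bitmask with one bit set for each listed device index."""
    mask = 0
    for i in devices:
        mask |= 1 << i
    return mask


print(devices_from_mask(0x1))            # mask used in the input file above
print(hex(mask_from_devices([0, 1])))    # hypothetical two-device system
```

With this reading, cumask=0x1 selects only device 0, which agrees with the "Dev # 0" row reported for the GTX 260.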

>Note that you should not run it in parallel on a standalone
>computer system, as this run uses multithreading rather than
>MPI or P2P level parallelism.

>Finally, you need to disable TDR (the text below is taken from CUDA_Release_Notes_2.2.txt):

>

o Individual kernels are limited to a 2-second runtime by Windows
  Vista. Kernels that run for longer than 2 seconds will trigger
  the Timeout Detection and Recovery (TDR) mechanism. For more
  information, see
  http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

  GPUs without a display attached are not subject to the 2 second
  runtime restriction. For this reason it is recommended that
  CUDA be run on a GPU that is NOT attached to a display and
  does not have the Windows desktop extended onto it. In this
  case, the system must contain at least one NVIDIA GPU that
  serves as the primary graphics adapter. Thus, for devices like S1070
  that do not have an attached display, users may disable the Windows TDR
  timeout. Disabling the TDR timeout will allow kernels to run for
  extended periods of time without triggering an error.

  The following is an example .reg script:
  
    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
    "TdrLevel"=dword:00000000

>Hope this helps.

>Regards,
>Alex

>P.S. The trap address is outside of Firefly's code and is most
>likely inside CUDA or other NVIDIA DLLs. Indeed, it is our
>experience that CUDA libs/drivers (at least their Windows
>implementation) do not like it when multiple processes initialize
>a CUDA device at the same time, e.g. sometimes this results in a BSOD...


Tue Aug 10 '10 1:01pm