Here is part of my output:
Loaded CUDA/Firefly interface version 0x00010004
Dev #  Name             DP capable  Async  Timeout  Emulated  Useable
0      GeForce GTX 260  Yes         Yes    No       No        Yes
Initializing 1 CUDA-enabled threads (devices: 0x00000001) in the background
CUDA driver version : 0x00000BC2, runtime version : 0x00000BC2
CUDA init completed.
DGEMM will use 4 threads.
Matrix diagonalization and inversion will use 4 threads.
SMP/multicore aware parts of program will use 4 threads.
CUDA aware parts of program will use 1 devices.
Creating thread pool to serve up to 64 threads.
What does «64 threads» mean?
Also, does CUDA boost calculation speed only when mp=4? And why didn't the GTX 260 get hotter (compared with other CUDA-based applications)?
On Tue Feb 16 '10 7:03pm, Alex Granovsky wrote
>Sorry for not documenting the CUDA-related options for so long.
>The systems we used have multiple CUDA devices and thus to get
>the optimal performance we used lots of Firefly-CUDA specific options.
>They are not needed in your case. The proper input file would be simply:
 $contrl scftyp=rhf mplevl=4 runtyp=energy icharg=-1 $end
 $system mwords=140 $end
! to use four cores for dgemm (mklnp) while only three CPU working threads otherwise (np)
 $system mklnp=4 np=3 $end
! to allow CUDA support and CUDA working threads
 $smp cuda=.t. $end
! cumask is the bitmask of available CUDA devices to use
! (default is -1 i.e. to use the first available CUDA device)
 $cuda cumask=0x1 $end
 $basis gbasis=n311 ngauss=6 npfunc=2 ndfunc=2 diffsp=.t. $end
 $scf dirscf=1 $end
 $mp4 sdtq=1 $end
 $data

C1
CL 17.0  -0.3520333657  -0.1650980028  -0.0471638329
H   1.0   0.7972862956  -1.4281273262  -1.2294694043
H   1.0  -1.5889984100   1.0741294712  -1.1576600606
H   1.0  -1.5392863051  -1.2871081468   1.2265482694
H   1.0   0.7428387888   0.9385578775   1.0969088438
F   9.0   1.3109647391  -2.0051605917  -1.7641455975
F   9.0  -2.1531089472   1.6362757807  -1.6569055973
F   9.0  -2.0787314212  -1.7943015141   1.8058474018
F   9.0   1.2619562758   1.4885572294   1.6876526453
H   1.0   2.8342698106   1.9875432625   1.3734980540
F   9.0   3.7058425393   2.3127319602   1.2848892783
 $end
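To illustrate the cumask comment in the input above, here is a minimal Python sketch of how such a device bitmask could be built and decoded. It assumes that bit i of the mask selects CUDA device i, which is consistent with the "devices: 0x00000001" line in the output for device 0 but is not confirmed by any Firefly documentation quoted here. The helper names are hypothetical, and the special default value -1 is not handled.

```python
def devices_to_cumask(device_indices):
    """Build a bitmask with bit i set for every device index i.

    Assumption: bit i of cumask corresponds to CUDA device i.
    """
    mask = 0
    for index in device_indices:
        if index < 0:
            raise ValueError("device index must be non-negative")
        mask |= 1 << index
    return mask


def cumask_to_devices(mask):
    """Return the device indices whose bits are set in a non-negative mask."""
    if mask < 0:
        raise ValueError("negative masks (such as the default -1) are not decoded here")
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]


if __name__ == "__main__":
    print(hex(devices_to_cumask([0])))     # single device 0, as in cumask=0x1
    print(hex(devices_to_cumask([0, 1])))  # devices 0 and 1
    print(cumask_to_devices(0x5))          # devices selected by mask 0x5
```

Under that assumption, a system with two devices where both should be used would get cumask=0x3.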
>Note that you should not run it in parallel on a standalone
>computer system, as this run uses multithreading rather than
>MPI or P2P level parallelism.
>Finally, you need to disable TDR (the text below is taken from CUDA_Release_Notes_2.2.txt):
o Individual kernels are limited to a 2-second runtime by Windows Vista. Kernels that run for longer than 2 seconds will trigger the Timeout Detection and Recovery (TDR) mechanism. For more information, see http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx. GPUs without a display attached are not subject to the 2 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter. Thus, for devices like S1070 that do not have an attached display, users may disable the Windows TDR timeout. Disabling the TDR timeout will allow kernels to run for extended periods of time without triggering an error. The following is an example .reg script:

  Windows Registry Editor Version 5.00

  [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
  "TdrLevel"=dword:00000000
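As a rough sketch, the text of the .reg script quoted above could also be generated with a few lines of Python, which makes the dword formatting (eight hexadecimal digits) explicit. The function name is hypothetical; the registry key and value name are taken from the release notes text, and nothing here touches the registry itself.

```python
# Registry key quoted in the CUDA release notes excerpt above.
GRAPHICS_DRIVERS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_script(tdr_level=0):
    """Return the text of a .reg script that sets TdrLevel to tdr_level.

    The value is written as a 32-bit dword, i.e. eight hex digits.
    """
    if not 0 <= tdr_level <= 0xFFFFFFFF:
        raise ValueError("tdr_level must fit in an unsigned 32-bit dword")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + GRAPHICS_DRIVERS_KEY + "]",
        '"TdrLevel"=dword:{:08x}'.format(tdr_level),
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Print the script; saving it to a .reg file and importing it is left to the user.
    print(tdr_reg_script(0))
```

The generated text only becomes effective once it is saved as a .reg file and imported with administrator rights.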
>Hope this helps.
>P.S. The trap address is outside of Firefly's code and is most
>likely inside CUDA or other NVidia DLLs. Indeed, it's our
>experience that CUDA libs/drivers (at least their Windows
>implementation) do not like it when multiple processes initialize
>a CUDA device at the same time; e.g., sometimes this results in a BSOD...