Firefly and PC GAMESS-related discussion club

Re^4: Firefly 7.1G with CUDA 2.3

Hi,

one additional comment on Firefly/CUDA interoperability:
you need to have the 32-bit CUDA Toolkit installed on your system.
The error message you encountered is caused by the OS
attempting to load 64-bit CUDA DLLs into a 32-bit program. Which
OS version are you using? It is possible to have both the
32-bit and 64-bit toolkits installed at the same time.

Actually, all you need is to put the 32-bit cudart.dll
and cublas.dll into Firefly's folder. As this is allowed by the
CUDA EULA, I've attached the required DLLs to this post.
They are packed using the same password as the Firefly v. 7.1.G
distribution files. So there is no need to install the toolkit;
just pick up and use these DLLs.
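
If you want to confirm that the DLLs sitting in Firefly's folder really are the 32-bit builds, the target architecture is recorded in each file's PE header. The following is only a minimal sketch, not part of Firefly or the CUDA Toolkit: the helper names pe_machine and check_dll are made up for illustration, and the script assumes it is run from the folder that holds cudart.dll and cublas.dll.

```python
import struct

# Machine values from the PE/COFF file header.
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x64)"}


def pe_machine(data: bytes) -> str:
    """Return the target architecture recorded in a PE image's header."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE image (missing MZ header)")
    # Offset 0x3C of the DOS header holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 6 > len(data) or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image (missing PE signature)")
    # The 2-byte Machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04x})")


def check_dll(path: str) -> str:
    """Read the start of the file, which is enough for ordinary DLLs."""
    with open(path, "rb") as fh:
        return pe_machine(fh.read(4096))


if __name__ == "__main__":
    # Hypothetical usage: run from inside Firefly's folder.
    for name in ("cudart.dll", "cublas.dll"):
        try:
            print(name, "->", check_dll(name))
        except (OSError, ValueError) as exc:
            print(name, "-> could not check:", exc)
```

A DLL reported as 64-bit here is the kind that triggers the "machine type other than the current machine" message when a 32-bit program tries to load it.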

Hope this helps.

Alex

P.S. There is no negative impact on GPU performance
from using the 32-bit rather than the 64-bit CUDA interface
libraries.

On Tue Feb 16 '10 7:03pm, Alex Granovsky wrote
----------------------------------------------
>Dear Veinardi,

>Sorry for not documenting the CUDA-related options for such a long time.
>The systems we used have multiple CUDA devices, and thus to get
>optimal performance we used lots of Firefly-CUDA specific options.

>They are not needed in your case. The proper input file would be simply:

 $contrl scftyp=rhf mplevl=4 runtyp=energy icharg=-1 $end                       
 $system mwords=140 $end 
! to use four cores for dgemm (mklnp) while only three CPU working threads otherwise (np)
 $system mklnp=4 np=3 $end
! to allow CUDA support and CUDA working threads 
 $smp cuda=.t. $end            
! cumask is the bitmask of available CUDA devices to use 
! (default is -1 i.e. to use the first available CUDA device)
 $cuda cumask=0x1 $end
 $basis gbasis=n311 ngauss=6 npfunc=2 ndfunc=2 diffsp=.t. $end                  
 $scf dirscf=1 $end                                                             
 $mp4 sdtq=1 $end                                               
 $data

 C1                                                                             
 CL         17.0  -0.3520333657  -0.1650980028  -0.0471638329                   
 H           1.0   0.7972862956  -1.4281273262  -1.2294694043                   
 H           1.0  -1.5889984100   1.0741294712  -1.1576600606                   
 H           1.0  -1.5392863051  -1.2871081468   1.2265482694                   
 H           1.0   0.7428387888   0.9385578775   1.0969088438                   
 F           9.0   1.3109647391  -2.0051605917  -1.7641455975                   
 F           9.0  -2.1531089472   1.6362757807  -1.6569055973                   
 F           9.0  -2.0787314212  -1.7943015141   1.8058474018                   
 F           9.0   1.2619562758   1.4885572294   1.6876526453                   
 H           1.0   2.8342698106   1.9875432625   1.3734980540                   
 F           9.0   3.7058425393   2.3127319602   1.2848892783                   
 $end
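
The cumask value in the input above is described as a bitmask of CUDA devices. As a rough illustration of how such a mask is read, here is a small sketch. It assumes that bit i of the mask corresponds to CUDA device i, which the post does not state explicitly, and the function names are hypothetical rather than anything provided by Firefly. Negative values are rejected because the comment in the input says -1 is a special default and not an ordinary mask.

```python
def devices_from_mask(cumask: int, device_count: int) -> list:
    """List device indices selected by the mask (assumes bit i <-> device i)."""
    if cumask < 0:
        # Per the input comments, -1 is a special default, not a plain bitmask.
        raise ValueError("negative cumask has a special meaning; not decoded here")
    return [i for i in range(device_count) if (cumask >> i) & 1]


def mask_from_devices(devices) -> int:
    """Build a mask with one bit set per requested device index."""
    mask = 0
    for i in devices:
        mask |= 1 << i
    return mask


if __name__ == "__main__":
    print(devices_from_mask(0x1, 4))       # mask used in the input above
    print(hex(mask_from_devices([0, 2])))  # hypothetical two-device selection
```

Under that assumption, cumask=0x1 selects only the first device, which matches the single-GPU setup discussed in this thread.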

>Note that you should not run it in parallel on a standalone
>computer system, as this run uses multithreading rather than
>MPI or P2P level parallelism.

>Finally, you need to disable TDR (the text below is taken from CUDA_Release_Notes_2.2.txt):

o Individual kernels are limited to a 2-second runtime by Windows
  Vista. Kernels that run for longer than 2 seconds will trigger
  the Timeout Detection and Recovery (TDR) mechanism. For more
  information, see
  http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

  GPUs without a display attached are not subject to the 2 second
  runtime restriction. For this reason it is recommended that
  CUDA be run on a GPU that is NOT attached to a display and
  does not have the Windows desktop extended onto it. In this
  case, the system must contain at least one NVIDIA GPU that
  serves as the primary graphics adapter. Thus, for devices like S1070
  that do not have an attached display, users may disable the Windows TDR
  timeout. Disabling the TDR timeout will allow kernels to run for
  extended periods of time without triggering an error.

  The following is an example .reg script:
  
    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
    "TdrLevel"=dword:00000000
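
If you prefer to generate the .reg script rather than type it by hand, a short sketch such as the one below produces the same text as the example above. The function name tdr_reg_script is hypothetical; only the registry key and the TdrLevel value name come from the release notes quoted here. The script only builds the text; importing it into the registry is still a manual step and needs the usual care.

```python
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_script(level: int = 0) -> str:
    """Return .reg file text that sets TdrLevel to the given DWORD value."""
    if not 0 <= level <= 0xFFFFFFFF:
        raise ValueError("TdrLevel must fit in an unsigned 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        f'"TdrLevel"=dword:{level:08x}',  # .reg files write DWORDs as 8 hex digits
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Level 0 is the value used in the release-notes example to disable TDR.
    print(tdr_reg_script(0))
```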

>Hope this helps.

>Regards,
>Alex

>P.S. The trap address is outside of Firefly's code and is most
>likely inside CUDA or other NVIDIA DLLs. Indeed, it is our
>experience that the CUDA libs/drivers (at least their Windows
>implementation) do not like it when multiple processes initialize
>a CUDA device at the same time, e.g. sometimes this results in a BSOD...
>
>
>On Tue Feb 16 '10 4:31am, Veinardi Suendo wrote
>-----------------------------------------------
>>Dear Alex,

>>Thank you very much for your help. I think it is due to some wrong instruction in the input file. The problem is that I do not have any reference for the CUDA instructions in this new version of Firefly. So I took your input file and made some modifications, but I am not sure that my modifications are correct. The program gave this message: "The image file is valid, but is for a machine type other than the current machine. Select OK to continue, or CANCEL to fail the DLL load." Here I include the input, output and punch files as well. I do hope that you can give me a solution.

>>Thank you in advance,

>>Veinardi

>>On Mon Feb 15 '10 2:58pm, Alex Granovsky wrote
>>----------------------------------------------
>>>Hi,

>>>the code works with CUDA SDK 2.3. Could you please share the
>>>exact input and output files?

>>>Regards,
>>>Alex Granovsky

>>>On Mon Feb 15 '10 7:16am, Veinardi Suendo wrote
>>>-----------------------------------------------
>>>>Dear Colleagues,

>>>>I have just tried to run a calculation based on the CUDA benchmark test (http://classic.chem.msu.su/gran/gamess/cuding.html), but it failed on our machine. The code said that the dll files were not compatible. I do not know whether this is due to the different CUDA version (we use v2.3) or due to some option in the input file that is specific to each type of NVIDIA card. Here, we used the cheapest one in the GTX200 series: a GTX260 made by Manli. We have tested this card with a trial version of Jacket running on Matlab, and everything went well.

>>>>Please, if any of you have any suggestions, we need this option to accelerate the geometry optimization and vibration analysis.

>>>>Thank you in advance,

>>>>Yours Sincerely,

>>>>Veinardi Suendo

This message contains the 1148 kb attachment
[ cuda23.rar ]

Thu Feb 18 '10 1:08pm