I'd like to test Firefly/CUDA performance in an MP4(SDTQ) calculation on a Linux Core i7 workstation with 6 cores (+6 via HT?), 24 GiB RAM and three GTX 480 cards. I've already done a CPU-only calculation with memory=450000000 and 6 processes via MPI; it took approximately 2 hours with the latest 8.x beta version of Firefly.
Now I'm in doubt about what I should use for mklnp, np and/or other keys, even undocumented ones, to get better performance and to use the GPU cards in shared mode(?). As far as I understand, some parts of the input should look like
$cuda cumask=7 $end
$smp httnp=1(2?) cuda=.t. $end
Also, I believe that parallel execution is crucial here, because more memory can be allocated in total and the integral transformations can be done in fewer passes. Is this true? If I use mklnp=9 np=6 and run in serial mode, CPU usage is indeed 600%, but I still had no energy output after waiting for 4 hours.
The system has NCORE= 10, NOCC= 38, NAOS= 366, if it matters.
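
For reference, here is a rough back-of-the-envelope sketch of the numbers above, written in Python. It rests on assumptions of mine: that memory= is counted in 8-byte words and applies to each MPI process separately, that NOCC includes the NCORE frozen orbitals, and that no basis functions are dropped. If any of these is wrong, the figures change accordingly.

```python
# Rough estimate only; every assumption is flagged in the comments.
WORD_BYTES = 8      # assumption: memory= is given in 8-byte words
GIB = 2**30

memory_words = 450_000_000   # memory= value used in the CPU-only run
n_procs = 6                  # MPI processes in that run
ram_gib = 24                 # installed RAM

# Memory per process and aggregate over all MPI processes
# (assumption: each process allocates its own memory= pool).
per_proc_gib = memory_words * WORD_BYTES / GIB
total_gib = per_proc_gib * n_procs

# Orbital space sizes from the figures quoted in the post.
ncore, nocc, naos = 10, 38, 366
n_active_occ = nocc - ncore   # assumption: NOCC includes the frozen core
n_virt = naos - nocc          # assumption: no basis functions dropped

print(f"per process   : {per_proc_gib:.2f} GiB")
print(f"all processes : {total_gib:.2f} GiB of {ram_gib} GiB")
print(f"active occupied orbitals: {n_active_occ}")
print(f"virtual orbitals        : {n_virt}")
```

Under these assumptions a single serial process sees roughly 3.35 GiB, while the six MPI processes together use roughly 20 GiB, which is why I suspect the serial run needs more transformation passes.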