I'm running large MP4(SDTQ) calculations in parallel on an SMP system (quad-core i7 with HTT disabled, 6 GB RAM, Windows 7 x64) using two processes with 2 threads each (i.e. -NP 2 and MLKNP=2) and a single GTX 285 card. I have a couple of questions:
1) For some reason I cannot use more than 200 MW of RAM. When I set the memory limit above 200 MW (which my system can afford), the Firefly run terminates, reporting that this amount of memory is not available. The error appears regardless of how I run Firefly (parallel or serial) and of the number of cores used. Is this an inherent limitation of 32-bit Firefly, or am I doing something wrong?
2) When running MP4(SDTQ) in parallel with a single CUDA device, the system behaves rather strangely. The serial SMP version works well and gets a substantial speedup from CUDA; in that case the CPU load is ca. 100% and the GPU load is ca. 60-70% for the triples part. When I run the same job in parallel (-NP 2 and MLKNP=2) to speed up the HF part of the calculation, the system barely responds. Although the CPU load is lower than in serial mode, the machine is almost impossible to use, and the triples part of the calculation is substantially slower. So my question is: how is a single CUDA device handled during parallel execution on an SMP system? Could the problems arise because both processes try to use the single CUDA device simultaneously?
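Regarding question 1, here is the rough arithmetic behind my suspicion that this is a 32-bit limit. It is only a sketch under my own assumptions: that one word is 8 bytes, that 1 MW means 10^6 words, and that a 32-bit process on Windows gets 2 GiB of user address space by default. The function and constant names are made up for illustration.

```python
# Rough check of a memory request against a 32-bit address space.
# Assumptions (mine, not taken from the Firefly manual):
#   - one word is 8 bytes (double precision)
#   - 1 MW is 10**6 words
#   - a 32-bit Windows process has 2 GiB of user address space by default
WORD_BYTES = 8
MIB = 1024 ** 2
USER_SPACE_32BIT_MIB = 2048


def mwords_to_mib(mwords: float) -> float:
    """Convert a request in megawords to MiB."""
    return mwords * 1_000_000 * WORD_BYTES / MIB


if __name__ == "__main__":
    for mw in (200, 250, 300):
        mib = mwords_to_mib(mw)
        headroom = USER_SPACE_32BIT_MIB - mib
        print(f"{mw} MW = {mib:.0f} MiB, headroom {headroom:.0f} MiB")
```

If these assumptions hold, 200 MW already takes roughly three quarters of the default 2 GiB, and the request also has to fit alongside the executable, its libraries and any other allocations, so a slightly larger request could fail even with free physical RAM.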