Since version 5.2 (May 1999), PC GAMESS (now Firefly) has included a closed-shell MP4(full) (a.k.a. MP4(SDTQ) or MP4-SDTQ) module which, we believe, represents the real state of the art. For example, it uses the minimal possible (to the best of our knowledge) number of floating-point operations (to be precise, Nvirt³*(Nvirt+Nocc)*(Nocc³-Nocc) additions and the same number of multiplications for its MP4(T) part in the case of no symmetry, formulated exclusively as calls to matrix-matrix multiplication routines), fully exploits any abelian symmetry, supports SMP via multithreading/shared memory, and is capable of running efficiently in parallel.

Once written, the code was not changed for ten years... but now, in 2009, it is time for a change, so we decided to add CUDA support to the existing "gold" code. Getting the most out of CUDA on MP4 jobs was (and still is) a real challenge, as the intermediate matrices are typically not very large, while there are lots of (at the moment unavoidable) memory transfers back and forth between the host and the device(s). On this page, we are glad to present the results of our recent development in this field.
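As a quick sanity check of this operation count (our reading: Nocc here means only the active, non-frozen-core occupied MOs), plugging in the values of the benchmark described below (Nocc = 24, Nvirt = 193, hence the (Nocc+Nvirt)×Nvirt = 217×193 matrices mentioned in the table captions) gives

\[
2 \times N_{\mathrm{virt}}^{3}\,(N_{\mathrm{virt}}+N_{\mathrm{occ}})\,(N_{\mathrm{occ}}^{3}-N_{\mathrm{occ}})
= 2 \times 193^{3} \times 217 \times (24^{3}-24)
\approx 43.057 \times 10^{12}
\]

additions plus multiplications, which is exactly the figure quoted for Tables I and II.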
Table I. Standard MP4(SDTQ) benchmark, time/performance data for the most expensive N⁷ MP4(T) step (43.057×10¹² double-precision floating-point operations using DGEMM; size of matrices: 217×193). Intel Core i7 940 2.93 GHz with two GeForce GTX 280 boards on PCIe 2.0 x16 links. Each cell gives wall-clock time in seconds / performance in GFlop/s.

| GPGPUs used \ CPU cores used | 0               | 1               | 2               | 3                                        | 4                 |
|------------------------------|-----------------|-----------------|-----------------|------------------------------------------|-------------------|
| 0                            | 0.00 GFlop/s    | 4014.67 / 10.72 | 2110.17 / 20.40 | 1421.91 / 30.28                          | 1087.36 / 39.60   |
| 1                            | 1357.28 / 31.72 | 1036.00 / 41.56 | 835.37 / 51.54  | 709.66 / 60.67                           | 668.66 / 64.39 a) |
| 2                            | 697.77 / 61.70  | 611.87 / 70.37  | 542.39 / 79.38  | 508.21 (495.38 b)) / 84.72 (86.91 b)) a) | 487.88 / 88.25 a) |
Table II. Standard MP4(SDTQ) benchmark, time/performance data for the most expensive N⁷ MP4(T) step (43.057×10¹² double-precision floating-point operations using DGEMM; size of matrices: 217×193). AMD Phenom II X4 955 3.2 GHz with one GeForce GTX 295 on a PCIe 2.0 x16 link (configured as two independent CUDA devices) and two Tesla C1060 boards on PCIe 2.0 x8 links. Each cell gives wall-clock time in seconds / performance in GFlop/s.

| GPGPUs used \ CPU cores used | 0               | 1               | 2               | 3               | 4               |
|------------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| 0                            | 0.00 GFlop/s    | 4192.02 / 10.27 | 2276.03 / 18.92 | 1470.25 / 29.29 | 1142.69 / 37.68 |
| 1 (GTX 295)                  | 1489.03 / 28.92 | 1107.10 / 38.89 | 928.72 / 46.36  | 773.87 / 55.64  | n/a             |
| 2 (GTX 295)                  | 779.90 / 55.21  | 685.25 / 62.83  | 623.60 / 69.05  | n/a             | n/a             |
| 3 (GTX 295 + one C1060)      | 550.60 / 78.20  | 520.91 / 82.66  | n/a             | n/a             | n/a             |
| 4 (GTX 295 + two C1060)      | 446.65 / 96.40  | n/a             | n/a             | n/a             | n/a             |
Test system for Table I: Intel Quad-core Core i7 940 2.93 GHz, Asus P6T Deluxe mainboard, 6x2GB (triple channel) DDR3-1333 RAM @ 1333 MHz, four Seagate 1 TB SATA-2 ST31000340NS HDDs, Windows Server 2008 Enterprise x64 SP 2. Hyperthreading and Turbo Boost Technology were enabled in the BIOS. NVidia CUDA SDK version 2.2, CUDA driver version: 0x000007E4, runtime version: 0x000007E4.
Test system for Table II: AMD Quad-core Phenom II X4 955 Black Edition 3.2 GHz, Asus M4A79T Deluxe mainboard, 4x2GB (dual channel, unganged mode) DDR3-1333 ECC unbuffered RAM @ 1333 MHz, four WDC 1 TB SATA-2 WD10
The GeForce GTX 280, GeForce GTX 295, and Tesla C1060 boards were run at their factory-shipped frequencies.
Single-point direct MP4-SDTQ (i.e., MP4(full)) energy for a small molecular cluster: Cl-(HF)5, Cartesian 6-311+G(2d,2p) basis set, 227 AOs, 34 occupied (10 frozen core) MOs, 193 virtual MOs.
Compressed output files and a sample input are available here.
All calculations were performed by a single Firefly process running in multithreaded mode with dynamic load balancing between all types of compute threads, both purely CPU-based and CUDA-enabled ones. Unless explicitly stated otherwise, all benchmarks used only a single logical processor of each allotted CPU core. The Call64 switch was turned on for all tests for faster CPU processing. CUDA-enabled threads fully overlap GPU computations, CPU computations, and asynchronous memory transfers between the host and the CUDA device(s). Times are wall-clock times for the MP4(T) part, in seconds.
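To make the last point a little more concrete, below is a minimal sketch of the double-buffering idea behind such overlapping: two CUDA streams with pinned host buffers let the upload of one batch of matrices overlap with the DGEMM running on the previous one. This is not the actual Firefly code (which was written against the CUDA 2.2-era interfaces); the sketch uses the current cuBLAS API, and names such as tiled_dgemm and nTiles are purely illustrative.

    // Minimal double-buffering sketch: overlap host<->device transfers with
    // DGEMM calls using two CUDA streams and ping-pong device buffers.
    // Error checking is omitted for brevity; hA, hB, hC are assumed to be
    // pinned host memory (allocated with cudaMallocHost), otherwise the
    // asynchronous copies silently become synchronous.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void tiled_dgemm(const double *hA, const double *hB, double *hC,
                     int m, int n, int k, int nTiles)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        cudaStream_t stream[2];
        double *dA[2], *dB[2], *dC[2];
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc((void **)&dA[s], sizeof(double) * m * k);
            cudaMalloc((void **)&dB[s], sizeof(double) * k * n);
            cudaMalloc((void **)&dC[s], sizeof(double) * m * n);
        }

        const double one = 1.0, zero = 0.0;
        for (int t = 0; t < nTiles; ++t) {
            const int s = t & 1;  // ping-pong buffer/stream index
            // Upload the t-th pair of input matrices; this can overlap with the
            // DGEMM still running in the other stream.
            cudaMemcpyAsync(dA[s], hA + (size_t)t * m * k, sizeof(double) * m * k,
                            cudaMemcpyHostToDevice, stream[s]);
            cudaMemcpyAsync(dB[s], hB + (size_t)t * k * n, sizeof(double) * k * n,
                            cudaMemcpyHostToDevice, stream[s]);
            // Queue the multiplication and the download of the result in the
            // same stream, so they run strictly after the uploads above.
            cublasSetStream(handle, stream[s]);
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &one, dA[s], m, dB[s], k, &zero, dC[s], m);
            cudaMemcpyAsync(hC + (size_t)t * m * n, dC[s], sizeof(double) * m * n,
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        for (int s = 0; s < 2; ++s) cudaStreamSynchronize(stream[s]);

        for (int s = 0; s < 2; ++s) {
            cudaFree(dA[s]); cudaFree(dB[s]); cudaFree(dC[s]);
            cudaStreamDestroy(stream[s]);
        }
        cublasDestroy(handle);
    }

Because all GPU work here is merely queued asynchronously, the host thread that issued it is free to do its own share of CPU DGEMMs in the meantime, which is what allows CPU and GPU computations to proceed simultaneously.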
Running Firefly in parallel across these two systems (commodity 1 Gbit Ethernet switch, NT-MPICH v. 1.5) in the double dynamic load balancing mode (i.e., balancing both between the compute threads within each process and between the two instances of the entire parallel job; a rough sketch of the intra-process part of this scheme is given after the results below), we obtained the following results:
561.00 seconds (76.75 GFlop/s) using all available CPU cores for computations and no GPUs at all
274.68 seconds (156.75 GFlop/s) using all GPUs for computations and no CPU cores at all
234.48 seconds (183.63 GFlop/s) using all GPUs and four CPU cores of the Core i7 (thanks to its Hyperthreading; AMD clearly needs to implement this feature as well! Without Hyperthreading, it is not possible to efficiently use the four Phenom cores and four CUDA devices for computations on the same system at the same time)
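As mentioned above, here is a rough sketch of the intra-process half of such a dynamic load-balancing scheme. It is only an illustration of the idea, not the actual Firefly implementation: CPU-only and CUDA-enabled worker threads pull chunks of work items from one shared atomic counter, so the faster workers automatically take a larger share. All names (worker, process_item) and chunk sizes are made up.

    // Dynamic load balancing between heterogeneous compute threads via a
    // shared atomic work counter (illustrative sketch only).
    #include <algorithm>
    #include <atomic>
    #include <thread>
    #include <vector>

    static std::atomic<long> next_item(0);   // shared work counter
    static const long total_items = 100000;  // illustrative total amount of work

    static void worker(bool cuda_enabled)
    {
        // CUDA-enabled threads grab larger chunks to amortize transfer latency.
        const long chunk = cuda_enabled ? 64 : 8;
        for (;;) {
            const long first = next_item.fetch_add(chunk);
            if (first >= total_items) break;
            const long last = std::min(first + chunk, total_items);
            for (long i = first; i < last; ++i) {
                // process_item(i): the DGEMM-based work for item i, done on a
                // GPU by CUDA-enabled threads and on the CPU by the others.
            }
        }
    }

    int main()
    {
        std::vector<std::thread> pool;
        for (int g = 0; g < 2; ++g) pool.emplace_back(worker, true);   // CUDA-enabled threads
        for (int c = 0; c < 4; ++c) pool.emplace_back(worker, false);  // CPU-only threads
        for (std::thread &t : pool) t.join();
        return 0;
    }

The inter-process half of the double dynamic load balancing can be thought of as the same idea one level up, with the work counter shared between the two Firefly processes over MPI instead of within a single address space.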
Notes to the blue cells in Table I:
Notes to the red cells in Table II:
We are grateful to Brent Oster (NVidia Corporation) for the donation of two Tesla C1060 boards to the PC GAMESS/Firefly team.
Our big thanks to Andrey Ovdenko (Intercom-PC Moscow) for his excellent technical support.