Since version 5.2 (May 1999), PC GAMESS (now Firefly) has included a closed-shell MP4(full) (a.k.a. MP4(SDTQ) or MP4-SDTQ) module which, we believe, represents the real state of the art. For example, it uses the minimal possible (to the best of our knowledge) number of floating-point operations (to be precise, Nvirt³*(Nvirt+Nocc)*(Nocc³-Nocc) additions and the same number of multiplications for its MP4(T) part in the case of no symmetry, formulated exclusively as calls to matrix-matrix multiplication routines), fully exploits any abelian symmetry, supports SMP via multithreading/shared memory, and is capable of running efficiently in parallel.

Once written, the code was not changed for ten years... but now, in 2009, it is time for a change, so we decided to add CUDA support to the existing "gold" code. Getting the most out of CUDA on MP4 jobs was (and still is) a real challenge, as the intermediate matrices are typically not very large, while there are lots of (at the moment unavoidable) memory transfers back and forth between the host and the device(s). On this page, we are glad to present the results of our recent development in this field.
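As a quick sanity check of this operation count (our reading: Nocc here means only the active, non-frozen-core occupied MOs), plugging in the values of the benchmark described below (Nocc = 24, Nvirt = 193, hence the (Nocc+Nvirt)×Nvirt = 217×193 matrices mentioned in the table captions) gives

\[
2 \times N_{\mathrm{virt}}^{3}\,(N_{\mathrm{virt}}+N_{\mathrm{occ}})\,(N_{\mathrm{occ}}^{3}-N_{\mathrm{occ}})
= 2 \times 193^{3} \times 217 \times (24^{3}-24)
\approx 43.057 \times 10^{12}
\]

additions plus multiplications, which is exactly the figure quoted for Tables I and II.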
Table I. Standard MP4(SDTQ) benchmark, time/performance data for the most expensive N⁷ MP4(T) step (43.057×10¹² double-precision floating-point operations using DGEMM; size of matrices: 217×193). Intel Core i7 940 2.93 GHz with two GeForce GTX 280 boards on PCIe 2.0 x16 links. Each cell gives wall-clock time in seconds / performance in GFlop/s.

| GPGPUs used \ CPU cores used | 0               | 1               | 2               | 3                                        | 4                 |
|------------------------------|-----------------|-----------------|-----------------|------------------------------------------|-------------------|
| 0                            | 0.00 GFlop/s    | 4014.67 / 10.72 | 2110.17 / 20.40 | 1421.91 / 30.28                          | 1087.36 / 39.60   |
| 1                            | 1357.28 / 31.72 | 1036.00 / 41.56 | 835.37 / 51.54  | 709.66 / 60.67                           | 668.66 / 64.39 a) |
| 2                            | 697.77 / 61.70  | 611.87 / 70.37  | 542.39 / 79.38  | 508.21 (495.38 b)) / 84.72 (86.91 b)) a) | 487.88 / 88.25 a) |
Table II. Standard MP4(SDTQ) benchmark, time/performance data for the most expensive N⁷ MP4(T) step (43.057×10¹² double-precision floating-point operations using DGEMM; size of matrices: 217×193). AMD Phenom II X4 955 3.2 GHz with one GeForce GTX 295 on a PCIe 2.0 x16 link (configured as two independent CUDA devices) and two Tesla C1060 boards on PCIe 2.0 x8 links. Each cell gives wall-clock time in seconds / performance in GFlop/s.

| GPGPUs used \ CPU cores used | 0               | 1               | 2               | 3               | 4               |
|------------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| 0                            | 0.00 GFlop/s    | 4192.02 / 10.27 | 2276.03 / 18.92 | 1470.25 / 29.29 | 1142.69 / 37.68 |
| 1 (GTX 295)                  | 1489.03 / 28.92 | 1107.10 / 38.89 | 928.72 / 46.36  | 773.87 / 55.64  | n/a             |
| 2 (GTX 295)                  | 779.90 / 55.21  | 685.25 / 62.83  | 623.60 / 69.05  | n/a             | n/a             |
| 3 (GTX 295 + one C1060)      | 550.60 / 78.20  | 520.91 / 82.66  | n/a             | n/a             | n/a             |
| 4 (GTX 295 + two C1060)      | 446.65 / 96.40  | n/a             | n/a             | n/a             | n/a             |
Test system for Table I: Intel Quad-core Core i7 940 2.93 GHz, Asus P6T Deluxe mainboard, 6x2GB (triple channel) DDR3-1333 RAM @ 1333 MHz, four Seagate 1 TB SATA-2 ST31000340NS HDDs, Windows Server 2008 Enterprise x64 SP 2. Hyperthreading and Turbo Boost Technology were enabled in the BIOS. NVidia CUDA SDK version 2.2, CUDA driver version: 0x000007E4, runtime version: 0x000007E4.
Test system for Table II: AMD Quad-core Phenom II X4 955 Black Edition 3.2 GHz, Asus M4A79T Deluxe mainboard, 4x2GB (dual channel, unganged mode) DDR3-1333 ECC unbuffered RAM @ 1333 MHz, four WDC 1 TB SATA-2 WD10
The GeForce GTX 280, GeForce GTX 295, and Tesla C1060 boards were run at their factory-shipped frequencies.
Single-point direct MP4-SDTQ (i.e., MP4(full)) energy for a small molecular cluster: Cl-(HF)5, Cartesian 6-311+G(2d,2p) basis set, 227 AOs, 34 occupied (10 frozen core) MOs, 193 virtual MOs.
Compressed output files and a sample input are available here.
All calculations were performed by a single Firefly process running in multithreaded mode with dynamic load balancing between all types of compute threads, both purely CPU-based and CUDA-enabled ones. Unless explicitly stated otherwise, all benchmarks used only a single logical processor of each allotted CPU core. The Call64 switch was turned on for all tests for faster CPU processing. CUDA-enabled threads fully overlap GPU computations, CPU computations, and asynchronous memory transfers between the host and the CUDA device(s). Times are wall-clock times for the MP4(T) part, in seconds.
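To make the last point a little more concrete, below is a minimal sketch of the double-buffering idea behind such overlapping: two CUDA streams with pinned host buffers let the upload of one batch of matrices overlap with the DGEMM running on the previous one. This is not the actual Firefly code (which was written against the CUDA 2.2-era interfaces); the sketch uses the current cuBLAS API, and names such as tiled_dgemm and nTiles are purely illustrative.

    // Minimal double-buffering sketch: overlap host<->device transfers with
    // DGEMM calls using two CUDA streams and ping-pong device buffers.
    // Error checking is omitted for brevity; hA, hB, hC are assumed to be
    // pinned host memory (allocated with cudaMallocHost), otherwise the
    // asynchronous copies silently become synchronous.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void tiled_dgemm(const double *hA, const double *hB, double *hC,
                     int m, int n, int k, int nTiles)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        cudaStream_t stream[2];
        double *dA[2], *dB[2], *dC[2];
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc((void **)&dA[s], sizeof(double) * m * k);
            cudaMalloc((void **)&dB[s], sizeof(double) * k * n);
            cudaMalloc((void **)&dC[s], sizeof(double) * m * n);
        }

        const double one = 1.0, zero = 0.0;
        for (int t = 0; t < nTiles; ++t) {
            const int s = t & 1;  // ping-pong buffer/stream index
            // Upload the t-th pair of input matrices; this can overlap with the
            // DGEMM still running in the other stream.
            cudaMemcpyAsync(dA[s], hA + (size_t)t * m * k, sizeof(double) * m * k,
                            cudaMemcpyHostToDevice, stream[s]);
            cudaMemcpyAsync(dB[s], hB + (size_t)t * k * n, sizeof(double) * k * n,
                            cudaMemcpyHostToDevice, stream[s]);
            // Queue the multiplication and the download of the result in the
            // same stream, so they run strictly after the uploads above.
            cublasSetStream(handle, stream[s]);
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &one, dA[s], m, dB[s], k, &zero, dC[s], m);
            cudaMemcpyAsync(hC + (size_t)t * m * n, dC[s], sizeof(double) * m * n,
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        for (int s = 0; s < 2; ++s) cudaStreamSynchronize(stream[s]);

        for (int s = 0; s < 2; ++s) {
            cudaFree(dA[s]); cudaFree(dB[s]); cudaFree(dC[s]);
            cudaStreamDestroy(stream[s]);
        }
        cublasDestroy(handle);
    }

Because all GPU work here is merely queued asynchronously, the host thread that issued it is free to do its own share of CPU DGEMMs in the meantime, which is what allows CPU and GPU computations to proceed simultaneously.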
Running Firefly in parallel across these two systems (commodity 1 Gbit Ethernet switch, NT-MPICH v. 1.5) in the double dynamic load balancing mode (i.e., balancing both between the compute threads within each process and between the two instances of the entire parallel job; a rough sketch of the intra-process part of this scheme is given after the results below), we obtained the following results:
561.00 seconds (76.75 GFlop/s) using all available CPU cores for computations and no GPUs at all
274.68 seconds (156.75 GFlop/s) using all GPUs for computations and no CPU cores at all
234.48 seconds (183.63 GFlop/s) using all GPUs and four CPU cores of the Core i7 (thanks to its Hyperthreading; AMD clearly needs to implement this feature as well! Without Hyperthreading, it is not possible to efficiently use the four Phenom cores and four CUDA devices for computations on the same system at the same time)
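As mentioned above, here is a rough sketch of the intra-process half of such a dynamic load-balancing scheme. It is only an illustration of the idea, not the actual Firefly implementation: CPU-only and CUDA-enabled worker threads pull chunks of work items from one shared atomic counter, so the faster workers automatically take a larger share. All names (worker, process_item) and chunk sizes are made up.

    // Dynamic load balancing between heterogeneous compute threads via a
    // shared atomic work counter (illustrative sketch only).
    #include <algorithm>
    #include <atomic>
    #include <thread>
    #include <vector>

    static std::atomic<long> next_item(0);   // shared work counter
    static const long total_items = 100000;  // illustrative total amount of work

    static void worker(bool cuda_enabled)
    {
        // CUDA-enabled threads grab larger chunks to amortize transfer latency.
        const long chunk = cuda_enabled ? 64 : 8;
        for (;;) {
            const long first = next_item.fetch_add(chunk);
            if (first >= total_items) break;
            const long last = std::min(first + chunk, total_items);
            for (long i = first; i < last; ++i) {
                // process_item(i): the DGEMM-based work for item i, done on a
                // GPU by CUDA-enabled threads and on the CPU by the others.
            }
        }
    }

    int main()
    {
        std::vector<std::thread> pool;
        for (int g = 0; g < 2; ++g) pool.emplace_back(worker, true);   // CUDA-enabled threads
        for (int c = 0; c < 4; ++c) pool.emplace_back(worker, false);  // CPU-only threads
        for (std::thread &t : pool) t.join();
        return 0;
    }

The inter-process half of the double dynamic load balancing can be thought of as the same idea one level up, with the work counter shared between the two Firefly processes over MPI instead of within a single address space.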
Notes to the blue cells in Table I:
Notes to the red cells in Table II:
We are grateful to Brent Oster (NVidia Corporation) for the donation of two Tesla C1060 boards to the PC GAMESS/Firefly team.
Our big thanks to Andrey Ovdenko (Intercom-PC Moscow) for his excellent technical support.