Just CUDing: Firefly v. 7.1.G MP4(SDTQ) code benchmarks and performance on NVidia GPUs
or
The real state of the art in highly correlated quantum chemistry.


Preface

Since version 5.2 (May 1999), PC GAMESS (now Firefly) has included a closed-shell MP4(full) (a.k.a. MP4(SDTQ) or MP4-SDTQ) module which, we believe, was the real state of the art. For example, it uses the minimal possible (to the best of our knowledge) number of floating-point operations (to be precise, Nvirt^3*(Nvirt+Nocc)*(Nocc^3-Nocc) additions and the same number of multiplications for its MP4(T) part in the case of no symmetry), formulated exclusively as calls to matrix-matrix multiplication routines; it fully exploits any abelian symmetry, supports SMP via multithreading/shared memory, and is capable of running efficiently in parallel. Once written, the code was left unchanged for ten years... but now, in 2009, the time of change has come, so we decided to add CUDA support to the existing "gold" code. Getting the most out of CUDA on MP4 jobs was (and still is) a real challenge, since the intermediate matrices are typically not very large, while there are many (at the moment unavoidable) memory transfers back and forth between the host and the device(s). On this page, we are glad to present the results of our recent development in this field.
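As a consistency check, the operation count above can be evaluated for the benchmark used below (193 virtual MOs and 24 correlated occupied MOs, i.e. 34 occupied minus 10 frozen core); a minimal sketch:

```python
# MP4(T) operation count from the formula above (no-symmetry case):
# Nvirt^3 * (Nvirt + Nocc) * (Nocc^3 - Nocc) additions, plus the same
# number of multiplications.
def mp4t_flops(nocc: int, nvirt: int) -> int:
    """Total FP operations (additions + multiplications) of the MP4(T) step."""
    return 2 * nvirt**3 * (nvirt + nocc) * (nocc**3 - nocc)

# Benchmark below: 34 occupied MOs with 10 frozen cores -> 24 correlated
# occupied MOs, and 193 virtual MOs.
print(mp4t_flops(nocc=24, nvirt=193))  # 43056700184400, i.e. ~43.057e12
```

Note that Nvirt + Nocc = 217 and Nvirt = 193, which is exactly the 217×193 matrix shape quoted in the table captions below.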


Results

 

Table I. Standard MP4(SDTQ) benchmark, time/performance data for the most expensive N^7 MP4(T) step (43.057×10^12 double-precision floating-point operations using DGEMM; size of matrices: 217×193).

Intel Core i7 940 2.93 GHz with two GeForce GTX 280 boards on PCIe 2.0 x16 links

 

 

Each cell: wall-clock time in seconds (top) and sustained performance in GFlop/s (bottom); a) and b) mark the cells discussed in the notes below.

Number of    |                            Number of CPU cores used
GPGPUs used  |       0        |       1        |       2        |            3                |        4
-------------+----------------+----------------+----------------+-----------------------------+------------------
     0       |      n/a       |   4014.67 s    |   2110.17 s    |        1421.91 s            |    1087.36 s
             |  0.00 GFlop/s  | 10.72 GFlop/s  | 20.40 GFlop/s  |      30.28 GFlop/s          |  39.60 GFlop/s
     1       |   1357.28 s    |   1036.00 s    |    835.37 s    |         709.66 s            |   668.66 s a)
             | 31.72 GFlop/s  | 41.56 GFlop/s  | 51.54 GFlop/s  |      60.67 GFlop/s          | 64.39 GFlop/s a)
     2       |    697.77 s    |    611.87 s    |    542.39 s    |  508.21 a) (495.38 b)) s    |   487.88 s a)
             | 61.70 GFlop/s  | 70.37 GFlop/s  | 79.38 GFlop/s  | 84.72 a) (86.91 b)) GFlop/s | 88.25 GFlop/s a)
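The GFlop/s entries are not measured independently; they follow from dividing the fixed 43.057×10^12-operation workload by the wall-clock time. A quick cross-check against a few cells of Table I:

```python
# Sustained performance = fixed MP4(T) operation count / wall-clock time.
TOTAL_OPS = 43.057e12  # operation count from the table caption

def gflops(seconds: float) -> float:
    return TOTAL_OPS / seconds / 1e9

# A few (time, reported GFlop/s) pairs taken from Table I:
cells = [(4014.67, 10.72),  # 1 CPU core, no GPUs
         (1357.28, 31.72),  # one GTX 280 alone
         (487.88, 88.25)]   # two GTX 280 boards + 4 CPU cores
for seconds, reported in cells:
    assert abs(gflops(seconds) - reported) < 0.01
```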

 

 

 

Table II. Standard MP4(SDTQ) benchmark, time/performance data for the most expensive N^7 MP4(T) step (43.057×10^12 double-precision floating-point operations using DGEMM; size of matrices: 217×193).

AMD Phenom II X4 955 3.2 GHz with one GeForce GTX 295 on PCIe 2.0 x16 link (configured as two independent CUDA devices) and two Tesla C1060 boards on PCIe 2.0 x8 links

 

 

Each cell: wall-clock time in seconds (top) and sustained performance in GFlop/s (bottom).

Number of GPGPUs used    |                     Number of CPU cores used
                         |       0       |       1       |       2       |       3       |       4
-------------------------+---------------+---------------+---------------+---------------+---------------
0                        |      n/a      |   4192.02 s   |   2276.03 s   |   1470.25 s   |   1142.69 s
                         | 0.00 GFlop/s  | 10.27 GFlop/s | 18.92 GFlop/s | 29.29 GFlop/s | 37.68 GFlop/s
1 (GTX 295)              |   1489.03 s   |   1107.10 s   |    928.72 s   |    773.87 s   |      n/a
                         | 28.92 GFlop/s | 38.89 GFlop/s | 46.36 GFlop/s | 55.64 GFlop/s |
2 (GTX 295)              |    779.90 s   |    685.25 s   |    623.60 s   |      n/a      |      n/a
                         | 55.21 GFlop/s | 62.83 GFlop/s | 69.05 GFlop/s |               |
3 (GTX 295 + one C1060)  |    550.60 s   |    520.91 s   |      n/a      |      n/a      |      n/a
                         | 78.20 GFlop/s | 82.66 GFlop/s |               |               |
4 (GTX 295 + two C1060)  |    446.65 s   |      n/a      |      n/a      |      n/a      |      n/a
                         | 96.40 GFlop/s |               |               |               |

 

 




OS and hardware description


Intel Quad-core Core i7 940 2.93 GHz, Asus P6T Deluxe Mainboard, 6x2GB (triple channel) DDR3-1333 RAM@1333 MHz, four Seagate 1 TB SATA-2 ST31000340NS HDDs, Windows Server 2008 Enterprise x64 SP 2. Hyperthreading and Turbo Boost Technology were enabled in BIOS. NVidia CUDA SDK version 2.2, CUDA driver version : 0x000007E4, runtime version : 0x000007E4.

AMD Quad-core Phenom II X4 955 Black Edition 3.2 GHz, Asus M4A79T Deluxe Mainboard, 4x2GB (dual channel, unganged mode) DDR3-1333 ECC unbuffered RAM@1333 MHz, four WDC 1 TB SATA-2 WD10 02FBYS-02A6B0 HDDs, Windows Server 2008 Enterprise x64 SP 2. NVidia CUDA SDK version 2.2, CUDA driver version : 0x000007E4, runtime version : 0x000007E4.

GeForce GTX 280, GeForce GTX 295, and Tesla C1060 boards were set to their factory shipped frequencies.



Tests description

Single-point direct MP4-SDTQ (i.e., MP4(full)) energy for a small molecular cluster (Cl-(HF)5, Cartesian 6-311+G(2d,2p) basis set, 227 AOs, 34 occupied (10 frozen core) MOs, 193 virtual MOs).

Compressed output files and sample input are available here

More data on standard MP4(SDTQ) benchmark

Test comments


All calculations were performed by a single Firefly process running in multithreaded mode with dynamic load balancing between all types of compute threads, both purely CPU and CUDA-enabled ones. Unless explicitly stated otherwise, all benchmarks used only a single logical processor of each allotted CPU core. The Call64 switch was turned on for all tests for faster CPU processing. CUDA-enabled threads fully overlap GPU computations, CPU computations, and asynchronous memory transfers between the host and CUDA devices. Time is the wall-clock time for the MP4(T) part, in seconds.
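The dynamic load balancing between heterogeneous compute threads described above can be pictured as a shared work queue: each worker, whether CPU-only or GPU-offloading, grabs the next work unit as soon as it finishes the previous one, so faster workers automatically end up processing a larger share of the job. A minimal sketch (the thread mix and the 100 work units are invented for illustration; this is not Firefly's actual scheduler):

```python
import queue
import threading

# Shared queue of hypothetical MP4(T) work units.
work = queue.Queue()
for chunk in range(100):
    work.put(chunk)

done = {"cpu": 0, "gpu": 0}   # units completed by each worker type
lock = threading.Lock()

def worker(kind: str) -> None:
    """Pull work units until the queue is empty (dynamic load balancing)."""
    while True:
        try:
            chunk = work.get_nowait()
        except queue.Empty:
            return            # no work left: thread exits
        # ... process `chunk` on the CPU, or submit it to a CUDA device ...
        with lock:
            done[kind] += 1

# Illustrative mix: two CPU-only threads and two GPU-offloading threads.
threads = [threading.Thread(target=worker, args=(k,))
           for k in ("cpu", "cpu", "gpu", "gpu")]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert done["cpu"] + done["gpu"] == 100   # every unit processed exactly once
```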


Hot results (as of September 7th, 2009)

Running Firefly in parallel across these two systems (commodity 1 Gb Ethernet switch, NT-MPICH v. 1.5) in the double dynamic load balancing mode (i.e., balancing both between compute threads within each process and between the two instances of the entire parallel process) gives the following results:



Notes to the blue cells in Table I:


Notes to the red cells in Table II:



We are grateful to Brent Oster (NVidia Corporation) for the donation of two Tesla C1060 boards to the PC GAMESS/Firefly team.

Our big thanks to Andrey Ovdenko (Intercom-PC Moscow) for his excellent technical support.


Copyright © 2009 by Alex A. Granovsky


Other pages containing results for the same benchmark



Back to the Firefly and PC GAMESS performance page

Back to the Firefly homepage