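CPU or wall clock times are in seconds, with the relative speedup in parentheses (relative to the 1-core run unless noted). Rows give the number of cores used; "All 16 logical" means all 16 logical processors were used. See the notes below for the meaning of "(a)" and "(b)".

Test 1, CPU time and relative speedup
Cores used     | Config. I       | Config. II
1              | 2118.1 (100%)   | 2114.5 (100%)
2 (a)          | 1066.8 (199%)   | 1063.6 (199%)
2 (b)          | 1113.2 (190%)   | 1109.2 (191%)
3 (b)          | 753.1 (281%)    | 751.6 (281%)
4 (b)          | 572.7 (370%)    | 570.5 (371%)
5 (b)          | 455.2 (465%)    | 452.4 (467%)
6 (b)          | 388.3 (545%)    | 386.3 (547%)
7 (b)          | 340.9 (621%)    | 337.9 (626%)
8 (b)          | 303.8 (697%)    | 302.1 (700%)
All 16 logical | 315.9 (670%)    | 304.6 (694%)

Test 2, wall clock time and relative speedup
Cores used     | Config. I       | Config. II
1              | 105.3 (100%)    | 104.9 (100%)
2 (a)          | 65.5 (161%)     | 65.2 (161%)
2 (b)          | 65.6 (160%)     | 65.2 (161%)
3 (b)          | 52.8 (199%)     | 53.8 (195%)
4 (b)          | 46.0 (229%)     | 46.7 (225%)
5 (b)          | 42.6 (247%)     | 42.2 (249%)
6 (b)          | 40.1 (263%)     | 39.9 (263%)
7 (b)          | 37.9 (278%)     | 37.8 (278%)
8 (b)          | 37.0 (285%)     | 36.9 (284%)
All 16 logical | 41.7 (253%)     | 44.0 (238%)

Test 3, CPU time and relative speedup
Cores used     | Config. I       | Config. II
1              | 3811.0 (100%)   | 3805.7 (100%)
2 (a)          | 1835.3 (208%)   | 1829.4 (208%)
2 (b)          | 1905.4 (200%)   | 1894.6 (201%)
3 (b)          | 1275.1 (299%)   | 1269.4 (300%)
4 (b)          | 961.5 (396%)    | 959.4 (397%)
5 (b)          | 766.0 (498%)    | 764.5 (498%)
6 (b)          | 645.2 (591%)    | 641.5 (593%)
7 (b)          | 557.5 (684%)    | 553.1 (688%)
8 (b)          | 489.8 (778%)    | 485.0 (785%)
All 16 logical | 469.4 (812%)    | 453.2 (840%)

Test 4, wall clock time and relative speedup
Cores used     | Config. I       | Config. II
1              | 508.5 (100%)    | 508.3 (100%)
2 (a)          | 269.4 (189%)    | 268.2 (190%)
2 (b)          | 268.8 (189%)    | 268.0 (190%)
3 (b)          | 199.3 (255%)    | 198.6 (256%)
4 (b)          | 157.7 (322%)    | 157.5 (327%)
5 (b)          | 128.1 (397%)    | 127.7 (398%)
6 (b)          | 123.8 (411%)    | 121.8 (417%)
7 (b)          | 107.8 (472%)    | 106.2 (479%)
8 (b)          | 92.8 (548%)     | 91.9 (553%)
All 16 logical | 79.6 (639%)     | 79.3 (641%)

Test 5, CPU time and relative speedup
Cores used     | Config. I       | Config. II
1              | 3185.7 (100%)   | 3178.3 (100%)
2 (a)          | 1616.3 (197%)   | 1610.0 (197%)
2 (b)          | 1687.3 (189%)   | 1674.9 (190%)
3 (b)          | 1155.7 (276%)   | 1150.3 (276%)
4 (b)          | 895.2 (356%)    | 889.3 (357%)
5 (b)          | 726.6 (438%)    | 717.4 (443%)
6 (b)          | 632.5 (504%)    | 624.5 (509%)
7 (b)          | 574.0 (555%)    | 561.4 (566%)
8 (b)          | 526.2 (605%)    | 511.4 (621%)
All 16 logical | 656.2 (485%)    | 595.5 (534%)

Test 6, CPU time and relative speedup
Cores used     | Config. I       | Config. II
1              | 12768.2 (100%)  | 12736.9 (100%)
2 (a)          | 6433.2 (198%)   | 6406.3 (199%)
2 (b)          | 6645.4 (192%)   | 6633.3 (192%)
3 (b)          | 4492.0 (284%)   | 4472.6 (285%)
4 (b)          | 3399.7 (376%)   | 3390.9 (376%)
5 (b)          | 2723.8 (469%)   | 2715.6 (469%)
6 (b)          | 2318.1 (551%)   | 2306.7 (552%)
7 (b)          | 2014.5 (634%)   | 1994.7 (639%)
8 (b)          | 1776.7 (719%)   | 1756.5 (725%)
All 16 logical | 1795.9 (711%)   | 1708.1 (746%)

Standard MP4(SDTQ) benchmark, wall clock time and relative speedup
Cores used     | Config. I       | Config. II
1              | 4246.1 (100%)   | 4251.0 (100%)
2 (a)          | 2193.1 (194%)   | 2167.3 (196%)
2 (b)          | 2255.8 (188%)   | 2215.8 (192%)
3 (b)          | 1556.2 (273%)   | 1556.7 (273%)
4 (b)          | 1206.7 (352%)   | 1172.5 (363%)
5 (b)          | 992.3 (428%)    | 970.7 (438%)
6 (b)          | 859.1 (494%)    | 836.9 (508%)
7 (b)          | 759.4 (559%)    | 743.2 (572%)
8 (b)          | 677.3 (627%)    | 667.7 (637%)
All 16 logical | 793.1 (535%)    | 737.2 (576%)

Standard MCQDPT2 benchmark, wall clock time and relative speedup (speedup relative to the 4-core run; no data for the other core counts)
Cores used     | Config. I       | Config. II
1              | -               | -
2 (a)          | -               | -
2 (b)          | -               | -
3 (b)          | -               | -
4 (b)          | 80320.2 (100%)  | 79474.4 (100%)
5 (b)          | -               | -
6 (b)          | -               | -
7 (b)          | -               | -
8 (b)          | 47017.7 (171%)  | 43764.8 (182%)
All 16 logical | 40592.9 (198%)  | 40686.1 (195%)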
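The relative speedup values in the tables are simply the baseline time divided by the time on n cores. A minimal sketch of that arithmetic follows; the parallel-efficiency helper is added for illustration only (efficiency is not reported on this page), and both function names are hypothetical.

```python
# Relative speedup as reported in the tables: S(n) = T(base) / T(n), in percent.
# Parallel efficiency E(n) = S(n) * n_base / n is an illustrative extra; it is
# not reported on the original page.

def relative_speedup(t_base: float, t_n: float) -> float:
    """Return T(base) / T(n) as a percentage."""
    return 100.0 * t_base / t_n


def parallel_efficiency(t_base: float, t_n: float, n: int, n_base: int = 1) -> float:
    """Return speedup per core relative to the baseline core count, in percent."""
    return relative_speedup(t_base, t_n) * n_base / n


# Test 1, Configuration I, CPU times in seconds; the "(b)" runs are used for n > 1.
test1_config1 = {1: 2118.1, 2: 1113.2, 4: 572.7, 8: 303.8}

for n, t in test1_config1.items():
    s = relative_speedup(test1_config1[1], t)
    e = parallel_efficiency(test1_config1[1], t, n)
    print(f"{n} cores: speedup {s:.0f}%, efficiency {e:.0f}%")
```

The loop reproduces the 190%, 370% and 697% figures for 2, 4 and 8 cores. For the MCQDPT2 benchmark, whose baseline is the 4-core run, pass n_base=4: parallel_efficiency(80320.2, 47017.7, 8, n_base=4) gives about 85%.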
Hardware: dual Intel quad-core Xeon W5580 (Gainestown/Xeon 5500 series, Nehalem core), 3.2 GHz; 12 GB (12 x 1 GB at 1066 MHz; Configuration I) or 6 GB (6 x 1 GB at 1333 MHz; Configuration II) DDR3 RAM; 140 GB RAID0 volume (LSI 1078-based RAID controller, two 73 GB SAS HDDs).
OS: Windows Server 2008 Standard x64 Edition SP1.
Hyper-Threading Technology (HTT) and Turbo Boost Technology were enabled in the BIOS.
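To put the two memory configurations in perspective, the theoretical peak DRAM bandwidth can be estimated as channels x data rate x 8 bytes. This is a rough sketch under stated assumptions that do not come from this page: each Xeon 5500 socket has three DDR3 channels, all six channels are populated (two DIMMs per channel in Configuration I, one in Configuration II), and the quoted 1066/1333 MHz figures are the DDR3 data rates in MT/s.

```python
# Theoretical peak DRAM bandwidth for the two memory configurations.
# Assumptions (not stated on the benchmark page): two sockets x three DDR3
# channels, all channels populated, 8 bytes moved per transfer per channel,
# and the quoted "MHz" values are DDR3 data rates in MT/s.

CHANNELS = 2 * 3          # assumed: two sockets, three DDR3 channels each
BYTES_PER_TRANSFER = 8    # 64-bit DDR3 channel


def peak_bandwidth_gbs(transfer_rate_mts: float, channels: int = CHANNELS) -> float:
    """Peak bandwidth in GB/s (10**9 bytes/s) for a DDR3 data rate in MT/s."""
    return channels * transfer_rate_mts * 1e6 * BYTES_PER_TRANSFER / 1e9


config_i = peak_bandwidth_gbs(1066)   # 12 x 1 GB, assumed two DIMMs per channel
config_ii = peak_bandwidth_gbs(1333)  # 6 x 1 GB, assumed one DIMM per channel
print(f"Config. I: {config_i:.1f} GB/s, Config. II: {config_ii:.1f} GB/s, "
      f"ratio {config_ii / config_i:.2f}")
```

Under these assumptions the peaks are about 51.2 and 64.0 GB/s (a ratio of 1.25). The measured run times differ far less than that, which is expected: peak bandwidth bounds only the memory-bound phases of a calculation.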
Test 1: single-point direct DFT (B3LYP) energy plus gradient for a medium-size system (623 basis functions).
Test 2: single-point semiempirical (PM3) energy plus gradient for a large system (540 atoms, 2160 basis functions).
Test 3: single-point direct MP2 energy for a medium-size system (623 basis functions; the same system as in Test 1).
Test 4: single-point two-state MCQDPT2 energy with the ISA shift of energy denominators for a small model system.
Test 5: single-point direct CASSCF(12,12) for a medium-size system (retinal molecule, cc-pVDZ, 565 Cartesian basis functions) using the ALDET code.
Test 6: single-point direct CIS energy plus gradient of the first excited state of a medium-size system (porphyrin molecule, cc-pVTZ with aug-cc on the nitrogen atoms, 1130 Cartesian basis functions, D2h symmetry).
Tests 2 and 4, as well as the standard MCQDPT2 and MP4(SDTQ) benchmarks, were run in multithreaded mode; the other tests were run in standard parallel mode using dynamic load balancing over the P2P interface. MPI: Intel MPI version 3.1, launched with the mpiexec.exe -genv I_MPI_WAIT_MODE 1 setting. Unless explicitly stated otherwise, all multithreaded benchmarks used only a single logical processor of each allotted CPU core. The Call64 switch was turned on for all tests for faster processing. Note that Test 2 does not scale well, mainly due to limitations of Firefly's semiempirical code, while Test 4 would scale much better for a larger job. Test 5 is the most memory- and communication-intensive test. Test 2 and the MP4(SDTQ) benchmark are mainly DGEMM-limited and thus do not benefit from HTT. All CPU and wall clock times are given in seconds, as measured on the master node.
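The HTT remarks can be checked against the table data by taking the ratio of the 8-core time to the time on all 16 logical processors: a ratio above 1 means running two threads per core helped, below 1 means it hurt. The sketch uses the Configuration I times only; the dictionary layout and the htt_gain helper are illustrative and not part of Firefly.

```python
# Effect of Hyper-Threading: T(8 cores) / T(all 16 logical processors).
# A ratio above 1 means HTT helped; below 1 means it hurt.
# Times are the Configuration I values from the tables above, in seconds.

runs = {
    # test name: (time on 8 cores, time on all 16 logical processors)
    "Test 1": (303.8, 315.9),
    "Test 2": (37.0, 41.7),
    "Test 3": (489.8, 469.4),
    "Test 4": (92.8, 79.6),
    "Test 5": (526.2, 656.2),
    "Test 6": (1776.7, 1795.9),
    "MP4(SDTQ)": (677.3, 793.1),
    "MCQDPT2": (47017.7, 40592.9),
}


def htt_gain(t_8: float, t_16: float) -> float:
    """Return T(8 cores) / T(16 logical processors)."""
    return t_8 / t_16


for name, (t8, t16) in runs.items():
    g = htt_gain(t8, t16)
    verdict = "helps" if g > 1 else "hurts"
    print(f"{name:10s} {g:.2f}  HTT {verdict}")
```

With these values, Test 2, Test 5 and MP4(SDTQ) come out well below 1 (about 0.89, 0.80 and 0.85), while Test 4 and the MCQDPT2 benchmark gain roughly 16 to 17% from HTT. Configuration II gives somewhat different ratios for Tests 1 and 6, so small deviations from 1 should not be over-interpreted.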
In each pair of values, the first is for Configuration I and the second for Configuration II, which has somewhat faster memory settings.
Entries marked "(b)" were obtained using the default OS scheduler.
Entries marked "2 (a)" were obtained with each Firefly process or worker thread assigned to its own physical processor, so the effect of the activated Turbo Boost can easily be seen by comparing these data with the "2 (b)" data.
Hence, the scalability data above are not strictly accurate: all tests running on a single CPU core (or two cores in the "2 (a)" case) were executed at a higher CPU clock speed (3.333 GHz) than those using two ("2 (b)" case) or more CPU cores (3.2 GHz), which makes the relative speedups somewhat lower than they would be at a constant clock.
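The effect of the clock difference can be estimated by rescaling the 1-core time to 3.2 GHz before computing the speedup. This is a rough sketch that assumes run time scales inversely with core clock, an idealization that ignores memory- and communication-bound phases; the clock values are the ones quoted in the Turbo Boost note, and the helper name is hypothetical.

```python
# Rough clock-normalized speedup. Assumes run time is inversely proportional
# to core clock, which ignores memory- and communication-bound phases.
# Clock values are those quoted on this page.

TURBO_GHZ = 3.333  # single core, or two cores in the "2 (a)" case
BASE_GHZ = 3.2     # two cores ("2 (b)") or more


def normalized_speedup(t_1core: float, t_n: float) -> float:
    """Speedup in percent after rescaling the 1-core time to BASE_GHZ."""
    t_1core_at_base = t_1core * TURBO_GHZ / BASE_GHZ
    return 100.0 * t_1core_at_base / t_n


# Test 1, Configuration I: 1 core 2118.1 s, 8 cores 303.8 s (reported 697%).
print(f"{normalized_speedup(2118.1, 303.8):.0f}%")

# The 2 (b) / 2 (a) time ratio is an empirical estimate of the clock effect.
print(f"observed: {1113.2 / 1066.8:.3f}, clock ratio: {TURBO_GHZ / BASE_GHZ:.3f}")
```

Under this idealization, the Test 1 eight-core speedup of 697% corresponds to roughly 726% at a common 3.2 GHz clock, and the observed 2 (b)/2 (a) time ratio for Test 1 (1.043) is close to the clock ratio (1.042). For the multithreaded Tests 2 and 4 the "2 (a)" and "2 (b)" times are nearly equal, so the correction should not be applied blindly.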
In contrast to the easily visible Turbo Boost-related speedup, we did not notice any significant negative impact on performance that could be attributed to the NUMA architecture of this system.
We are grateful to Alexey Belogortsev, Andrey Kudryavtsev, and Alexey Rogachkov (Intel Russia) for providing access to this system and for technical support.