P.S. I have just received the first results with mxgsum=00400. Performance has changed: on 16, 24, and 32 cores it is 161, 151, and 117%. That is, on 24 cores performance increases, but on 32 cores it decreases strongly.
I continue to test...
On Thu Apr 1 '10 7:25pm, Alex Granovsky wrote
>Thanks for providing the output files. They are really very informative.
>It would be fine to have this information from the start.
>First, as you can see, the SCF part scales very differently as
>compared with MP2 stage. SCF uses MPI for communications while
>MP2 uses P2P interface. In your example, SCF scales very poorly,
>while MP2 itself scales quite well; thus the problem is with MPI
>rather than P2P.
>The cryptic numbers at the end of outputs are just the overall
>number of CPU clocks spent in various parts of program.
>Of them, counters 1-10 are currently defined for communications.
>If you compare the 16-core and 32-core runs, you'll see:
16 cores (8 cores x 2 boxes)
Nonzero profiling timers on node 0:
Timer #   1, value : 6.00125988100000000D+09
Timer #   2, value : 4.02343449660000000D+10
Timer #   3, value : 7.99721350000000000D+07
Timer #   5, value : 5.55376825400000000D+09
Timer #   6, value : 5.54354738900000000D+09
Timer #  11, value : 2.84021331600000000D+09
Timer #  12, value : 1.02876740200000000D+09
Timer # 150, value : 5.90478401700000000D+09
Timer # 151, value : 1.74074892000000000D+08
Timer # 152, value : 1.49336411300000000D+09
Timer # 500, value : 4.81655174580000000D+10
Timer # 505, value : 1.79228612600000000D+09

32 cores (8 cores x 4 boxes)
Nonzero profiling timers on node 0:
Timer #   1, value : 9.06232245540000000D+10
Timer #   2, value : 4.76818422955000000D+11
Timer #   3, value : 3.90484858000000000D+08
Timer #   5, value : 7.09174915700000000D+09
Timer #   6, value : 7.07240461100000000D+09
Timer #  11, value : 2.72791984200000000D+09
Timer #  12, value : 1.00909734800000000D+09
Timer # 150, value : 5.01572701400000000D+09
Timer # 151, value : 1.20019209000000000D+08
Timer # 152, value : 1.50803634400000000D+09
Timer # 500, value : 5.70288313900000000D+09
Timer # 505, value : 1.84421787000000000D+09
>The most notable difference is with counter #2 that corresponds
>to MPI_Allreduce() calls. In particular 4.76818422955000000D+11
>CPU clocks means 168.4 seconds spent inside MS MPI (according
>to output CPU frequency is 2.83 GHz) performing MPI_Allreduce().
>Actually, this call is primarily used by the SCF code to sum up and
>gather the completed Fock operator, and is used once per SCF
>iteration (and there are 15 SCF iterations). This means that each
>call consumed more than 10 seconds. What is interesting is that
>the amount of data for Allreduce is not very large - namely,
>it is just 4*N*(N+1) bytes, where N is the number of Cartesian
>AOs, i.e. it is ca. 1.5 MB per each of 32 cores.
>Thus, the bottleneck is probably not related to the network
>cards. It could be related to either a low-quality switch
>or some incompatibility between the NIC settings and the switch
>(e.g., check whether jumbo packets are supported). However, this
>does not seem very likely.
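The arithmetic in the paragraph above can be checked with a short Python sketch. The timer #2 values, the 2.83 GHz clock, and the 15 SCF iterations are taken from the outputs and text above; the function names are illustrative, and N = 612 is a hypothetical value chosen only because it gives roughly 1.5 MB (the actual number of AOs is not stated in this thread).

```python
# Timer #2 (MPI_Allreduce) on node 0, in CPU clocks, from the two outputs above.
ALLREDUCE_CLOCKS = {16: 4.0234344966e10, 32: 4.76818422955e11}
CPU_HZ = 2.83e9        # CPU frequency reported in the output
SCF_ITERATIONS = 15    # one Allreduce call per SCF iteration, per the post


def clocks_to_seconds(clocks, hz=CPU_HZ):
    """Convert a CPU clock count into wall seconds at the given frequency."""
    return clocks / hz


def allreduce_message_bytes(n_ao):
    """Message size from the post: 4*N*(N+1) bytes for N Cartesian AOs."""
    return 4 * n_ao * (n_ao + 1)


for cores, clocks in sorted(ALLREDUCE_CLOCKS.items()):
    total = clocks_to_seconds(clocks)
    per_call = total / SCF_ITERATIONS
    print(f"{cores} cores: {total:.1f} s in Allreduce, {per_call:.2f} s per SCF iteration")

# Hypothetical N, used only to illustrate the size formula.
n_ao = 612
print(f"N = {n_ao}: {allreduce_message_bytes(n_ao) / 1e6:.2f} MB per Allreduce")
```

On these numbers the 32-core run spends roughly twelve times longer in MPI_Allreduce than the 16-core run, although the message size is the same in both cases.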
>Finally, a couple of suggestions to try:
$mpi
mxgsum=0x400 ! i.e., mxgsum=1024 in decimal
$end

$mpi mxgsum=0x800 $end

$mpi mxgsum=0x1000 $end

$mpi mxgsum=0x2000 $end

$mpi mxgsum=0x4000 $end

etc ...
>to find the optimal value of mxgsum (the size, in 8 byte words,
>of the atomic message for MPI_Allreduce operation). Most likely,
>this should help to tune performance. If it does not, check if
>MS MPI is working in Network Direct mode, and if it is not,
>try to find the reason.
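To get a feel for what the suggested mxgsum values mean, here is a minimal Python sketch that converts each one to bytes and estimates how many pieces a message of about 1.5 MB would be split into. It assumes that the Allreduce buffer is simply cut into consecutive pieces of at most mxgsum 8-byte words; that splitting scheme, the function names, and the exact 1.5 MiB figure are assumptions made for illustration, not a description of the program's internals.

```python
import math

WORD_BYTES = 8                      # mxgsum is given in 8-byte words
MESSAGE_BYTES = 3 * 1024**2 // 2    # "ca. 1.5 MB" from the post, taken as 1.5 MiB


def chunk_bytes(mxgsum_words):
    """Size in bytes of one atomic Allreduce message for a given mxgsum."""
    return mxgsum_words * WORD_BYTES


def chunks_needed(message_bytes, mxgsum_words):
    """Number of pieces, assuming the buffer is cut into mxgsum-word chunks."""
    return math.ceil(message_bytes / chunk_bytes(mxgsum_words))


for mxgsum in (0x400, 0x800, 0x1000, 0x2000, 0x4000):
    print(
        f"mxgsum={mxgsum:#x} ({mxgsum} words): "
        f"{chunk_bytes(mxgsum)} bytes per message, "
        f"{chunks_needed(MESSAGE_BYTES, mxgsum)} messages"
    )
```

Under this assumption, each doubling of mxgsum halves the number of Allreduce messages per SCF iteration, which is why it is worth scanning several values rather than testing only one.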
[ This message was edited on Fri Apr 2 '10 at 11:41am by the author ]