Firefly and PC GAMESS-related discussion club

Re^8: More advanced problem – p2p fails

Hi,
Alex, thank you very much for the detailed comments! I'll try testing the cluster with your option.
However, I don't understand what the "x" in 0x400 etc. means. Is it 0, or 1, or …? If the number 1024 is written here in hex format, then what is the "x"? Should I vary the digit in this position?
Can somebody explain the meaning of this "x" to me?
Thanks!

P.S. I have just received the first results with mxgsum=00400. Performance has changed: on 16, 24, and 32 cores it is now 161, 151, and 117%. That is, performance improves on 24 cores but drops sharply on 32 cores.
I am continuing to test...

-------------------------------------------------------
On Thu Apr 1 '10 7:25pm, Alex Granovsky wrote
---------------------------------------------
>Hi Vyacheslav,

>thanks for providing output files. They are really very informative.
>It would have been good to have this information from the start.

>First, as you can see, the SCF part scales very differently as
>compared with MP2 stage. SCF uses MPI for communications while
>MP2 uses P2P interface. In your example, SCF scales very poorly,
>while MP2 itself scales quite well; thus the problem is with MPI
>rather than P2P.

>The cryptic numbers at the end of the outputs are just the overall
>number of CPU clocks spent in various parts of the program.
>Of them, counters 1-10 are currently defined for communications.

>If you compare the 16-core and 32-core runs, you'll see:

 16 cores (8 cores x 2 boxes)

 Nonzero profiling timers on node    0:
 Timer #    1,    value :  6.00125988100000000D+09
 Timer #    2,    value :  4.02343449660000000D+10
 Timer #    3,    value :  7.99721350000000000D+07
 Timer #    5,    value :  5.55376825400000000D+09
 Timer #    6,    value :  5.54354738900000000D+09
 Timer #   11,    value :  2.84021331600000000D+09
 Timer #   12,    value :  1.02876740200000000D+09
 Timer #  150,    value :  5.90478401700000000D+09
 Timer #  151,    value :  1.74074892000000000D+08
 Timer #  152,    value :  1.49336411300000000D+09
 Timer #  500,    value :  4.81655174580000000D+10
 Timer #  505,    value :  1.79228612600000000D+09

 32 cores (8 cores x 4 boxes)

 Nonzero profiling timers on node    0:
 Timer #    1,    value :  9.06232245540000000D+10
 Timer #    2,    value :  4.76818422955000000D+11
 Timer #    3,    value :  3.90484858000000000D+08
 Timer #    5,    value :  7.09174915700000000D+09
 Timer #    6,    value :  7.07240461100000000D+09
 Timer #   11,    value :  2.72791984200000000D+09
 Timer #   12,    value :  1.00909734800000000D+09
 Timer #  150,    value :  5.01572701400000000D+09
 Timer #  151,    value :  1.20019209000000000D+08
 Timer #  152,    value :  1.50803634400000000D+09
 Timer #  500,    value :  5.70288313900000000D+09
 Timer #  505,    value :  1.84421787000000000D+09

>The most notable difference is in counter #2, which corresponds
>to MPI_Allreduce() calls. In particular, 4.76818422955000000D+11
>CPU clocks means 168.4 seconds spent inside MS MPI (according
>to the output, the CPU frequency is 2.83 GHz) performing MPI_Allreduce().
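
The conversion from clock counts to seconds can be reproduced with a short script. This is only a sketch in Python, not part of Firefly: the helper names `parse_timers` and `clocks_to_seconds` are made up for illustration, and the only inputs are the counter #2 lines and the 2.83 GHz frequency quoted above. It should give roughly 168 seconds for the 32-core run.

```python
import re

CPU_HZ = 2.83e9  # CPU frequency taken from the output quoted in this post

# Matches lines such as " Timer #    2,    value :  4.02343449660000000D+10"
TIMER_LINE = re.compile(r"Timer #\s*(\d+),\s*value\s*:\s*([0-9.]+D[+-]\d+)")


def parse_timers(text):
    """Return {timer number: clock count} for every timer line found in text."""
    timers = {}
    for match in TIMER_LINE.finditer(text):
        # Fortran prints double precision with a 'D' exponent; Python expects 'E'.
        timers[int(match.group(1))] = float(match.group(2).replace("D", "E"))
    return timers


def clocks_to_seconds(clocks, cpu_hz=CPU_HZ):
    """Convert a CPU clock count into wall seconds at the given frequency."""
    return clocks / cpu_hz


# Counter #2 (MPI_Allreduce) lines copied from the two runs above.
run16 = parse_timers(" Timer #    2,    value :  4.02343449660000000D+10")
run32 = parse_timers(" Timer #    2,    value :  4.76818422955000000D+11")

print(f"16 cores: {clocks_to_seconds(run16[2]):.1f} s in MPI_Allreduce")
print(f"32 cores: {clocks_to_seconds(run32[2]):.1f} s in MPI_Allreduce")
print(f"32-core / 16-core ratio: {run32[2] / run16[2]:.1f}")
```

Pasting a whole "Nonzero profiling timers" block into `parse_timers` works the same way, so all counters of two runs can be compared at once.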

>Actually, this call is primarily used by the SCF code to sum up and
>gather the completed Fock operator, and is used once per SCF
>iteration (and there are 15 SCF iterations). This means that each
>call consumed more than 10 seconds. What is interesting is that
>the amount of data for Allreduce is not very large - namely,
>it is just 4*N*(N+1) bytes, where N is the number of Cartesian
>AOs, i.e. it is ca. 1.5 MB for each of the 32 cores.
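
The arithmetic behind these two figures can be checked as follows. This is a rough sketch: the function names are hypothetical, and N = 612 is not taken from the output files but is simply a value that reproduces the quoted "ca. 1.5 MB". The 4*N*(N+1) formula presumably corresponds to N*(N+1)/2 packed matrix elements of 8 bytes each.

```python
def allreduce_bytes(n_ao):
    """Buffer size in bytes from the 4*N*(N+1) formula quoted above."""
    return 4 * n_ao * (n_ao + 1)


def seconds_per_call(total_seconds, n_calls):
    """Average time per MPI_Allreduce call."""
    return total_seconds / n_calls


# ASSUMPTION: N = 612 is back-solved from "ca. 1.5 MB"; the real N is in the output files.
N_EXAMPLE = 612

size_mb = allreduce_bytes(N_EXAMPLE) / 1e6
per_call = seconds_per_call(168.4, 15)  # 168.4 s over 15 SCF iterations

print(f"N = {N_EXAMPLE}: {size_mb:.2f} MB per Allreduce")
print(f"{per_call:.1f} s per call")
```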

>Thus, the bottleneck is probably not related to the network
>cards. It could, however, be related to either a low-quality switch
>or some incompatibility between the NIC settings and the switch (e.g.,
>check whether jumbo packets are supported, etc.). However, this
>does not seem very likely.

>Finally, a couple of suggestions to try:

 $mpi 
   mxgsum=0x400 ! i.e., mxgsum=1024 in decimal 
 $end 

 $mpi 
   mxgsum=0x800
 $end 

 $mpi 
   mxgsum=0x1000
 $end 

 $mpi 
   mxgsum=0x2000
 $end 

 $mpi 
   mxgsum=0x4000
 $end 

etc ...

>to find the optimal value of mxgsum (the size, in 8-byte words,
>of the atomic message for MPI_Allreduce operation). Most likely,
>this should help to tune performance. If it does not, check if
>MS MPI is working in Network Direct mode, and if it is not,
>try to find the reason.
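
On the notation used in the values above: the "0x" prefix is the usual marker for a hexadecimal number, so the "x" is a literal character and not a digit to vary. 0x400 is 1024 in decimal, 0x800 is 2048, and so on. The sketch below (Python, with hypothetical helper names; it only formats text and does not run Firefly) lists the same doubling series together with the message size in bytes, taking 8 bytes per word as stated above.

```python
WORD_BYTES = 8  # mxgsum is given in 8-byte words


def mxgsum_candidates(start=0x400, stop=0x4000):
    """Yield the doubling series 0x400, 0x800, ... up to and including stop."""
    value = start
    while value <= stop:
        yield value
        value *= 2


def mpi_group(mxgsum):
    """Format one $mpi group in the same layout as the examples above."""
    return f" $mpi\n   mxgsum={mxgsum:#x}\n $end"


# "0x400" parsed as base 16 is the same number as decimal 1024.
assert int("0x400", 16) == 1024

for words in mxgsum_candidates():
    print(f"{words:#x} = {words} words = {words * WORD_BYTES} bytes")

print(mpi_group(0x400))
```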

>Regards,
>Alex Granovsky

[ This message was edited on Fri Apr 2 '10 at 11:41am by the author ]

Fri Apr 2 '10 11:41am
