Firefly and PC GAMESS-related discussion club


Re^9: More advanced problem p2p fails

Alex Granovsky


This is just C-like syntax for hexadecimal constants.
The 0x is only a prefix that marks hexadecimal input.
In particular, 0x400 means 400 hexadecimal, which is 1024 decimal.
Firefly also accepts 0o (for octal) and 0b (for binary) constants.
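
To check the arithmetic, here is a small sketch in Python, which happens to use the same 0x / 0o / 0b prefixes. It only illustrates the notation and is not Firefly's own input parser; the helper name parse_constant is made up for this example.

```python
# Illustration only: Python uses the same 0x / 0o / 0b prefixes, so
# int(text, 0) shows what each spelling of the same number evaluates to.
# parse_constant is a hypothetical helper, not part of Firefly.
def parse_constant(text: str) -> int:
    """Parse a decimal, hexadecimal (0x), octal (0o) or binary (0b) literal."""
    return int(text, 0)

for literal in ("0x400", "0o2000", "0b10000000000", "1024"):
    print(f"{literal:>15} -> {parse_constant(literal)}")
```

All four spellings denote the same value, so there is nothing to vary in the position of the "x".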


On Fri Apr 2 '10 10:21am, Vyacheslav wrote
>Alex, thank you very much for the detailed comments! I'll try to test the cluster with your option.
>However, I don't understand what the "x" in 0x400 etc. means. Is it 0, or 1, or something else? If the number 1024 is in hex format here, then what is the "x"? Should I vary the number in this position?
>Can somebody explain the meaning of this "x" to me?

>On Thu Apr 1 '10 7:25pm, Alex Granovsky wrote
>>Hi Vyacheslav,

>>thanks for providing the output files. They are really very informative.
>>It would have been good to have this information from the start.

>>First, as you can see, the SCF part scales very differently
>>compared with the MP2 stage. SCF uses MPI for communications while
>>MP2 uses the P2P interface. In your example, SCF scales very poorly,
>>while MP2 itself scales quite well; thus the problem is with MPI
>>rather than P2P.

>>The cryptic numbers at the end of the outputs are just the overall
>>number of CPU clocks spent in various parts of the program.
>>Of them, counters 1-10 are currently defined for communications.

>>If you compare the 16-core and 32-core runs, you'll see:


 16 cores (8 cores x 2 boxes)

 Nonzero profiling timers on node    0:
 Timer #    1,    value :  6.00125988100000000D+09
 Timer #    2,    value :  4.02343449660000000D+10
 Timer #    3,    value :  7.99721350000000000D+07
 Timer #    5,    value :  5.55376825400000000D+09
 Timer #    6,    value :  5.54354738900000000D+09
 Timer #   11,    value :  2.84021331600000000D+09
 Timer #   12,    value :  1.02876740200000000D+09
 Timer #  150,    value :  5.90478401700000000D+09
 Timer #  151,    value :  1.74074892000000000D+08
 Timer #  152,    value :  1.49336411300000000D+09
 Timer #  500,    value :  4.81655174580000000D+10
 Timer #  505,    value :  1.79228612600000000D+09


 32 cores (8 cores x 4 boxes)

 Nonzero profiling timers on node    0:
 Timer #    1,    value :  9.06232245540000000D+10
 Timer #    2,    value :  4.76818422955000000D+11
 Timer #    3,    value :  3.90484858000000000D+08
 Timer #    5,    value :  7.09174915700000000D+09
 Timer #    6,    value :  7.07240461100000000D+09
 Timer #   11,    value :  2.72791984200000000D+09
 Timer #   12,    value :  1.00909734800000000D+09
 Timer #  150,    value :  5.01572701400000000D+09
 Timer #  151,    value :  1.20019209000000000D+08
 Timer #  152,    value :  1.50803634400000000D+09
 Timer #  500,    value :  5.70288313900000000D+09
 Timer #  505,    value :  1.84421787000000000D+09
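
One way to compare the two listings is to parse the timer lines and take ratios. The sketch below is in Python and only uses the first three timers copied from the listings above; the regular expression and the helper name parse_timers are assumptions made for this example, not anything Firefly provides.

```python
import re

# Matches lines such as " Timer #    2,    value :  4.02343449660000000D+10"
TIMER_RE = re.compile(r"Timer #\s*(\d+),\s*value\s*:\s*([0-9.+-]+D[+-]\d+)")

def parse_timers(text: str) -> dict[int, float]:
    """Map timer number -> clock count, converting Fortran 'D' exponents."""
    return {int(num): float(val.replace("D", "E"))
            for num, val in TIMER_RE.findall(text)}

# Communication timers 1-3, copied from the 16-core listing
run16 = parse_timers("""
 Timer #    1,    value :  6.00125988100000000D+09
 Timer #    2,    value :  4.02343449660000000D+10
 Timer #    3,    value :  7.99721350000000000D+07
""")

# The same timers from the 32-core listing
run32 = parse_timers("""
 Timer #    1,    value :  9.06232245540000000D+10
 Timer #    2,    value :  4.76818422955000000D+11
 Timer #    3,    value :  3.90484858000000000D+08
""")

for num in sorted(run16):
    ratio = run32[num] / run16[num]
    print(f"timer {num}: 32-core / 16-core = {ratio:.1f}x")
```

A ratio well above 1 for a communication timer means that part became more expensive when going from 2 to 4 boxes.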

>>The most notable difference is in counter #2, which corresponds
>>to MPI_Allreduce() calls. In particular, 4.76818422955000000D+11
>>CPU clocks means about 168.5 seconds spent inside MS MPI (according
>>to the output, the CPU frequency is 2.83 GHz) performing MPI_Allreduce().
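
The conversion from clocks to seconds is a plain division by the CPU frequency. A minimal Python sketch, using only the counter #2 values and the 2.83 GHz figure quoted in this thread (the function name is made up for this example):

```python
CPU_HZ = 2.83e9  # CPU frequency reported in the output (2.83 GHz)

def clocks_to_seconds(clocks: float, hz: float = CPU_HZ) -> float:
    """Convert a raw CPU clock count into seconds at a fixed frequency."""
    return clocks / hz

# Counter #2 (MPI_Allreduce) from the two listings
allreduce_16 = clocks_to_seconds(4.02343449660e10)
allreduce_32 = clocks_to_seconds(4.76818422955e11)

print(f"16 cores: {allreduce_16:.1f} s in MPI_Allreduce")
print(f"32 cores: {allreduce_32:.1f} s in MPI_Allreduce")
```

This assumes the clock rate stayed at 2.83 GHz for the whole run; with frequency scaling the result is only approximate.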

>>Actually, this call is primarily used by the SCF code to sum up and
>>gather the completed Fock operator, and it is used once per SCF
>>iteration (and there are 15 SCF iterations). This means that each
>>call consumed more than 10 seconds. What is interesting is that
>>the amount of data for Allreduce is not very large: namely,
>>it is just 4*N*(N+1) bytes, where N is the number of Cartesian
>>AOs, i.e. it is ca. 1.5 MB for each of the 32 cores.
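
The per-call time and the message size can be checked with a few lines of Python. The 4*N*(N+1) formula is presumably N*(N+1)/2 elements of a symmetric matrix at 8 bytes each. N itself is not given in the thread, so the sketch only back-solves a rough estimate from the quoted 1.5 MB, taking 1 MB as 10^6 bytes (an assumption); the function names are made up for this example.

```python
import math

def allreduce_bytes(n_aos: int) -> int:
    """Size of the reduced data: N*(N+1)/2 elements at 8 bytes = 4*N*(N+1)."""
    return 4 * n_aos * (n_aos + 1)

def aos_for_bytes(size_bytes: float) -> float:
    """Positive root of 4*N*(N+1) = size_bytes, i.e. the implied N."""
    return (-1 + math.sqrt(1 + size_bytes)) / 2

# Time per MPI_Allreduce call: total seconds over 15 SCF iterations
seconds_total = 4.76818422955e11 / 2.83e9
per_call = seconds_total / 15

# Estimate of N implied by "ca. 1.5 MB" (assuming 1 MB = 10**6 bytes)
n_est = aos_for_bytes(1.5e6)

print(f"per call: {per_call:.1f} s")
print(f"implied N: about {n_est:.0f} Cartesian AOs")
print(f"check: N=612 gives {allreduce_bytes(612)} bytes")
```

More than 10 seconds to reduce about 1.5 MB is far below what the network hardware should deliver, which is why the size of the data is not the suspect here.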

>>Thus, the bottleneck is probably not related to the network
>>cards. It could be related either to a low-quality switch
>>or to some incompatibility between the NIC settings and the switch
>>(e.g., check whether jumbo packets are supported, etc.). However, this
>>does not seem very likely.

>>Finally, a couple of suggestions to try:


   mxgsum=0x400 ! i.e., mxgsum=1024 in decimal 





etc ...

>>to find the optimal value of mxgsum (the size, in 8-byte words,
>>of the atomic message for the MPI_Allreduce operation). Most likely,
>>this should help to tune performance. If it does not, check whether
>>MS MPI is working in Network Direct mode, and if it is not,
>>try to find the reason.
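
To get a feeling for what different mxgsum values mean, the sketch below counts how many pieces one Fock-matrix reduction would be split into. It rests on two assumptions that the thread does not confirm: that "atomic message" means the data is sent in consecutive pieces of at most mxgsum 8-byte words, and that N is 612 (a hypothetical value chosen only because it gives roughly 1.5 MB). Only 0x400 comes from the post; the other candidate values are arbitrary.

```python
def words_in_message(n_aos: int) -> int:
    """Number of 8-byte words: N*(N+1)/2 elements of a symmetric matrix."""
    return n_aos * (n_aos + 1) // 2

def allreduce_pieces(total_words: int, mxgsum: int) -> int:
    """Pieces needed if each piece holds at most mxgsum words (assumed model)."""
    return (total_words + mxgsum - 1) // mxgsum  # ceiling division

N_AOS = 612  # hypothetical; the real number of Cartesian AOs is not stated
total = words_in_message(N_AOS)

# 0x400 is the value from the post; the other two are arbitrary examples
for mxgsum in (0x400, 0x1000, 0x4000):
    pieces = allreduce_pieces(total, mxgsum)
    print(f"mxgsum={mxgsum:#x} ({mxgsum} words): {pieces} pieces")
```

Whether fewer, larger pieces or more, smaller ones run faster depends on the MPI library and the network, so the optimum still has to be found by timing real runs.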

>>Alex Granovsky

Fri Apr 2 '10 11:00am