Firefly and PC GAMESS-related discussion club





Re^9: More advanced problem – p2p fails

Alex Granovsky
gran@classic.chem.msu.su


Hi,

this is just C-like syntax for hexadecimal constants.
The 0x is simply a prefix marking hexadecimal input.
In particular, 0x400 means 400 hexadecimal, which is 1024 decimal.
Firefly also accepts 0o (octal) and 0b (binary) prefixes for constants.
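
In case a small illustration helps: below is a minimal C sketch
(just an illustration of the notation, not Firefly input) showing that
the 0x form of 400 is the decimal value 1024.

 #include <stdio.h>
 #include <stdlib.h>

 int main(void)
 {
     /* 0x400 is ordinary C hexadecimal notation: 4 * 16 * 16 = 1024 */
     printf("0x400 = %d\n", 0x400);              /* prints 1024 */

     /* parsing the same string at run time; base 0 honors the 0x prefix */
     long v = strtol("0x400", NULL, 0);
     printf("strtol(\"0x400\") = %ld\n", v);     /* prints 1024 */
     return 0;
 }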

Regards,
Alex



On Fri Apr 2 '10 10:21am, Vyacheslav wrote
------------------------------------------
>Hi,
>Alex, thank you very much for the detailed comments! I'll try to test the cluster with your option.
>However, I don't understand what the "x" means in 0x400 etc. Is it 0 or 1 or …? If the number 1024 here is in hex format, then what is the "x"? Should I vary the number in this position?
>Can somebody explain the meaning of this "x" to me?
>Thanks!

>-------------------------------------------------------
>On Thu Apr 1 '10 7:25pm, Alex Granovsky wrote
>---------------------------------------------
>>Hi Vyacheslav,

>>thanks for providing the output files. They are really very informative.
>>It would have been good to have this information from the start.

>>First, as you can see, the SCF part scales very differently as
>>compared with the MP2 stage. SCF uses MPI for communications, while
>>MP2 uses the P2P interface. In your example, SCF scales very poorly,
>>while MP2 itself scales quite well; thus the problem is with MPI
>>rather than with P2P.

>>The cryptic numbers at the end of the outputs are just the overall
>>number of CPU clocks spent in various parts of the program.
>>Of them, counters 1-10 are currently defined for communications.

>>If you compare the 16-core and 32-core runs, you'll see:

>>

 16 cores (8 cores x 2 boxes)

 Nonzero profiling timers on node    0:
 Timer #    1,    value :  6.00125988100000000D+09
 Timer #    2,    value :  4.02343449660000000D+10
 Timer #    3,    value :  7.99721350000000000D+07
 Timer #    5,    value :  5.55376825400000000D+09
 Timer #    6,    value :  5.54354738900000000D+09
 Timer #   11,    value :  2.84021331600000000D+09
 Timer #   12,    value :  1.02876740200000000D+09
 Timer #  150,    value :  5.90478401700000000D+09
 Timer #  151,    value :  1.74074892000000000D+08
 Timer #  152,    value :  1.49336411300000000D+09
 Timer #  500,    value :  4.81655174580000000D+10
 Timer #  505,    value :  1.79228612600000000D+09

>>

 32 cores (8 cores x 4 boxes)

 Nonzero profiling timers on node    0:
 Timer #    1,    value :  9.06232245540000000D+10
 Timer #    2,    value :  4.76818422955000000D+11
 Timer #    3,    value :  3.90484858000000000D+08
 Timer #    5,    value :  7.09174915700000000D+09
 Timer #    6,    value :  7.07240461100000000D+09
 Timer #   11,    value :  2.72791984200000000D+09
 Timer #   12,    value :  1.00909734800000000D+09
 Timer #  150,    value :  5.01572701400000000D+09
 Timer #  151,    value :  1.20019209000000000D+08
 Timer #  152,    value :  1.50803634400000000D+09
 Timer #  500,    value :  5.70288313900000000D+09
 Timer #  505,    value :  1.84421787000000000D+09

>>The most notable difference is in counter #2, which corresponds
>>to MPI_Allreduce() calls. In particular, 4.76818422955000000D+11
>>CPU clocks means 168.4 seconds spent inside MS MPI (according
>>to the output, the CPU frequency is 2.83 GHz) performing MPI_Allreduce().

>>Actually, this call is primarily used by the SCF code to sum up and
>>gather the completed Fock operator, and it is issued once per SCF
>>iteration (and there are 15 SCF iterations). This means that each
>>call consumed more than 10 seconds. What is interesting is that
>>the amount of data for the Allreduce is not very large - namely,
>>it is just 4*N*(N+1) bytes, where N is the number of Cartesian
>>AOs, i.e. it is ca. 1.5 MB per each of the 32 cores.
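
As a quick back-of-the-envelope check (a sketch only: the value N = 626
below is hypothetical, chosen merely to reproduce the ~1.5 MB figure;
the actual N is reported in the Firefly output), the quoted numbers
follow from simple arithmetic:

 #include <stdio.h>

 int main(void)
 {
     /* timer #2 of the 32-core run, in CPU clocks, at the reported 2.83 GHz */
     double clocks = 4.76818422955e11;
     printf("time in MPI_Allreduce: %.1f s\n", clocks / 2.83e9);  /* roughly 168 s */

     /* Allreduce buffer: 4*N*(N+1) bytes for N Cartesian AOs */
     long n = 626;                              /* hypothetical N, see note above */
     double mb = 4.0 * n * (n + 1) / (1024.0 * 1024.0);
     printf("Allreduce buffer:      %.2f MB\n", mb);              /* ~1.5 MB */
     return 0;
 }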

>>Thus, the bottleneck is probably not related to the network
>>cards. It could, however, be related to either a low-quality switch
>>or some incompatibility between the NIC settings and the switch (e.g.,
>>check whether Jumbo frames are supported, etc.). However, this
>>does not seem very likely.

>>Finally, a couple of suggestions to try:

>>

 $mpi 
   mxgsum=0x400 ! i.e., mxgsum=1024 in decimal 
 $end 

 $mpi 
   mxgsum=0x800
 $end 

 $mpi 
   mxgsum=0x1000
 $end 

 $mpi 
   mxgsum=0x2000
 $end 

 $mpi 
   mxgsum=0x4000
 $end 

etc ...

>>to find the optimal value of mxgsum (the size, in 8-byte words,
>>of the atomic message for the MPI_Allreduce operation). Most likely,
>>this should help to tune performance. If it does not, check whether
>>MS MPI is working in Network Direct mode, and if it is not,
>>try to find the reason.
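
For what it is worth, the following is a conceptual C/MPI sketch of what
"summing an array in atomic messages of mxgsum 8-byte words" amounts to.
It is not Firefly's actual implementation, only an illustration of why
the per-call block size can matter for MPI_Allreduce performance:

 #include <mpi.h>
 #include <stddef.h>

 /* Global sum of `total` doubles, performed in chunks of at most
    `mxgsum` doubles (8-byte words) per MPI_Allreduce call. */
 static void chunked_allreduce(double *buf, size_t total, size_t mxgsum)
 {
     for (size_t off = 0; off < total; off += mxgsum) {
         size_t cnt = (total - off < mxgsum) ? (total - off) : mxgsum;
         MPI_Allreduce(MPI_IN_PLACE, buf + off, (int)cnt,
                       MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
     }
 }

 int main(int argc, char **argv)
 {
     MPI_Init(&argc, &argv);
     static double fock[4096];              /* stand-in for the Fock-matrix buffer */
     chunked_allreduce(fock, 4096, 0x400);  /* 0x400 = 1024 doubles = 8 KB per call */
     MPI_Finalize();
     return 0;
 }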

>>Regards,
>>Alex Granovsky
>


Fri Apr 2 '10 11:00am