Thanks for providing the output files. They are really very informative.
It would have been good to have this information from the start.
First, as you can see, the SCF part scales very differently from
the MP2 stage. SCF uses MPI for communications, while
MP2 uses the P2P interface. In your example, SCF scales very poorly,
while MP2 itself scales quite well; thus the problem is with MPI
rather than P2P.
The cryptic numbers at the end of the outputs are just the total
number of CPU clocks spent in various parts of the program.
Of them, counters 1-10 are currently defined for communications.
If you compare the 16-core and 32-core runs, you'll see:
16 cores (8 cores x 2 boxes)
Nonzero profiling timers on node 0:
Timer # 1, value : 6.00125988100000000D+09
Timer # 2, value : 4.02343449660000000D+10
Timer # 3, value : 7.99721350000000000D+07
Timer # 5, value : 5.55376825400000000D+09
Timer # 6, value : 5.54354738900000000D+09
Timer # 11, value : 2.84021331600000000D+09
Timer # 12, value : 1.02876740200000000D+09
Timer # 150, value : 5.90478401700000000D+09
Timer # 151, value : 1.74074892000000000D+08
Timer # 152, value : 1.49336411300000000D+09
Timer # 500, value : 4.81655174580000000D+10
Timer # 505, value : 1.79228612600000000D+09

32 cores (8 cores x 4 boxes)
Nonzero profiling timers on node 0:
Timer # 1, value : 9.06232245540000000D+10
Timer # 2, value : 4.76818422955000000D+11
Timer # 3, value : 3.90484858000000000D+08
Timer # 5, value : 7.09174915700000000D+09
Timer # 6, value : 7.07240461100000000D+09
Timer # 11, value : 2.72791984200000000D+09
Timer # 12, value : 1.00909734800000000D+09
Timer # 150, value : 5.01572701400000000D+09
Timer # 151, value : 1.20019209000000000D+08
Timer # 152, value : 1.50803634400000000D+09
Timer # 500, value : 5.70288313900000000D+09
Timer # 505, value : 1.84421787000000000D+09
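For a quick side-by-side comparison of the two timer dumps, something like the following Python sketch may help. It only parses the "Timer # n, value : x" entries quoted above (converting the Fortran D exponent) and prints the 32-core/16-core ratio for each counter. The helper name parse_timers is made up for this illustration and is not part of Firefly.

```python
import re

# Timer dumps copied from the two outputs quoted above.
RUN_16 = """
Timer # 1, value : 6.00125988100000000D+09
Timer # 2, value : 4.02343449660000000D+10
Timer # 3, value : 7.99721350000000000D+07
Timer # 5, value : 5.55376825400000000D+09
Timer # 6, value : 5.54354738900000000D+09
Timer # 11, value : 2.84021331600000000D+09
Timer # 12, value : 1.02876740200000000D+09
Timer # 150, value : 5.90478401700000000D+09
Timer # 151, value : 1.74074892000000000D+08
Timer # 152, value : 1.49336411300000000D+09
Timer # 500, value : 4.81655174580000000D+10
Timer # 505, value : 1.79228612600000000D+09
"""

RUN_32 = """
Timer # 1, value : 9.06232245540000000D+10
Timer # 2, value : 4.76818422955000000D+11
Timer # 3, value : 3.90484858000000000D+08
Timer # 5, value : 7.09174915700000000D+09
Timer # 6, value : 7.07240461100000000D+09
Timer # 11, value : 2.72791984200000000D+09
Timer # 12, value : 1.00909734800000000D+09
Timer # 150, value : 5.01572701400000000D+09
Timer # 151, value : 1.20019209000000000D+08
Timer # 152, value : 1.50803634400000000D+09
Timer # 500, value : 5.70288313900000000D+09
Timer # 505, value : 1.84421787000000000D+09
"""

TIMER_RE = re.compile(r"Timer #\s*(\d+), value :\s*([0-9.]+D[+-][0-9]+)")


def parse_timers(text):
    """Return {timer number: CPU clocks} (hypothetical helper)."""
    timers = {}
    for number, value in TIMER_RE.findall(text):
        # Fortran prints double precision as 1.0D+09; Python expects 1.0E+09.
        timers[int(number)] = float(value.replace("D", "E"))
    return timers


timers_16 = parse_timers(RUN_16)
timers_32 = parse_timers(RUN_32)
ratios = {n: timers_32[n] / timers_16[n] for n in sorted(timers_16)}

for n, ratio in ratios.items():
    print(f"Timer #{n:3d}: {timers_16[n]:.3e} -> {timers_32[n]:.3e}  (x{ratio:.2f})")
```

With these numbers, counters #1 and #2 grow by more than a factor of ten when going from 16 to 32 cores, while counter #500 shrinks, so the communication counters stand out immediately.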
The most notable difference is in counter #2, which corresponds
to MPI_Allreduce() calls. In particular, 4.76818422955000000D+11
CPU clocks means roughly 168 seconds spent inside MS MPI (according
to the output, the CPU frequency is 2.83 GHz) performing MPI_Allreduce().
This call is primarily used by the SCF code to sum up and
gather the completed Fock operator, and it is used once per SCF
iteration (and there are 15 SCF iterations). This means that each
call consumed more than 10 seconds. What is interesting is that
the amount of data for Allreduce is not very large: namely,
it is just 4*N*(N+1) bytes, where N is the number of Cartesian
AOs, i.e. it is ca. 1.5 MB for each of the 32 cores.
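The arithmetic above can be checked with a short Python sketch. The clock count, CPU frequency, iteration count and the 4*N*(N+1) formula are taken from the text. The value of N is not given in the outputs quoted here, so the sketch back-computes a rough N from the 1.5 MB figure; that estimate, and the ideal 1 Gbit/s wire time, are assumptions for illustration only.

```python
import math

ALLREDUCE_CLOCKS = 4.76818422955e11  # timer #2, 32-core run
CPU_HZ = 2.83e9                      # CPU frequency reported in the output
SCF_ITERATIONS = 15                  # one Allreduce of the Fock matrix per iteration
MESSAGE_BYTES = 1.5e6                # "ca. 1.5 MB" from the text
GIGABIT_BYTES_PER_S = 1e9 / 8        # assumption: ideal 1 Gbit/s link, no overhead


def allreduce_bytes(n_ao):
    """Message size in bytes for N Cartesian AOs: 4*N*(N+1)."""
    return 4 * n_ao * (n_ao + 1)


total_seconds = ALLREDUCE_CLOCKS / CPU_HZ
seconds_per_call = total_seconds / SCF_ITERATIONS

# Solve 4*N*(N+1) = MESSAGE_BYTES for N (positive root); an estimate only.
n_estimate = (-1 + math.sqrt(1 + MESSAGE_BYTES)) / 2

# Time to push one such message through an ideal gigabit link.
ideal_wire_seconds = MESSAGE_BYTES / GIGABIT_BYTES_PER_S

print(f"total time in MPI_Allreduce: {total_seconds:.1f} s")
print(f"time per call:               {seconds_per_call:.1f} s")
print(f"estimated N:                 {n_estimate:.0f}")
print(f"ideal wire time per message: {ideal_wire_seconds:.3f} s")
```

The gap between about 11 seconds per call and about 0.01 seconds of ideal wire time for a single 1.5 MB message is what makes raw bandwidth an unlikely explanation.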
Thus, the bottleneck is probably not related to the network
cards. It could be related to either a low-quality switch
or some incompatibility between the NIC settings and the switch (e.g.,
check whether Jumbo packets are supported, etc.). However, this
does not seem very likely.
Finally, a couple of suggestions to try:
 $mpi
 mxgsum=0x400 ! i.e., mxgsum=1024 in decimal
 $end

 $mpi mxgsum=0x800 $end

 $mpi mxgsum=0x1000 $end

 $mpi mxgsum=0x2000 $end

 $mpi mxgsum=0x4000 $end

etc ...
to find the optimal value of mxgsum (the size, in 8-byte words,
of the atomic message for the MPI_Allreduce operation). Most likely,
this will help to tune performance. If it does not, check whether
MS MPI is working in Network Direct mode, and if it is not,
try to find out why.
On Thu Apr 1 '10 2:39pm, Vyacheslav wrote
>Alexei, thanks for your hints. I've run a bigger task on the cluster (about 1 hour on 8 cores, UHF optimize task), but the results are even worse than for the standard MP2 task. Performance for the 16-, 24- and 32-core runs is 176, 67 and 95%, respectively. CPU utilization on 24 and 32 cores is about 24-27% for this bigger task. Network load is usually less than 7-8%. Peak loads (very short) are about 50%. That is, the network is almost idle, as are the CPUs.
> We have changed the Link Speed Duplex setting of the switch (from Auto to Full duplex 1000 Mbps), but it has not helped…
> As to the adapters, I am not well versed in these things. All nodes have INTEL PWLA 8492 MT 2xUTP 10/100/1000Mb, PCI. I think that is not a bad card, or am I wrong?
> My switches are of average quality: D-Link DGS-1008D/GE 8xUTP 10/100/1000Mb. I will try your suggestion about 16- or 24-port switches if I can borrow them from somebody (they are rather expensive…). However, it seems the reason must be something else, since the network is basically idle now.
> Granovsky has advised me to attach the outputs of some run. OK, I am doing so for the MP2 task in its most efficient variant. Please take a look. At the end of the outputs there are some timing statistics, since the task was run with the -prof option on the command line. However, I do not know the meaning of these values. I took the input for this run from the Performance section of the FF site (Test 3).
>If any other data are required, I am ready to provide them.
>Many thanks for your help!
>On Wed Mar 31 '10 10:54pm, Alexei Popov wrote
>>It seems you are using a benchmark that is simply
>>too small for your cluster. If you look here,
>>you'll find that it takes ca. 400-500 seconds to complete on 8 cores.
>>Actually, with the latest processors Firefly seems to need an
>>updated set of benchmarks, at least for parallel runs.
>>I'd suggest that you test the performance and scalability of your
>>particular cluster/setup using a job that takes at least 1-2 hours
>>to complete on 8 cores running on a single node.
>>Your nodes are fast, and my guess is you are using 1 Gbit Ethernet?
>>This is most likely the optimal solution for such a small cluster;
>>at least the price/performance ratio is very reasonable.
>>However, you need high-quality Ethernet adapters and a really
>>good switch. Formally you need an 8-port switch, but I doubt
>>that a typical 8-port switch will be good enough for your purposes,
>>so you can try to experiment with 16- or 24-port models.