Firefly and PC GAMESS-related discussion club





Re: Low CPU Utilization on 1 of cluster nodes.

Alex Granovsky
gran@classic.chem.msu.su


Dear Olga,

The most essential information is the following:


Timer #    5,    value :  2.87304690629600000D+12
Timer #    6,    value :  2.87304952912000000D+12
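
As a rough cross-check of what these numbers mean: the explanation that follows treats them as CPU cycle counts, so dividing by the clock rate gives seconds. The sketch below does this for Timer #5 of the entry labelled "node 1" in each of the two quoted profiles. The 3.0 GHz clock rate and the names `ASSUMED_CLOCK_HZ` and `cycles_to_seconds` are assumptions for illustration only; the thread does not state the actual CPU frequency.

```python
# Assumed clock rate. This value is NOT given in the thread; replace it with
# the real frequency of the node (e.g. the "cpu MHz" field of /proc/cpuinfo).
ASSUMED_CLOCK_HZ = 3.0e9


def cycles_to_seconds(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a CPU cycle count to seconds for the given clock rate."""
    return cycles / clock_hz


# Timer #5 of the entry labelled "node 1", copied from the quoted -prof output.
timer5_bad_run = 2.873046906296e12   # job run on cluster node 4
timer5_good_run = 3.1311160288e10    # job run on cluster node 5

bad_s = cycles_to_seconds(timer5_bad_run)
good_s = cycles_to_seconds(timer5_good_run)

print(f"bad run : {bad_s:8.1f} s inside the barrier (at the assumed clock)")
print(f"good run: {good_s:8.1f} s inside the barrier (at the assumed clock)")
print(f"ratio   : {bad_s / good_s:8.1f}x")
```

At the assumed clock rate this gives on the order of 1000 seconds for the run on node 4 and on the order of 10 seconds for the run on node 5. The absolute figures scale with the real clock rate, but the ratio between the two runs does not depend on it.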

With Firefly 7.1.G, these two counters measure the
number of CPU cycles spent inside the MPI_Barrier() call.
Roughly speaking, when running on the "bad" node Firefly lost
ca. 1000 seconds inside this call, while on the "good" one
it takes only ca. 15 seconds. While this is clearly
not an internal Firefly problem, there are still some
things to check, namely:

1. Is the "bad" node indeed an eight-core computer?
Note that you should count physical, not logical, cores.
Under Linux, the output of "cat /proc/cpuinfo" would be helpful.

2. Are there any processes left over from previous runs on the "bad" node that are still consuming CPU cycles?

3. Which MPI implementation are you using?
Does it use blocking calls or polling for small messages
and synchronization? If it uses polling, there could be
issues with some MPI implementations that do not use proper
atomic primitives and memory barriers. This problem does not
always manifest itself; it depends on the particular hardware
and especially the BIOS, as the BIOS may contain microcode updates
for the CPUs and chipset.

4. Are the OS kernel and BIOS versions identical on both nodes?
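
Checks 1 and 4 can be scripted so that the same report is produced on both nodes and compared side by side. The following is a minimal sketch, assuming a Linux node where /proc/cpuinfo carries "physical id" and "core id" fields and where the BIOS version is exposed at /sys/class/dmi/id/bios_version (both typical on x86 Linux, but not guaranteed). The function names are hypothetical.

```python
import platform
from pathlib import Path


def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts parsed from /proc/cpuinfo text.

    Physical cores are counted as unique (physical id, core id) pairs, so
    Hyper-Threading siblings of the same core are counted only once.
    """
    logical = 0
    physical = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        if "physical id" in fields and "core id" in fields:
            physical.add((fields["physical id"], fields["core id"]))
    # If the topology fields are missing, fall back to the logical count.
    return logical, (len(physical) or logical)


def read_text_or_empty(path):
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def node_fingerprint():
    """Collect the facts to compare between the "good" and the "bad" node."""
    logical, physical = count_cores(read_text_or_empty("/proc/cpuinfo"))
    # Assumed sysfs location of the BIOS version; may be absent on some systems.
    bios = read_text_or_empty("/sys/class/dmi/id/bios_version").strip()
    return {
        "hostname": platform.node(),
        "kernel": platform.release(),
        "bios_version": bios or "unknown",
        "logical_cores": logical,
        "physical_cores": physical,
    }


if __name__ == "__main__":
    for key, value in node_fingerprint().items():
        print(f"{key:15s} {value}")
```

Running the script on each node and comparing the two reports line by line shows at once whether the core counts, kernel release, or BIOS version differ. If the logical count is twice the physical count, the node has four physical cores with Hyper-Threading rather than eight real ones.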

Regards,
Alex










On Thu Oct 14 '10 1:29pm, Olga wrote
------------------------------------
>Hi!
>I have 2 nodes (4 and 5) on a computer cluster for running Firefly; each node has 8 cores. For testing I use an input file whose calculation took 8.9 minutes on 3 cores (on another computer). On the cluster I get:

>                      CPU UTILIZATION (%)   CPU TIME (min)   WALL CLOCK TIME (min)
>8 cores:
>node 4                        9.71               4.9                50.3
>node 5                      101.62               5.1                 5.0
>16 cores - 2 nodes:           5.96               2.9                48.3

>I.e. the second node (node 5) works well, the first node (node 4) has very low CPU utilization, and the 2 nodes together are lower still.
>I use direct calculation mode (dirscf=.t.).
>Below are the timing statistics for each of the 2 nodes, obtained with the -prof option. Maybe they will be useful?

>Why does this happen?
>Thanks for any advice!

>Timing statistics on node 4:
> Nonzero profiling timers on node    0:
> Timer #    1,    value :  3.98379953600000000D+09
> Timer #    2,    value :  1.02167162720000000D+10
> Timer #    3,    value :  6.31929680000000000D+07
> Timer #    5,    value :  4.28207664720000000D+10
> Timer #    6,    value :  4.28249098720000000D+10
> Timer #   11,    value :  1.55933147920000000D+10
> Timer #   12,    value :  1.74584208000000000D+08
> Timer #  150,    value :  7.83297488000000000D+08
> Timer #  151,    value :  5.13802400000000000D+06
> Timer #  152,    value :  4.17138744000000000D+09
> Timer #  500,    value :  6.39728416000000000D+08
> Timer #  505,    value :  1.17644476800000000D+09
>
>
>
> Nonzero profiling timers on node    1:
> Timer #    1,    value :  6.48114602840000000D+11
> Timer #    2,    value :  6.47319688808000000D+11
> Timer #    3,    value :  5.50739360000000000D+07
> Timer #    5,    value :  2.87304690629600000D+12
> Timer #    6,    value :  2.87304952912000000D+12
> Timer #   11,    value :  1.89633350880000000D+10
> Timer #   12,    value :  1.72526344000000000D+08
> Timer #  150,    value :  3.80924904000000000D+08
> Timer #  151,    value :  1.81760000000000000D+05
> Timer #  152,    value :  7.07490928000000000D+09
> Timer #  500,    value :  4.43645200000000000D+07
> Timer #  505,    value :  3.09415384000000000D+08
>
>
>
> Nonzero profiling timers on node    2:
> Timer #    1,    value :  6.47107859528000000D+11
> Timer #    2,    value :  6.48616140112000000D+11
> Timer #    3,    value :  5.44181680000000000D+07
> Timer #    5,    value :  2.87336054361600000D+12
> Timer #    6,    value :  2.87336271536000000D+12
> Timer #   11,    value :  1.56407675040000000D+10
> Timer #   12,    value :  1.72809752000000000D+08
> Timer #  150,    value :  3.80728624000000000D+08
> Timer #  151,    value :  1.62240000000000000D+05
> Timer #  152,    value :  6.09675160000000000D+09
> Timer #  500,    value :  4.34736880000000000D+07
> Timer #  505,    value :  3.23904976000000000D+08
>
>
>
> Nonzero profiling timers on node    3:
> Timer #    1,    value :  6.47028113880000000D+11
> Timer #    2,    value :  6.51052175496000000D+11
> Timer #    3,    value :  3.72768720000000000D+07
> Timer #    5,    value :  2.87397743856000000D+12
> Timer #    6,    value :  2.87397981103200000D+12
> Timer #   11,    value :  1.56190340320000000D+10
> Timer #   12,    value :  1.72847840000000000D+08
> Timer #  150,    value :  3.82082600000000000D+08
> Timer #  151,    value :  1.65288000000000000D+05
> Timer #  152,    value :  5.97725121600000000D+09
> Timer #  500,    value :  4.86447840000000000D+07
> Timer #  505,    value :  3.28839616000000000D+08
>
>
>
> Nonzero profiling timers on node    4:
> Timer #    1,    value :  6.47900587840000000D+11
> Timer #    2,    value :  6.50744480472000000D+11
> Timer #    3,    value :  5.23454800000000000D+07
> Timer #    5,    value :  2.87357408602400000D+12
> Timer #    6,    value :  2.87357675466400000D+12
> Timer #   11,    value :  1.56338300560000000D+10
> Timer #   12,    value :  1.72403832000000000D+08
> Timer #  150,    value :  3.81443848000000000D+08
> Timer #  151,    value :  1.57056000000000000D+05
> Timer #  152,    value :  6.89809748800000000D+09
> Timer #  500,    value :  4.77209440000000000D+07
> Timer #  505,    value :  3.33102704000000000D+08
>
>
>
> Nonzero profiling timers on node    5:
> Timer #    1,    value :  6.47092377224000000D+11
> Timer #    2,    value :  6.49354949808000000D+11
> Timer #    3,    value :  4.46726240000000000D+07
> Timer #    5,    value :  2.87347369622400000D+12
> Timer #    6,    value :  2.87361473109600000D+12
> Timer #   11,    value :  1.56528093280000000D+10
> Timer #   12,    value :  1.72657624000000000D+08
> Timer #  150,    value :  3.83355128000000000D+08
> Timer #  151,    value :  1.69320000000000000D+05
> Timer #  152,    value :  6.03587908800000000D+09
> Timer #  500,    value :  4.56482720000000000D+07
> Timer #  505,    value :  3.14377104000000000D+08
>
>
>
> Nonzero profiling timers on node    6:
> Timer #    1,    value :  6.46981907336000000D+11
> Timer #    2,    value :  6.48787160208000000D+11
> Timer #    3,    value :  3.89721760000000000D+07
> Timer #    5,    value :  2.87178005160000000D+12
> Timer #    6,    value :  2.87178232033600000D+12
> Timer #   11,    value :  1.56449344160000000D+10
> Timer #   12,    value :  1.72784120000000000D+08
> Timer #  150,    value :  3.83596224000000000D+08
> Timer #  151,    value :  1.59504000000000000D+05
> Timer #  152,    value :  5.91703767200000000D+09
> Timer #  500,    value :  3.95701600000000000D+07
> Timer #  505,    value :  2.80391008000000000D+08
>
>
>
> Nonzero profiling timers on node    7:
> Timer #    1,    value :  6.47751587600000000D+11
> Timer #    2,    value :  6.50602128728000000D+11
> Timer #    3,    value :  3.53694880000000000D+07
> Timer #    5,    value :  2.87260665065600000D+12
> Timer #    6,    value :  2.87260886552000000D+12
> Timer #   11,    value :  1.56480665680000000D+10
> Timer #   12,    value :  1.72693912000000000D+08
> Timer #  150,    value :  3.83430080000000000D+08
> Timer #  151,    value :  1.76952000000000000D+05
> Timer #  152,    value :  6.69951108800000000D+09
> Timer #  500,    value :  4.83972880000000000D+07
> Timer #  505,    value :  3.34282288000000000D+08

>On node 5:
> Nonzero profiling timers on node    0:
> Timer #    1,    value :  4.68144312800000000D+09
> Timer #    2,    value :  1.42523975040000000D+10
> Timer #    3,    value :  1.80829424000000000D+08
> Timer #    5,    value :  4.67983424640000000D+10
> Timer #    6,    value :  4.68026161280000000D+10
> Timer #   11,    value :  1.56220019280000000D+10
> Timer #   12,    value :  1.75274320000000000D+08
> Timer #  150,    value :  7.59579552000000000D+08
> Timer #  151,    value :  4.67166400000000000D+06
> Timer #  152,    value :  4.83085481600000000D+09
> Timer #  500,    value :  6.17098824000000000D+08
> Timer #  505,    value :  1.20245706400000000D+09
>
>
>
> Nonzero profiling timers on node    1:
> Timer #    1,    value :  8.77942285600000000D+09
> Timer #    2,    value :  9.23579720000000000D+09
> Timer #    3,    value :  1.60882848000000000D+08
> Timer #    5,    value :  3.13111602880000000D+10
> Timer #    6,    value :  3.13542393760000000D+10
> Timer #   11,    value :  1.89882810800000000D+10
> Timer #   12,    value :  1.72622344000000000D+08
> Timer #  150,    value :  3.69860296000000000D+08
> Timer #  151,    value :  1.71312000000000000D+05
> Timer #  152,    value :  7.64218865600000000D+09
> Timer #  500,    value :  4.51941120000000000D+07
> Timer #  505,    value :  3.19343312000000000D+08
>
>
>
> Nonzero profiling timers on node    2:
> Timer #    1,    value :  7.31887372000000000D+09
> Timer #    2,    value :  9.10029755200000000D+09
> Timer #    3,    value :  1.65496320000000000D+08
> Timer #    5,    value :  3.16839030080000000D+10
> Timer #    6,    value :  3.16852723440000000D+10
> Timer #   11,    value :  1.56542907440000000D+10
> Timer #   12,    value :  1.72900840000000000D+08
> Timer #  150,    value :  3.69546976000000000D+08
> Timer #  151,    value :  1.62872000000000000D+05
> Timer #  152,    value :  6.03097576800000000D+09
> Timer #  500,    value :  4.26484400000000000D+07
> Timer #  505,    value :  3.18286800000000000D+08
>
>
>
> Nonzero profiling timers on node    3:
> Timer #    1,    value :  8.23967892000000000D+09
> Timer #    2,    value :  1.35622884480000000D+10
> Timer #    3,    value :  4.12098080000000000D+07
> Timer #    5,    value :  2.97561857120000000D+10
> Timer #    6,    value :  2.97583092960000000D+10
> Timer #   11,    value :  1.56355312880000000D+10
> Timer #   12,    value :  1.73542256000000000D+08
> Timer #  150,    value :  3.70329432000000000D+08
> Timer #  151,    value :  1.77216000000000000D+05
> Timer #  152,    value :  6.86189634400000000D+09
> Timer #  500,    value :  4.23680400000000000D+07
> Timer #  505,    value :  3.24100664000000000D+08
>
>
>
> Nonzero profiling timers on node    4:
> Timer #    1,    value :  7.93580033600000000D+09
> Timer #    2,    value :  1.35766439600000000D+10
> Timer #    3,    value :  1.68802704000000000D+08
> Timer #    5,    value :  3.09905246240000000D+10
> Timer #    6,    value :  3.09928242480000000D+10
> Timer #   11,    value :  1.56434374160000000D+10
> Timer #   12,    value :  1.72741880000000000D+08
> Timer #  150,    value :  3.68176432000000000D+08
> Timer #  151,    value :  1.91408000000000000D+05
> Timer #  152,    value :  6.68788316800000000D+09
> Timer #  500,    value :  4.63780400000000000D+07
> Timer #  505,    value :  3.34973616000000000D+08
>
>
>
> Nonzero profiling timers on node    5:
> Timer #    1,    value :  7.55875089600000000D+09
> Timer #    2,    value :  9.43258139200000000D+09
> Timer #    3,    value :  1.58477720000000000D+08
> Timer #    5,    value :  3.01079203120000000D+10
> Timer #    6,    value :  3.01104920720000000D+10
> Timer #   11,    value :  1.56380862000000000D+10
> Timer #   12,    value :  1.72755544000000000D+08
> Timer #  150,    value :  3.68835688000000000D+08
> Timer #  151,    value :  1.70496000000000000D+05
> Timer #  152,    value :  6.20219310400000000D+09
> Timer #  500,    value :  3.97796960000000000D+07
> Timer #  505,    value :  3.19533856000000000D+08
>
>
>
> Nonzero profiling timers on node    6:
> Timer #    1,    value :  8.05604827200000000D+09
> Timer #    2,    value :  1.05200348320000000D+10
> Timer #    3,    value :  1.56448592000000000D+08
> Timer #    5,    value :  3.06405022480000000D+10
> Timer #    6,    value :  3.06421998720000000D+10
> Timer #   11,    value :  1.56488893040000000D+10
> Timer #   12,    value :  1.72599936000000000D+08
> Timer #  150,    value :  3.70428432000000000D+08
> Timer #  151,    value :  1.66768000000000000D+05
> Timer #  152,    value :  6.75252024000000000D+09
> Timer #  500,    value :  3.81849680000000000D+07
> Timer #  505,    value :  2.86943168000000000D+08
>
>
>
> Nonzero profiling timers on node    7:
> Timer #    1,    value :  8.55401868800000000D+09
> Timer #    2,    value :  1.34026095520000000D+10
> Timer #    3,    value :  1.30647832000000000D+08
> Timer #    5,    value :  3.03974074480000000D+10
> Timer #    6,    value :  3.04403809840000000D+10
> Timer #   11,    value :  1.56436012880000000D+10
> Timer #   12,    value :  1.72435216000000000D+08
> Timer #  150,    value :  3.70845816000000000D+08
> Timer #  151,    value :  1.71560000000000000D+05
> Timer #  152,    value :  7.14388900800000000D+09
> Timer #  500,    value :  4.68506400000000000D+07
> Timer #  505,    value :  3.24921264000000000D+08

Sat Oct 16 '10 6:26pm