Firefly and PC GAMESS-related discussion club

Re^4: Firefly Bug

Hi,

you just need to upgrade your OS; the 2.6.9 kernel on that node falls in the affected range described in the semop(2) man page:

BUGS
       When a process terminates, its set of associated semadj structures is
       used to undo the effect of all of the semaphore operations it performed
       with the SEM_UNDO flag. This raises a difficulty: if one (or more) of
       these semaphore adjustments would result in an attempt to decrease a
       semaphore's value below zero, what should an implementation do? One
       possible approach would be to block until all the semaphore adjustments
       could be performed. This is however undesirable since it could force
       process termination to block for arbitrarily long periods. Another
       possibility is that such semaphore adjustments could be ignored
       altogether (somewhat analogously to failing when IPC_NOWAIT is specified
       for a semaphore operation). Linux adopts a third approach: decreasing
       the semaphore value as far as possible (i.e., to zero) and allowing
       process termination to proceed immediately.

       In kernels 2.6.x, x <= 10, there is a bug that in some circumstances
       prevents a process that is waiting for a semaphore value to become zero
       from being woken up when the value does actually become zero. This bug
       is fixed in kernel 2.6.11.

CONFORMING TO
       SVr4, POSIX.1-2001.
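
To check a set of nodes against the version range quoted above (2.6.x with x <= 10), something like the following could be used. It is only a sketch: the function names are made up for illustration, and it looks at nothing but the version number, so a vendor kernel that carries a backported fix would still be flagged. Treat the result as a hint, not as proof.

```python
import platform
import re


def parse_kernel_version(text):
    """Return (major, minor, patch) from a release string or a /proc/version line.

    Hypothetical helper: it takes the first "a.b.c" group found in the text.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no kernel version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def has_semzero_wakeup_bug(text):
    """True if the version is 2.6.x with x <= 10, the range named in the man page.

    Assumption: only the version number is checked; backported fixes are not detected.
    """
    major, minor, patch = parse_kernel_version(text)
    return (major, minor) == (2, 6) and patch <= 10


if __name__ == "__main__":
    # Check the kernel of the machine this script runs on.
    release = platform.release()
    print(release, "affected" if has_semzero_wakeup_bug(release) else "not in the affected range")
```

For the release string from the deadlocked node, `has_semzero_wakeup_bug("2.6.9-67.0.7.ELlargesmp")` returns `True`, while a 2.6.11 or later kernel gives `False`.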

regards,
Alex Granovsky

On Sat Jan 30 '10 10:31pm, alex wrote
-------------------------------------
>Here is the data from the node where I just killed a deadlocked Firefly:
>{qi0027}~> cat /proc/version
>Linux version 2.6.9-67.0.7.ELlargesmp (brewbuilder@hs20-bc1-5.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-9)) #1 SMP Wed Feb 27 04:57:28 EST 2008
>{qi0027}~> cat /etc/issue
>Red Hat Enterprise Linux WS release 4 (Nahant Update 6)
>
>
>
>On Sat Jan 30 '10 10:02pm, alex wrote
>-------------------------------------
>>My OS is CentOS 5.
>>There is no input/output to show because the bug is not reproducible, i.e. the same input on the same machine runs normally when started again (after killing the deadlocked run in, say, an infinite-loop script with one job). Hyper-threading is off on our cluster, but I have it on on my workstation (CentOS 5), where I haven't met the bug (there is only one Firefly job on 2 CPUs, since 2+2 with hyper-threading slightly slows down the calculations). I think, however, that the developers could fix the problem by adding an explicit initialization of all semaphores in the SCF routines (I use DLB, but only within a single node).
>>P.S. The only significant difference I can find between the cluster nodes and my workstation is a RAID, which is absent on the nodes. Could it be I/O races in SCF initialization?
>>Alex.
>

[ This message was edited on Sun Jan 31 '10 at 2:13pm by the author ]

Sun Jan 31 '10 2:13pm