Firefly and PC GAMESS-related discussion club

Re^3: Signal 7 for intelMPI version

Alex Granovsky
gran@classic.chem.msu.su

Dear Andrey,

I'm sorry for the delayed reply; I have been very busy these last few weeks.

SIGBUS (signal 7), unlike SIGSEGV, is not a signal that is normally
caused by bugs in user programs (although some user errors in handling
memory-mapped files can cause it).

With Firefly on the Lomonosov-1 cluster, SIGBUS is usually raised
when Firefly calls code from a shared library that is located on
a Lustre fs and whose code segment has been discarded by the OS kernel
due to lack of memory. In this situation, the kernel attempts to reload
the segment from Lustre, and if that fails, SIGBUS is raised.

Unfortunately, Lustre is not very stable on this particular cluster.
Normally, simply ignoring SIGBUS, waiting for a short time, and
re-executing the faulting code solves the problem, and this is what
Firefly attempts to do. However, on some buggy nodes or nodes
connected to malfunctioning switches, Lustre can become inaccessible
for hours or days, or until the next reboot.
In this case, it may be wise to kill the job.

Unfortunately, it is generally impossible to predict how long it will
take to restore the connection to Lustre and recover from I/O errors,
and this is why Firefly presently does not terminate itself after some
number of retries. It leaves this decision to a human.
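
As a rough aid for that decision, the retry messages in the job's
stdout can be counted per rank to see which ranks keep faulting.
The following Python sketch assumes the message format shown in the
quoted log below ("TID ... on rank ... caught bogus SIGBUS."); the
function name count_sigbus_by_rank and the sample lines are
hypothetical, and this is not a tool shipped with Firefly.

```python
import re
from collections import Counter

# Matches lines such as "TID 3775 on rank 170 caught bogus SIGBUS."
SIGBUS_LINE = re.compile(r"TID (\d+) on rank (\d+) caught bogus SIGBUS")


def count_sigbus_by_rank(lines):
    """Return a Counter mapping MPI rank -> number of SIGBUS messages seen."""
    counts = Counter()
    for line in lines:
        match = SIGBUS_LINE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Sample lines modeled on the quoted log (illustrative, not real output).
sample = [
    "TID 3775 on rank 170 caught bogus SIGBUS.",
    "Waiting 100 milliseconds and trying to resume.",
    "TID 3773 on rank 168 caught bogus SIGBUS.",
    "TID 3775 on rank 170 caught bogus SIGBUS.",
]
counts = count_sigbus_by_rank(sample)
print(counts.most_common())
```

In practice the lines would come from iterating over the open stdout
file, so a multi-gigabyte log does not have to fit in memory.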

Hope this helps.

Kind regards,
Alex Granovsky

On Thu Nov 22 '18 12:39pm, Andrey Degtyarev wrote
-------------------------------------------------
>I tried to add these options.

>In 1.5 hours, 35 GB of stdout + stderr accumulated, with content like this:
>...
>TID 3775 on rank 170 caught bogus SIGBUS.

>Dump of registers follows

>eax :: 0x57419e78, edx :: 0x00002000
>ecx :: 0x0000079e, ebx :: 0xffff7868
>esi :: 0x5cbaece8, edi :: 0x57418000
>ebp :: 0x00000000, esp :: 0xffff7760
>eip :: 0x5563f3b8, eflags :: 0x00010206

>cs :: 0x0023
>ds :: 0x002b
>es :: 0x002b
>ss :: 0x002b
>fs :: 0x00c7
>gs :: 0x0063
>
>
>Waiting 100 milliseconds and trying to resume.

>TID 3773 on rank 168 caught bogus SIGBUS.

>Dump of registers follows

>eax :: 0x56037e78, edx :: 0x0000cfc8
>ecx :: 0x00000b9e, ebx :: 0x00003000
>esi :: 0x5cc68ce8, edi :: 0x56038000
>ebp :: 0x00000000, esp :: 0xffff7690
>eip :: 0x5563bde8, eflags :: 0x00010202

>cs :: 0x0023
>ds :: 0x002b
>es :: 0x002b
>ss :: 0x002b
>fs :: 0x00c7
>gs :: 0x0063
>
>
>Waiting 100 milliseconds and trying to resume.
>...
>etc

>but the output file did not change.
>Is there a problem writing to the output file?

>On Wed Nov 21 '18 10:50pm, Alex Granovsky wrote
>-----------------------------------------------
>>Hello,
>>
>>
>>
>>The version of the Lustre filesystem on this cluster is not very robust,
>>causing bogus SIGBUS signals to appear randomly.

>>You need to add the -buggyfs -lustre command-line options to Firefly's
>>command line. This will try to work around most of the Lustre-related bugs.

>>Kind regards,
>>Alex Granovsky
>>
>>
>>
>>
>>
>>
>>On Wed Nov 21 '18 2:04pm, Andrey Degtyarev wrote
>>------------------------------------------------
>>>When I try to run a calculation on the Lomonosov-1 cluster, the program crashes with signal 7 a few minutes after the job starts.
>>>version Firefly: 8.2.0
>>>mpi: intelmpi/4.1.0-32bit

>>>dump files attached.

>>>

Mon Dec 17 '18 8:00pm