Firefly and PC GAMESS-related discussion club





Re^3: PC GAMESS @ Lustre

Alex Granovsky
gran@classic.chem.msu.su


Dear Maryna,

My present guess is as follows.

All the low-level I/O routines used by Firefly are EINTR (Interrupted system call) safe. However, this only applies to calls that may return EINTR according to the POSIX standard.
In particular, it does not apply to lseek(), which is assumed never to return EINTR.
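For reference, the usual EINTR-retry idiom for calls that POSIX allows to be interrupted, such as read() and write(), looks like this. This is a minimal illustrative sketch in C, not Firefly's actual source:

```c
#include <errno.h>
#include <unistd.h>

/* Minimal sketch of the EINTR-retry idiom for calls that POSIX allows
 * to fail with EINTR, such as read() and write(). Illustrative only;
 * not Firefly's actual source code. */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);          /* restart if interrupted */
    } while (n == -1 && errno == EINTR);
    return n;
}

/* lseek(), by contrast, is normally called without such a retry loop:
 * POSIX does not list EINTR among lseek()'s error codes, so callers
 * have no reason to expect it to be interrupted. */
```

A filesystem that nonetheless returns EINTR from an lseek() call would bypass this protection, which is consistent with the failure reported below.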

The I/O calls used by the MP2 code are write(), lseek(), and read(), and the only one of them that is not protected is lseek().

Frankly, we have never encountered this problem before; however, it seems to me that Lustre (unlike any other filesystem we have tested so far) may interrupt lseek() calls.
We can create a special build of Firefly for you to check for and fix this problem. In the meantime, a workaround would be to disable all the asynchronous I/O features of the MP2 gradient code (ASYNC and XASYNC of $mp2grd). If this does not help, the next step would be to use the special build of the P2P library that does not use signals.
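For example, assuming the ASYNC and XASYNC keywords of $mp2grd take the usual Firefly logical values (this exact line is an assumption; check the manual for your build's defaults), the workaround would look like:

```
 $mp2grd async=.f. xasync=.f. $end
```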

However, the use of local filesystems is strongly preferred for better performance.

Regards,
Alex Granovsky


On Wed Dec 16 '09 8:23pm, Maryna V. Krasovska wrote
---------------------------------------------------
>Dear Alex,

>Thanks for reply.

>1. firefly_71g_linux_impi_p4

>2. Seems this does not matter.

> Intel Core2/Linux Firefly version running under Linux.
> Running on Intel CPU:  Brand ID 0, Family 6, Model 23, Stepping 6
> CPU Brand String    :  Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
> CPU Features        :  CMOV, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, MWAIT, EM64T
> Data cache size     :  L1 32 KB, L2 6144 KB, L3 0 KB
> max    # of   cores/package :   4
> max    # of threads/package :   4
> max     cache sharing level :   2
> Operating System successfully passed SSE support test.

>PARALLEL VERSION (INTEL MPI) RUNNING USING 8 PROCESSES (NODES)

>My run script:
>PROG_PATH=/share/apps/firefly/
>PROG_BIN=firefly
>PROG=$PROG_PATH/$PROG_BIN
>SCRATCH=/ptmp/scratch/pbstmp.$PBS_JOBID
>MPIEXEC=/opt/mpiexec/bin/mpiexec
>source /share/apps/intel/impi/bin/mpivars.sh
>cd $SCRATCH
>/opt/mpiexec/bin/mpiexec -v -kill -comm pmi $MPIEXEC -v -kill -comm
>pmi $PROG -i $PBS_O_WORKDIR/$1 -o $PBS_O_WORKDIR/$2.log -r -f -ex
>$PROG_PATH -b $PROG_PATH -t $SCRATCH/RUN

>3.
>Input:
> $CONTRL MAXIT=100 mplevl=2 inttyp=hondo icut=10 runtyp=gradient d5=1 $END
> $SYSTEM MWORDS=250 TIMLIM=100000 async=1 $END
> $p2p p2p=.t. xdlb=.t. $end
> $mp2 method=1 $end
> $scf dirscf=.t. $end
> $basis gbasis=n21 ngauss=3 ndfunc=0 $end
> $DATA
>C60 at PBE
> DNH 2

> CARBON      6.0    0.000000   0.699503   3.483274
> CARBON      6.0    3.483274   0.000000   0.699503
> CARBON      6.0    0.699503   3.483274   0.000000
> CARBON      6.0    1.175728   1.426142   3.034186
> CARBON      6.0    3.034186   1.175728   1.426142
> CARBON      6.0    1.426142   3.034186   1.175727
> CARBON      6.0    2.601870   2.307547   0.726640
> CARBON      6.0    0.726640   2.601870   2.307547
> CARBON      6.0    2.307547   0.726640   2.601870
> $END

>tail of the log:
> FIRST MP2 ENERGY/GRADIENT TEI HALF-TRANSFORMATION
> # OF WORDS AVAILABLE      =  249989366
> # OF WORDS USED           =    4404679
> THRESHOLD FOR KEEPING HALF-TRANSFORMED 2E-INTEGRALS =  1.000E-09

> OPENING FILE DASORT WITH    24117 LOGICAL RECORDS OF   49152 WORDS
> DASORT WILL CONTAIN A MAXIMUM OF   18521856 PHYSICAL RECORDS OF
>LENGTH      64 WORDS

>This works OK with NFS.
>I found that setting $system ioflag(30) $end helps with the MP2 energy,
>but not the gradient...

>Thanks,
> Maryna

>On Wed Dec 16 '09 7:41pm, Alex Granovsky wrote
>----------------------------------------------
>>Dear Maryna,

>>could you please provide some more information:

>>1. What was the exact archive name (i.e., CPU type, MPI, and linking type) of the Firefly distribution you are using?

>>2. How many nodes were used, and what was the exact command line used to launch Firefly?

>>3. The complete input as well as partial output would be very helpful.

>>Regards,
>>Alex Granovsky
>>
>>
>>
>>On Wed Dec 16 '09 3:01pm, Maryna V. Krasovska wrote
>>---------------------------------------------------
>>>Dear Dr. Granovsky,

>>>I got the following error while trying to perform method 1 MP2 (either
>>>energy or gradient) using Lustre-1.8.1 file system for scratch files
>>>with Firefly version 7.1.G, build number 5618.

>>>---------------------------------------------------
>>>I/O error is :: Interrupted system call

>>>TID 29821 caught signal 11, exiting.

>>>Dump of registers follows

>>>eax :: 0x00000000, edx :: 0x00000001
>>>ecx :: 0x64d2fec0, ebx :: 0x64cefe48
>>>esi :: 0x56e0f690, edi :: 0x00000011
>>>ebp :: 0xffcce050, esp :: 0xffcce000
>>>eip :: 0x104a850e, eflags :: 0x00210246

>>>cs �:: 0x0023
>>>ds �:: 0x002b
>>>es �:: 0x002b
>>>ss �:: 0x002b
>>>fs �:: 0x00d7
>>>gs �:: 0x0063

>>>Stack backtrace

>>>esp :: 0xffcce050, ebp :: 0x00000169, eip :: 0x11b0ee0c
>>>---------------------------------------------------

>>>dmesg:

>>>---------------------------------------------------
>>>firefly[31408]: segfault at 0000000000006ca1 rip 00000000555beaf5 rsp
>>>00000000ffce564c error 6
>>>firefly[32702]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>00000000ff9e2980 error 4
>>>firefly[32704]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>00000000ffc51b90 error 4
>>>firefly[32703]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>00000000ffa5f940 error 4
>>>firefly[328]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>00000000ffc4e950 error 4
>>>firefly[330]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>00000000fff8c950 error 4
>>>firefly[329]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>00000000ff8214c0 error 4
>>>---------------------------------------------------

>>>The output ends at shell #68 (of 180) of the first half-transformation.
>>>No special I/O options were activated in the input file.
>>>NFS works for scratch files on my cluster.
>>>Other programs work well with Lustre.

>>>Thanks for support!
>>>


Wed Dec 16 '09 9:14pm