Firefly and PC GAMESS-related discussion club



Re^4: PC GAMESS @ Lustre

Maryna V. Krasovska
mkrasovska@gmail.com


Unfortunately, this does not help.

In fact, the Lustre file system we use on our cluster is much faster
than the local HDDs (1 or 2 for eight cores, could that be fast?). By
the way, as far as I know, Lustre is used for storage on Cray XT
machines (I have no access to such machines, though), so our
configuration might not be so exotic.
We use the standard 2.6.18-128.7.1.el5_lustre.1.8.1.1 kernel, Lustre runs
over IB, and no special parameters for the distributed FS were used.


On Wed Dec 16 '09 9:14pm, Alex Granovsky wrote
----------------------------------------------
>Dear Maryna,

>My present guess is as follows.

>All the low-level I/O routines used by Firefly are EINTR (Interrupted system call) safe. However, this only applies to the calls that can return EINTR according to POSIX standards.
>In particular, this does not apply to lseek(), which is assumed never to return EINTR.

>The I/O calls used by the MP2 code are write(), lseek(), and read(), and the only one of them that is not protected is lseek().

>Frankly, we have never encountered this problem before. However, it seems to me that Lustre (unlike any other filesystem we have tested so far) may interrupt lseek() calls.
>We can create a special build of Firefly for you to check for and fix this problem. In the meantime, the solution would be to disable all the async I/O features of the MP2 gradient code (ASYNC and XASYNC of $mp2grd). If this does not help, the next step would be to use the special build of the p2p library that does not use signals.
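
To illustrate the kind of protection discussed above, here is a minimal Python sketch of an EINTR-safe wrapper applied to a seek call. It is not Firefly's actual code: the helper `retry_on_eintr`, the bound `max_retries`, and the stand-in `flaky_seek` are all hypothetical names introduced for the example, and a real I/O layer would wrap the C calls read(), write(), and lseek() directly.

```python
import errno
import os
import tempfile


def retry_on_eintr(func, *args, max_retries=100):
    """Call func(*args), retrying while it fails with EINTR.

    Hypothetical helper for illustration; max_retries is an arbitrary
    safety bound so a persistent failure cannot loop forever.
    """
    for _ in range(max_retries):
        try:
            return func(*args)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # a real error: do not mask it
            # Interrupted by a signal: simply try again.
    raise OSError(errno.EINTR, "still interrupted after retries")


def make_flaky_seek(failures):
    """Build a stand-in for lseek() that is interrupted `failures` times."""
    state = {"left": failures}

    def flaky_seek(fd, offset, whence):
        if state["left"] > 0:
            state["left"] -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return os.lseek(fd, offset, whence)

    return flaky_seek


with tempfile.TemporaryFile() as tmp:
    tmp.write(b"x" * 1024)
    tmp.flush()
    seek = make_flaky_seek(failures=3)
    # The wrapper hides the three simulated interruptions from the caller.
    position = retry_on_eintr(seek, tmp.fileno(), 512, os.SEEK_SET)
    print(position)
```

The point of the sketch is only that an unprotected seek fails on the first interruption, whereas the wrapped call retries until it succeeds or hits a genuine error.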

>However, the use of local filesystems is strongly preferred for better performance.

>Regards,
>Alex Granovsky
>
>
>On Wed Dec 16 '09 8:23pm, Maryna V. Krasovska wrote
>---------------------------------------------------
>>Dear Alex,

>>Thanks for the reply.

>>1. firefly_71g_linux_impi_p4

>>2. Seems this does not matter.

>> Intel Core2/ Linux  Firefly version running under Linux.
>> Running on Intel CPU:  Brand ID  0, Family  6, Model  23, Stepping  6
>> CPU Brand String    :  Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
>> CPU Features        :  CMOV, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, MWAIT, EM64T
>> Data cache size     :  L1 32 KB, L2 6144 KB, L3     0 KB
>> max    # of   cores/package :   4
>> max    # of threads/package :   4
>> max     cache sharing level :   2
>> Operating System successfully passed SSE support test.

>>PARALLEL VERSION (INTEL MPI) RUNNING USING    8 PROCESSES (NODES)

>>My run script:
>>PROG_PATH=/share/apps/firefly/
>>PROG_BIN=firefly
>>PROG=$PROG_PATH/$PROG_BIN
>>SCRATCH=/ptmp/scratch/pbstmp.$PBS_JOBID
>>MPIEXEC=/opt/mpiexec/bin/mpiexec
>>source /share/apps/intel/impi/bin/mpivars.sh
>>cd $SCRATCH
>>/opt/mpiexec/bin/mpiexec -v -kill -comm pmi $MPIEXEC -v -kill -comm
>>pmi $PROG -i $PBS_O_WORKDIR/$1 -o $PBS_O_WORKDIR/$2.log -r -f -ex
>>$PROG_PATH -b $PROG_PATH -t $SCRATCH/RUN

>>3.
>>Input:
>> $CONTRL MAXIT=100 mplevl=2 inttyp=hondo icut=10 runtyp=gradient d5=1 $END
>> $SYSTEM MWORDS=250 TIMLIM=100000 async=1 $END
>> $p2p p2p=.t. xdlb=.t. $end
>> $mp2 method=1 $end
>> $scf dirscf=.t. $end
>> $basis gbasis=n21 ngauss=3 ndfunc=0 $end
>> $DATA
>>C60 at PBE
>> DNH   2

>> CARBON      6.0    0.000000   0.699503   3.483274
>> CARBON      6.0    3.483274   0.000000   0.699503
>> CARBON      6.0    0.699503   3.483274   0.000000
>> CARBON      6.0    1.175728   1.426142   3.034186
>> CARBON      6.0    3.034186   1.175728   1.426142
>> CARBON      6.0    1.426142   3.034186   1.175727
>> CARBON      6.0    2.601870   2.307547   0.726640
>> CARBON      6.0    0.726640   2.601870   2.307547
>> CARBON      6.0    2.307547   0.726640   2.601870
>> $END

>>tail of the log:
>> FIRST  MP2 ENERGY/GRADIENT TEI HALF-TRANSFORMATION
>> # OF WORDS AVAILABLE      =  249989366
>> # OF WORDS USED           =    4404679
>> THRESHOLD FOR KEEPING HALF-TRANSFORMED 2E-INTEGRALS =  1.000E-09

>> OPENING FILE DASORT WITH    24117 LOGICAL RECORDS OF   49152 WORDS
>> DASORT WILL CONTAIN A MAXIMUM OF   18521856 PHYSICAL RECORDS OF
>>LENGTH      64 WORDS
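
As a rough check on how much scratch space the DASORT figures above imply, the record counts can be multiplied out. The short Python sketch below assumes 8-byte words, which the log does not state, and its variable names are illustrative only. The log also does not say whether the figure is per process or for the whole job.

```python
# Figures copied from the DASORT lines of the log.
logical_records = 24117
words_per_logical_record = 49152
physical_records_max = 18521856
words_per_physical_record = 64

BYTES_PER_WORD = 8  # assumption: one word is a 64-bit quantity

total_words = logical_records * words_per_logical_record

# Both descriptions of the file should agree on the total word count.
assert total_words == physical_records_max * words_per_physical_record

total_bytes = total_words * BYTES_PER_WORD
print(f"{total_words} words, about {total_bytes / 2**30:.2f} GiB")
```

Under that assumption the file can grow to several gigabytes, so this step exercises the scratch file system heavily with seeks, reads, and writes.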

>>This works OK with NFS.
>>I found that setting $system ioflag(30) $end helps with MP2 energy,
>>but not gradient...

>>Thanks,
>> Maryna

>>On Wed Dec 16 '09 7:41pm, Alex Granovsky wrote
>>----------------------------------------------
>>>Dear Maryna,

>>>could you please provide some more information:

>>>1. What was the exact archive name (i.e., CPU type, MPI, and linking type) of the Firefly distribution you are using?

>>>2. How many nodes were used, and what was the exact command line string used to launch Firefly?

>>>3. The complete input as well as partial output would be very helpful.

>>>Regards,
>>>Alex Granovsky
>>>
>>>
>>>
>>>On Wed Dec 16 '09 3:01pm, Maryna V. Krasovska wrote
>>>---------------------------------------------------
>>>>Dear Dr. Granovsky,

>>>>I got the following error while trying to perform method 1 MP2 (either
>>>>energy or gradient) using a Lustre-1.8.1 file system for scratch files
>>>>with Firefly version 7.1.G, build number 5618.

>>>>---------------------------------------------------
>>>>I/O error is :: Interrupted system call

>>>>TID 29821 caught signal 11, exiting.

>>>>Dump of registers follows

>>>>eax :: 0x00000000, edx :: 0x00000001
>>>>ecx :: 0x64d2fec0, ebx :: 0x64cefe48
>>>>esi :: 0x56e0f690, edi :: 0x00000011
>>>>ebp :: 0xffcce050, esp :: 0xffcce000
>>>>eip :: 0x104a850e, eflags :: 0x00210246

>>>>cs  :: 0x0023
>>>>ds  :: 0x002b
>>>>es  :: 0x002b
>>>>ss  :: 0x002b
>>>>fs  :: 0x00d7
>>>>gs  :: 0x0063

>>>>Stack backtrace

>>>>esp :: 0xffcce050, ebp :: 0x00000169, eip :: 0x11b0ee0c
>>>>---------------------------------------------------

>>>>dmesg:

>>>>---------------------------------------------------
>>>>firefly[31408]: segfault at 0000000000006ca1 rip 00000000555beaf5 rsp
>>>>00000000ffce564c error 6
>>>>firefly[32702]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ff9e2980 error 4
>>>>firefly[32704]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ffc51b90 error 4
>>>>firefly[32703]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ffa5f940 error 4
>>>>firefly[328]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ffc4e950 error 4
>>>>firefly[330]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000fff8c950 error 4
>>>>firefly[329]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ff8214c0 error 4
>>>>---------------------------------------------------
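
For readers who want to compare the dmesg entries above, the following Python sketch groups them by instruction pointer and decodes the trailing `error` value as an x86 page-fault error code (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The regular expression and the helper `decode_error` are illustrative names written for this example, and the wrapped log lines have been rejoined by hand.

```python
import re
from collections import Counter

# Lines copied from the dmesg output above (wrapped lines rejoined).
DMESG = """\
firefly[31408]: segfault at 0000000000006ca1 rip 00000000555beaf5 rsp 00000000ffce564c error 6
firefly[32702]: segfault at 0000000000000004 rip 00000000003b9e33 rsp 00000000ff9e2980 error 4
firefly[32704]: segfault at 0000000000000004 rip 00000000003b9e33 rsp 00000000ffc51b90 error 4
firefly[32703]: segfault at 0000000000000004 rip 00000000003b9e33 rsp 00000000ffa5f940 error 4
firefly[328]: segfault at 0000000000000004 rip 00000000003b9e33 rsp 00000000ffc4e950 error 4
firefly[330]: segfault at 0000000000000004 rip 00000000003b9e33 rsp 00000000fff8c950 error 4
firefly[329]: segfault at 0000000000000004 rip 00000000003b9e33 rsp 00000000ff8214c0 error 4
"""

PATTERN = re.compile(
    r"firefly\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


faults = [m.groupdict() for m in PATTERN.finditer(DMESG)]
by_rip = Counter(f["rip"] for f in faults)

# One summary line per distinct instruction pointer.
for rip, count in by_rip.most_common():
    sample = next(f for f in faults if f["rip"] == rip)
    print(rip, count, decode_error(int(sample["err"], 16)))
```

Six of the seven entries share the same instruction pointer and the same faulting address 0x4, which would be consistent with the same code path failing in each of those processes.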

>>>>The output ends at shell #68 (of 180) of first half-transformation.
>>>>No special I/O options were activated in the input file.
>>>>NFS works for scratch files on my cluster.
>>>>Other programs work well with Lustre.

>>>>Thanks for support!
>>>>


Wed Dec 16 '09 10:43pm