Firefly and PC GAMESS-related discussion club





Re^6: PC GAMESS @ Lustre

Alex Granovsky
gran@classic.chem.msu.su


Hi to all,

finally, we have found the reason for the Lustre/Firefly incompatibility.
Firefly version 7.1.H will include the fix for the Lustre fs.
In the meantime, if you need to perform calculations on a Lustre fs,
please contact me directly off the list for information
on the updated binaries.

Regards,
Alex Granovsky


On Sun Dec 20 '09 8:50am, Alex Granovsky wrote
----------------------------------------------
>Hi Maryna,

>I've verified your input (both gradients and just energy)
>on a similar cluster that uses panfs and did not find any problems
>at all... Is it possible to get access to your cluster for testing?
>Alternatively, I could provide you with a custom build of Firefly
>and the p2p library to check what really happens when running on Lustre...
>Please contact me directly if you are interested.

>Regards,
>Alex Granovsky

>P.S. As to local vs. nonlocal fs... We have lots of data on this;
>the best choice really depends on many factors, and there is no
>simple answer. However, I have happened to run MP2 optimizations using
>1024 cores... and that is where a nonlocal fs usually becomes
>over-saturated by I/O requests...
>
>
>
>On Wed Dec 16 '09 10:43pm, Maryna V. Krasovska wrote
>----------------------------------------------------
>>Unfortunately, this does not help.

>>In fact, the Lustre file system we use on our cluster is much faster
>>than the local HDDs (only 1 or 2 drives for eight cores; how fast could that be?). By
>>the way, as far as I know, Lustre is used for storage on Cray XT
>>machines (though I have no access to such machines), so our
>>configuration might not be so exotic.
>>We use the standard 2.6.18-128.7.1.el5_lustre.1.8.1.1 kernel, Lustre runs
>>over IB, and no special parameters for the distributed FS were used.
>>
>>
>>On Wed Dec 16 '09 9:14pm, Alex Granovsky wrote
>>----------------------------------------------
>>>Dear Maryna,

>>>My present guess is as follows.

>>>All the low-level I/O routines used by Firefly are EINTR (Interrupted system call) safe. However, this only applies to the calls that can return EINTR according to the POSIX standard.
>>>In particular, this does not apply to lseek(), which is supposed to never return EINTR.

>>>The I/O calls used by the MP2 code are write(), lseek(), and read(), and the only one of them that is not protected is lseek().

>>>Frankly, we have never encountered this problem before; however, it seems to me that Lustre (unlike any other filesystem we have tested so far) may interrupt lseek() calls.
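
To make the distinction concrete, here is a minimal C sketch of the pattern being described; this is an illustration, not Firefly's actual source, and the function names are invented for the example:

#include <errno.h>
#include <unistd.h>
#include <sys/types.h>

/* EINTR-safe read(): if a signal (for example, one used by a p2p-style
   progress engine) interrupts the call, simply retry.  write() can be
   wrapped the same way. */
ssize_t safe_read(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* POSIX does not list EINTR among the errors lseek() may return, so a
   call like this is normally left unwrapped.  If the filesystem does
   return EINTR anyway, the error goes unhandled and the subsequent I/O
   runs at the wrong file offset. */
off_t plain_seek(int fd, off_t offset)
{
    return lseek(fd, offset, SEEK_SET);   /* no EINTR retry loop */
}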
>>>We can create a special build of Firefly for you to check for and fix this problem. In the meantime, the workaround would be to disable all the async I/O features of the MP2 gradient code (ASYNC and XASYNC of $mp2grd). If this does not help, the next step would be to use a special build of the p2p library that does not use signals.
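
For reference, the suggested input-file change would look roughly like the following; the exact keyword spelling for the $mp2grd group is an assumption here and should be checked against the Firefly manual:

 $mp2grd async=.f. xasync=.f. $end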

>>>However, the use of local filesystems is strongly preferred for better performance.

>>>Regards,
>>>Alex Granovsky
>>>
>>>
>>>On Wed Dec 16 '09 8:23pm, Maryna V. Krasovska wrote
>>>---------------------------------------------------
>>>>Dear Alex,

>>>>Thanks for reply.

>>>>1. firefly_71g_linux_impi_p4

>>>>2. It seems this does not matter.

>>>> Intel Core2/ Linux Firefly version running under Linux.
>>>> Running on Intel CPU:  Brand ID 0, Family 6, Model 23, Stepping 6
>>>> CPU Brand String    :  Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
>>>> CPU Features        :  CMOV, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, MWAIT, EM64T
>>>> Data cache size     :  L1 32 KB, L2 6144 KB, L3 0 KB
>>>> max    # of   cores/package :   4
>>>> max    # of threads/package :   4
>>>> max     cache sharing level :   2
>>>> Operating System successfully passed SSE support test.

>>>>PARALLEL VERSION (INTEL MPI) RUNNING USING 8 PROCESSES (NODES)

>>>>My run script:
>>>>PROG_PATH=/share/apps/firefly/
>>>>PROG_BIN=firefly
>>>>PROG=$PROG_PATH/$PROG_BIN
>>>>SCRATCH=/ptmp/scratch/pbstmp.$PBS_JOBID
>>>>MPIEXEC=/opt/mpiexec/bin/mpiexec
>>>>source /share/apps/intel/impi/bin/mpivars.sh
>>>>cd $SCRATCH
>>>>/opt/mpiexec/bin/mpiexec -v -kill -comm pmi $MPIEXEC -v -kill -comm pmi \
>>>>  $PROG -i $PBS_O_WORKDIR/$1 -o $PBS_O_WORKDIR/$2.log -r -f -ex \
>>>>  $PROG_PATH -b $PROG_PATH -t $SCRATCH/RUN

>>>>3.
>>>>Input:
>>>> $CONTRL MAXIT=100 mplevl=2 inttyp=hondo icut=10 runtyp=gradient d5=1 $END
>>>> $SYSTEM MWORDS=250 TIMLIM=100000 async=1 $END
>>>> $p2p p2p=.t. xdlb=.t. $end
>>>> $mp2 method=1 $end
>>>> $scf dirscf=.t. $end
>>>> $basis gbasis=n21 ngauss=3 ndfunc=0 $end
>>>> $DATA
>>>>C60 at PBE
>>>> DNH 2

>>>> CARBON      6.0    0.000000   0.699503   3.483274
>>>> CARBON      6.0    3.483274   0.000000   0.699503
>>>> CARBON      6.0    0.699503   3.483274   0.000000
>>>> CARBON      6.0    1.175728   1.426142   3.034186
>>>> CARBON      6.0    3.034186   1.175728   1.426142
>>>> CARBON      6.0    1.426142   3.034186   1.175727
>>>> CARBON      6.0    2.601870   2.307547   0.726640
>>>> CARBON      6.0    0.726640   2.601870   2.307547
>>>> CARBON      6.0    2.307547   0.726640   2.601870
>>>> $END

>>>>tail of the log:
>>>> FIRST  MP2 ENERGY/GRADIENT TEI HALF-TRANSFORMATION
>>>> # OF WORDS AVAILABLE      =  249989366
>>>> # OF WORDS USED           =    4404679
>>>> THRESHOLD FOR KEEPING HALF-TRANSFORMED 2E-INTEGRALS =  1.000E-09

>>>> OPENING FILE DASORT WITH    24117 LOGICAL RECORDS OF   49152 WORDS
>>>> DASORT WILL CONTAIN A MAXIMUM OF   18521856 PHYSICAL RECORDS OF LENGTH      64 WORDS

>>>>This works OK with NFS.
>>>>I found that setting $system ioflag(30) $end helps with the MP2 energy,
>>>>but not with the gradient...

>>>>Thanks,
>>>> Maryna

>>>>On Wed Dec 16 '09 7:41pm, Alex Granovsky wrote
>>>>----------------------------------------------
>>>>>Dear Maryna,

>>>>>could you please provide some more information:

>>>>>1. What was the exact archive name (i.e., CPU type, MPI, and linking type) of the Firefly distribution you are using?

>>>>>2. How many nodes were used, and what was the exact command line used to launch Firefly?

>>>>>3. The complete input as well as partial output would be very helpful.

>>>>>Regards,
>>>>>Alex Granovsky
>>>>>
>>>>>
>>>>>
>>>>>On Wed Dec 16 '09 3:01pm, Maryna V. Krasovska wrote
>>>>>---------------------------------------------------
>>>>>>Dear Dr. Granovsky,

>>>>>>I got the following error while trying to perform a method 1 MP2 run (either
>>>>>>energy or gradient) using the Lustre-1.8.1 file system for scratch files
>>>>>>with Firefly version 7.1.G, build number 5618.

>>>>>>---------------------------------------------------
>>>>>>I/O error is :: Interrupted system call

>>>>>>TID 29821 caught signal 11, exiting.

>>>>>>Dump of registers follows

>>>>>>eax :: 0x00000000, edx :: 0x00000001
>>>>>>ecx :: 0x64d2fec0, ebx :: 0x64cefe48
>>>>>>esi :: 0x56e0f690, edi :: 0x00000011
>>>>>>ebp :: 0xffcce050, esp :: 0xffcce000
>>>>>>eip :: 0x104a850e, eflags :: 0x00210246

>>>>>>cs �:: 0x0023
>>>>>>ds �:: 0x002b
>>>>>>es �:: 0x002b
>>>>>>ss �:: 0x002b
>>>>>>fs �:: 0x00d7
>>>>>>gs �:: 0x0063

>>>>>>Stack backtrace

>>>>>>esp :: 0xffcce050, ebp :: 0x00000169, eip :: 0x11b0ee0c
>>>>>>---------------------------------------------------

>>>>>>dmesg:

>>>>>>---------------------------------------------------
>>>>>>firefly[31408]: segfault at 0000000000006ca1 rip 00000000555beaf5 rsp
>>>>>>00000000ffce564c error 6
>>>>>>firefly[32702]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>>>00000000ff9e2980 error 4
>>>>>>firefly[32704]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>>>00000000ffc51b90 error 4
>>>>>>firefly[32703]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>>>00000000ffa5f940 error 4
>>>>>>firefly[328]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>>>00000000ffc4e950 error 4
>>>>>>firefly[330]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>>>00000000fff8c950 error 4
>>>>>>firefly[329]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>>>00000000ff8214c0 error 4
>>>>>>---------------------------------------------------

>>>>>>The output ends at shell #68 (of 180) of the first half-transformation.
>>>>>>No special I/O options were activated in the input file.
>>>>>>NFS works for scratch files on my cluster.
>>>>>>Other programs work well with Lustre.

>>>>>>Thanks for your support!
>>>>>>


Sat Jan 30 '10 8:51pm