Maryna V. Krasovska
mkrasovska@gmail.com
In fact, the Lustre file system we use on our cluster is much faster
than the local HDDs (only 1 or 2 disks for eight cores; how fast could
that be?). By the way, as far as I know, Lustre is used for storage on
Cray XT machines (though I have no access to such machines), so our
configuration might not be so exotic.
We use the standard 2.6.18-128.7.1.el5_lustre.1.8.1.1 kernel; Lustre runs
over IB, and no special parameters for the distributed FS were used.
On Wed Dec 16 '09 9:14pm, Alex Granovsky wrote
----------------------------------------------
>Dear Maryna,
>My present guess is as follows.
>All the low-level I/O routines used by Firefly are EINTR (Interrupted system call) safe. However, this only applies to the calls that can return EINTR according to the POSIX standard.
>In particular, this does not apply to lseek(), as it is assumed never to return EINTR.
>The I/O calls used by the MP2 code are write(), lseek(), and read(), and the only one of them that is not protected is lseek().
>Frankly, we have never met this problem before; however, it seems to me that Lustre (unlike any other filesystem we have tested so far) may interrupt lseek() calls.
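The EINTR protection described above is the standard POSIX retry idiom. A minimal sketch in C (an illustration only, not Firefly's actual source) shows why read()/write() can be made safe while an unprotected lseek() cannot survive an interruption:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Minimal sketch of an EINTR-safe write loop, illustrating the kind of
 * protection described above for read()/write().  NOT Firefly's actual
 * source -- just the standard POSIX retry idiom. */
static ssize_t safe_write(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;             /* genuine I/O error */
        }
        p += n;
        left -= (size_t)n;
    }
    return (ssize_t)len;
}

/* lseek() gets no such wrapper, because POSIX does not list EINTR among
 * its possible errors.  If a filesystem interrupts it anyway (as Lustre
 * appears to here), the unprotected call returns -1 with errno == EINTR
 * and the caller sees "Interrupted system call". */
```

Since lseek() is not supposed to fail this way, code that checks its return value but never retries on EINTR will treat the interruption as a fatal I/O error.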
>We can create a special build of Firefly for you to check for and fix this problem. In the meantime, the solution would be to disable all the async I/O features of the MP2 gradient code (ASYNC and XASYNC of $mp2grd). If this does not help, the next step would be to use the special build of the p2p library that does not use signals.
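For reference, a sketch of the input change this suggests. The keyword spellings below are assumed from the ASYNC/XASYNC names mentioned above and the `.t.`/`.f.` boolean style used elsewhere in the input; consult the Firefly manual for the exact syntax:

```
 $mp2grd async=.f. xasync=.f. $end
```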
>However, the use of local filesystems is strongly preferred for better performance.
>Regards,
>Alex Granovsky
>
>
>On Wed Dec 16 '09 8:23pm, Maryna V. Krasovska wrote
>---------------------------------------------------
>>Dear Alex,
>>Thanks for your reply.
>>1. firefly_71g_linux_impi_p4
>>2. Seems this does not matter.
>> Intel Core2 / Linux Firefly version running under Linux.
>> Running on Intel CPU:  Brand ID 0, Family 6, Model 23, Stepping 6
>> CPU Brand String    :  Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
>> CPU Features        :  CMOV, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, MWAIT, EM64T
>> Data cache size     :  L1 32 KB, L2 6144 KB, L3 0 KB
>> max    # of   cores/package :   4
>> max    # of threads/package :   4
>> max     cache sharing level :   2
>> Operating System successfully passed SSE support test.
>>PARALLEL VERSION (INTEL MPI) RUNNING USING 8 PROCESSES (NODES)
>>My run script:
>>PROG_PATH=/share/apps/firefly/
>>PROG_BIN=firefly
>>PROG=$PROG_PATH/$PROG_BIN
>>SCRATCH=/ptmp/scratch/pbstmp.$PBS_JOBID
>>MPIEXEC=/opt/mpiexec/bin/mpiexec
>>source /share/apps/intel/impi/bin/mpivars.sh
>>cd $SCRATCH
>>/opt/mpiexec/bin/mpiexec -v -kill -comm pmi $MPIEXEC -v -kill -comm
>>pmi $PROG -i $PBS_O_WORKDIR/$1 -o $PBS_O_WORKDIR/$2.log -r -f -ex
>>$PROG_PATH -b $PROG_PATH -t $SCRATCH/RUN
>>3.
>>Input:
>> $CONTRL MAXIT=100 mplevl=2 inttyp=hondo icut=10 runtyp=gradient d5=1 $END
>> $SYSTEM MWORDS=250 TIMLIM=100000 async=1 $END
>> $p2p p2p=.t. xdlb=.t. $end
>> $mp2 method=1 $end
>> $scf dirscf=.t. $end
>> $basis gbasis=n21 ngauss=3 ndfunc=0 $end
>> $DATA
>>C60 at PBE
>> DNH 2
>> CARBON      6.0    0.000000   0.699503   3.483274
>> CARBON      6.0    3.483274   0.000000   0.699503
>> CARBON      6.0    0.699503   3.483274   0.000000
>> CARBON      6.0    1.175728   1.426142   3.034186
>> CARBON      6.0    3.034186   1.175728   1.426142
>> CARBON      6.0    1.426142   3.034186   1.175727
>> CARBON      6.0    2.601870   2.307547   0.726640
>> CARBON      6.0    0.726640   2.601870   2.307547
>> CARBON      6.0    2.307547   0.726640   2.601870
>> $END
>>tail of the log:
>> FIRST MP2 ENERGY/GRADIENT TEI HALF-TRANSFORMATION
>> # OF WORDS AVAILABLE      =  249989366
>> # OF WORDS USED           =    4404679
>> THRESHOLD FOR KEEPING HALF-TRANSFORMED 2E-INTEGRALS =  1.000E-09
>> OPENING FILE DASORT WITH   24117 LOGICAL RECORDS OF  49152 WORDS
>> DASORT WILL CONTAIN A MAXIMUM OF  18521856 PHYSICAL RECORDS OF
>>LENGTH  64 WORDS
>>This works OK with NFS.
>>I found that setting $system ioflag(30) $end helps with the MP2 energy,
>>but not the gradient...
>>Thanks,
>> Maryna
>>On Wed Dec 16 '09 7:41pm, Alex Granovsky wrote
>>----------------------------------------------
>>>Dear Maryna,
>>>could you please provide some more information:
>>>1. What was the exact archive name (i.e., CPU type, MPI, and linking type) of the Firefly distribution you are using?
>>>2. How many nodes were used, and what was the exact command-line string used to launch Firefly?
>>>3. The complete input as well as partial output would be very helpful.
>>>Regards,
>>>Alex Granovsky
>>>
>>>
>>>
>>>On Wed Dec 16 '09 3:01pm, Maryna V. Krasovska wrote
>>>---------------------------------------------------
>>>>Dear Dr. Granovsky,
>>>>I got the following error while trying to perform a method 1 MP2 run
>>>>(either energy or gradient) using the Lustre-1.8.1 file system for
>>>>scratch files with Firefly version 7.1.G, build number 5618.
>>>>---------------------------------------------------
>>>>I/O error is :: Interrupted system call
>>>>TID 29821 caught signal 11, exiting.
>>>>Dump of registers follows
>>>>eax :: 0x00000000, edx :: 0x00000001
>>>>ecx :: 0x64d2fec0, ebx :: 0x64cefe48
>>>>esi :: 0x56e0f690, edi :: 0x00000011
>>>>ebp :: 0xffcce050, esp :: 0xffcce000
>>>>eip :: 0x104a850e, eflags :: 0x00210246
>>>>cs �:: 0x0023
>>>>ds �:: 0x002b
>>>>es �:: 0x002b
>>>>ss �:: 0x002b
>>>>fs �:: 0x00d7
>>>>gs �:: 0x0063
>>>>Stack backtrace
>>>>esp :: 0xffcce050, ebp :: 0x00000169, eip :: 0x11b0ee0c
>>>>---------------------------------------------------
>>>>dmesg:
>>>>---------------------------------------------------
>>>>firefly[31408]: segfault at 0000000000006ca1 rip 00000000555beaf5 rsp
>>>>00000000ffce564c error 6
>>>>firefly[32702]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ff9e2980 error 4
>>>>firefly[32704]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ffc51b90 error 4
>>>>firefly[32703]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ffa5f940 error 4
>>>>firefly[328]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ffc4e950 error 4
>>>>firefly[330]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000fff8c950 error 4
>>>>firefly[329]: segfault at 0000000000000004 rip 00000000003b9e33 rsp
>>>>00000000ff8214c0 error 4
>>>>---------------------------------------------------
>>>>The output ends at shell #68 (of 180) of the first half-transformation.
>>>>No special I/O options were activated in the input file.
>>>>NFS works for scratch files on my cluster.
>>>>Other programs work well with Lustre.
>>>>Thanks for your support!
>>>>