Hard to know for certain, but this looks like an OS problem rather than
a Wien2k issue. Things to check:
1) Use ompi_info and check, carefully, that the compilation
options/libraries used for openmpi are the same as what you are using
to compile Wien2k.
2) For 10.1, ulimit -s is not needed for mpi (and in any case does
nothing with openmpi) as this is done in software in Wien2kutils.c.
Make sure that you are exporting environment variables in your
mpirun call, for instance use in parallel_options (all on one line)
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
3) Check the size of the job you are running, e.g. via top, by looking
in case.output1_X, by using "lapw1 -p -nmat_only", using ganglia or
nmon, cat /proc/meminfo (or anything else you have available).
Particularly with openmpi but with some other flavors as well, if you
are asking for too much memory and/or have too many processes running,
problems occur. A race condition can also occur in openmpi which makes
this problem worse (maybe patched in latest version, I am not sure).
4) Check carefully (twice) for format errors in the input files. It
turns out that ifort has its own signal traps, so a child can exit
without correctly calling mpi_abort. A race condition can then occur
with openmpi: the parent tries to find a child, the child does not
exist, the parent waits and then keeps going....
5) Check the OS logs in /var/log (beyond my competence). You may have
too high an NFS load, bad infiniband/myrinet (recent OFED?) etc. Use
-assu buff (-assume buffered_io) in the compilation options to reduce
the NFS load.
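
For point 3, one rough way to check memory headroom on a node is to read
/proc/meminfo and compare what is free against what the mpi processes are
expected to need. The following is only a minimal Python sketch, not part
of Wien2k: the function names, the 0.8 safety factor and the per-process
memory figure are all hypothetical, and the real per-process estimate has
to come from whatever you trust (top, case.output1_X, etc.).

```python
import re


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        # Lines look like "MemTotal:       16384256 kB"
        m = re.match(r"([^:]+):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def fits_in_memory(meminfo, nprocs, kb_per_proc, safety=0.8):
    """Return True if nprocs * kb_per_proc stays below safety * usable memory.

    Usable memory is approximated as MemFree + Buffers + Cached, a crude
    estimate (an assumption of this sketch) that also works on older
    kernels which do not report MemAvailable.
    """
    usable = (meminfo.get("MemFree", 0)
              + meminfo.get("Buffers", 0)
              + meminfo.get("Cached", 0))
    return nprocs * kb_per_proc <= safety * usable
```

On a node you would pass in open("/proc/meminfo").read() and the number of
mpi processes placed on that node; a False result suggests using fewer
processes per node or spreading the job over more nodes.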
Post by bothina hamad
Dear Wien users,
When running optimisation jobs under torque queuing system for anything but
The job runs successfully for many cycles using lapw0, lapw1, lapw2 (parallel), but eventually the 'mom-superior' node (the one that launches mpirun) stops communicating with the other nodes involved in the job.
At the console of this node the load is correct (4 for a quad processor) and there is free memory, but the node can no longer access any NFS mounts and can no longer ping other nodes in the cluster. We are eventually forced to reboot the node and kill the job from the cluster queuing system (the job enters the 'E' state and stays there; we need to stop pbs_server, manually remove the job files from /var/spool/torque/server_priv/jobs, then restart pbs_server).
A similar problem is encountered on a larger cluster (same install procedure), with the added problem that the .dayfile reports that for lapw2 only the 'mom-superior' node is doing work (even though, on logging into the other job nodes, top reports the correct load and 100% CPU use).
DOS calculation seems to work properly on both clusters...
We have used a modified x_lapw that you provided earlier.
We have been inserting 'ulimit -s unlimited' into job-scripts
We are using...
Centos5.3 x86_64
Intel compiler suite with mkl v11.1/072
openmpi-1.4.2, compiled with intel compilers
fftw-2.1.5, compiled with intel compilers and openmpi above
Wien2k v10.1
Optimisation jobs for small systems complete OK on both clusters.
The working directories for this job are large (>2GB).
Please let us know what
files we could send you from these that may be helpful for diagnosis...
Best regards
Bothina
_______________________________________________
Wien mailing list
Wien at zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.