Discussion:
[Wien] error in parallel lapw2
BUSHRA SABIR
2018-10-20 07:58:28 UTC
Permalink
Dear Peter Blaha and wien2k users
I am facing a problem with parallel execution of my job script. I am working on LaXO3 materials. Initialization is OK, but when I submitted the job file on the cluster for parallel execution with the command line runsp_lapw -cc 0.001 -ec 0.0001 -i 40 -p,

the following error appears (cat *.error):

'LAPW2' - can't open unit: 18                                                
 'LAPW2' -        filename: LAO.vspup                                         
 'LAPW2' -          status: old          form: formatted                      
**  testerror: Error in Parallel LAPW2

The file LAO.vspup is missing; I think it is automatically generated during parallel lapw2.

I checked testpara1_lapw:

#####################################################
#                     TESTPARA1                     #
#####################################################

Sat Oct 20 00:22:39 PDT 2018

    lapw1para has finished
and testpara2_lapw:

#####################################################
#                     TESTPARA1                     #
#####################################################

Sat Oct 20 00:22:39 PDT 2018

    lapw1para has finished

At the end of the case.dayfile, the following error is shown:
0.088u 0.060s 0:05.14 2.7%    0+0k 0+288io 0pf+0w
   lapw2 -up -p          (23:56:15) running LAPW2 in parallel mode
**  LAPW2 crashed!
0.048u 0.312s 0:00.72 48.6%    0+0k 11386+96io 36pf+0w
error: command   /global/common/sw/cray/cnl6/haswell/wien2k/17.1/intel/17.0.2.174/wkteycp/lapw2para -up uplapw2.def   failed

I went through the mailing list but could not find a solution.

Bushra
PhD student
Gavin Abo
2018-10-20 14:31:14 UTC
Permalink
1. It looks like you are using WIEN2k 17.1.  Some serious bugs were
found in that version [
http://susi.theochem.tuwien.ac.at/reg_user/updates/ ]. Consider
installing and using WIEN2k 18.2, which contains the fixes for them. Also,
WIEN2k 18.2 can be patched according to previous mailing list posts [
https://github.com/gsabo/WIEN2k-Patches/tree/master/18.2 ].

2. Regarding your "file LAO.vspup is missing, i think it automatically
generated during parallel lapw2": the case.vspup file should have been
generated by lapw0.  See Table 4.3 on page 36 of the WIEN2k 18.2
usersguide [
http://susi.theochem.tuwien.ac.at/reg_user/textbooks/usersguide.pdf ],
which shows that the program LAPW0 generates the necessary case.vsp(up/dn).

3. I suggest you investigate why the LAO.vspup "can't open unit: 18"
error happens with lapw2 but not with lapw1.  For example, did LAO.vspup
exist with a non-zero file size after lapw0 completed, did it still exist
with a non-zero file size for lapw1, and did it get deleted, shrink to
zero size, or lose its node connection(s) just before lapw2?
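One way to narrow this down is to check the file between SCF steps; a minimal shell sketch, run from the case directory (the LAO case name is taken from the error above), might look like:

```shell
# Check whether the spin-up potential file exists and is non-empty.
# Run this after lapw0, again after lapw1, and just before lapw2.
f=LAO.vspup
if [ -s "$f" ]; then
    echo "$f present, size: $(wc -c < "$f") bytes"
else
    echo "$f missing or empty"
fi
```

Comparing the output at each step shows whether the file was never written by lapw0 or disappeared later.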

Is your .machines file set up to run k-point parallel, mpi parallel, or a mix
of both?  The job script that creates the .machines file on the fly, which
would show this, was not provided.
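For reference, a hypothetical .machines file for pure k-point parallelism over two nodes could look like the following (node01/node02 are placeholder host names; see the parallel-execution chapter of the usersguide for the exact format):

```
1:node01
1:node01
1:node02
1:node02
granularity:1
extrafine:1
```

For mpi parallelism the job lines instead carry a process count, e.g. 1:node01:16, and lapw0 gets its own lapw0: line.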

If mpi parallel, using WIEN2k 18.2:

1. Run: ./siteconfig
2. Select Compiling Options, Selection: O
3. Select Parallel options, Selection: PO
4. What is MPIRUN set to?
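For comparison, the MPIRUN setting lives in $WIENROOT/parallel_options; the two forms quoted in this thread look roughly like the following (the srun flags are site-specific):

```
# classic mpirun form (WIEN2k 17.1 on this cluster):
setenv WIEN_MPIRUN "mpirun -n _NP_ -machinefile _HOSTS_ _EXEC_"
# slurm form (WIEN2k 18.2 with ifort + slurm):
setenv WIEN_MPIRUN "srun -K -N_nodes_ -n_NP_ -r_offset_ _PINNING_ _EXEC_"
```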

You also might check your mpirun command and talk with your cluster
administrator to see if a supported mpi run command is being used for
the system [
https://www.mail-archive.com/***@zeus.theochem.tuwien.ac.at/msg17628.html
].

Have you checked the standard output/error file?  This file name can
vary from one system to another, so you have to check your
scheduling/queue system documentation to see what the default file(s)
are called, or use an option to name it yourself [ for example,
https://www.mail-archive.com/***@zeus.theochem.tuwien.ac.at/msg18080.html
].  If there is an mpi run error, it usually shows up in that file.
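For example, on a SLURM system the scheduler's stdout/stderr lands in slurm-&lt;jobid&gt;.out by default; a quick hedged check (adjust the file name pattern for your queue system) might be:

```shell
# Search the scheduler's output file and WIEN2k error files for failures.
# "slurm-*.out" is SLURM's default naming; other schedulers differ.
grep -i -E "error|denied|fatal" slurm-*.out *.error 2>/dev/null
```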

You also might have to check the hidden dot files [
https://www.mail-archive.com/***@zeus.theochem.tuwien.ac.at/msg17317.html
] and output files (like case.output0, case.output1, etc.).
Dr. K. C. Bhamu
2018-10-20 18:37:37 UTC
Permalink
Dear Gavin,
(updated)
I am writing on behalf of Ms. Bushra, as she is not able to reply for now,
after some tests on the same cluster with WIEN2k versions 17.1 and 18.2.

The actual error that we see is "/usr/common/nsg/bin/mpirun: Permission
denied", which can probably be solved only by the cluster admin.

For WIEN2k 17.1, the mpirun command was defined as "mpirun -n _NP_
-machinefile _HOSTS_ _EXEC_".

In one of the threads, Prof. Peter suggested using "ifort + slurm".

Yes, I just installed WIEN2k 18.2 at NERSC with the ifort + slurm system
environment,

and the mpirun command is now "srun -K -N_nodes_ -n_NP_ -r_offset_
_PINNING_ _EXEC_".

But I still face the same error.

The error is the same, and it doesn't matter whether we use mpirun or srun
[1]; only the word srun or mpirun changes in the error.


In the past I faced the same error, and only the cluster admin could solve
it, so let us first write to the cluster admin and then update here with the
final outcome.

If you have any advice that could help us get rid of this issue, please let
us know.

[1]
srun: error: No hardware architecture specified (-C)!
srun: error: Unable to allocate resources: Unspecified error
srun: fatal: --relative option invalid for job allocation request
srun: error: No hardware architecture specified (-C)!
srun: error: Unable to allocate resources: Unspecified error
LAO.scf1up_1: No such file or directory.
grep: No match.
srun: fatal: --relative option invalid for job allocation request
srun: error: No hardware architecture specified (-C)!
srun: error: Unable to allocate resources: Unspecified error
LAO.scf1dn_1: No such file or directory.
grep: No match.
LAPW2 - Error. Check file lapw2.error
cp: cannot stat '.in.tmp': No such file or directory
grep: No match.
grep: No match.
grep: No match.
stop error
_______________________________________________
Wien mailing list
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
Gavin Abo
2018-10-20 18:54:29 UTC
Permalink
Are you using the NERSC cluster mentioned at:

http://www.nersc.gov/users/software/applications/materials-science/wien2k/

If so, I see:

Need Help?

Consulting and questions

https://help.nersc.gov
1-800-66-NERSC, option 3
or 510-486-8613 Monday - Friday 8-5 Pacific

If that is the case, it would be better to ask that cluster's help desk.

The error says "srun: error: No hardware architecture specified (-C)!".

You probably need to add "#SBATCH -C haswell" (or one of the other
architectures) to the job script to remove that error, as mentioned at:

http://www.nersc.gov/users/computational-systems/cori/running-jobs/batch-jobs/
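A minimal SLURM job-script header along those lines might look as follows; the node count, wall time, and output-file name are placeholders for illustration, and only the -C line addresses the error above:

```shell
#!/bin/bash
#SBATCH -C haswell        # specify the hardware architecture (addresses the -C error)
#SBATCH -N 2              # number of nodes (placeholder)
#SBATCH -t 01:00:00       # wall time (placeholder)
#SBATCH -o wien2k.%j.out  # name the standard output/error file yourself

runsp_lapw -cc 0.001 -ec 0.0001 -i 40 -p
```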