Description
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
OpenMPI v5.0.1
PRRTE 3.0.3rc1 (2024-01-11)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Spack [latest git]
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
n/a
Please describe the system on which you are running
- Operating system/version: RHEL8
- Computer hardware: 2x64 EPYC Milan
- Network type: IB (Mellanox)
Details of the problem
I am trying to limit the execution of a program to a specific subset of cores, so that several instances can run on a single node as part of the same cluster job.
With v4, this works: mpirun -np 64 --cpu-set 64-127
(i.e., run on the whole of CPU 1)
With v5, the manual still states that "The list is comprised of comma-delimited ranges of CPUs to use for this job". However, something seems to have gone wrong at the PMIx/PRRTE layer: PRRTE cannot parse the core range syntax. The final error, produced by both mpirun and prterun (PRRTE converts cpu-set to pe-list), is:
mpirun --cpu-set=1-2 --display bind,map hostname
OR prterun -v --map-by pe-list=1-2 --display bind,map hostname
--------------------------------------------------------------------------
The specified mapping directive is not recognized:
Directive: pe-list=1-2
Please check for a typo or ensure that the directive is a supported
one.
Using commas works, but of course I don't want to specify 64 cores explicitly.
prterun -v --map-by pe-list=1,2 --display bind,map hostname
======================== JOB MAP ========================
Data for JOB prterun-ilogin4-2019644@1 offset 0 Total slots allocated 128
Mapping policy: PE-LIST:NOOVERSUBSCRIBE Ranking policy: SLOT Binding policy: CORE
Cpu set: 1,2 PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
Data for node: ilogin4 Num slots: 128 Max slots: 0 Num procs: 2
Process jobid: prterun-ilogin4-2019644@1 App: 0 Process rank: 0 Bound: package[0][core:1-2]
Process jobid: prterun-ilogin4-2019644@1 App: 0 Process rank: 1 Bound: package[0][core:1-2]
=============================================================
[<>:2019644] Rank 0 bound to package[0][core:1-2]
[<>:2019644] Rank 1 bound to package[0][core:1-2]
...
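Since the comma-separated form is accepted, one possible stopgap is to let the shell expand the range into a comma list. This is a minimal sketch, not a fix: `expand_range` is a hypothetical helper (not part of Open MPI or PRRTE), and it assumes GNU coreutils `seq`, whose `-s` option joins the numbers without a trailing separator. I have only seen the comma form work with the two-core case above, so the 64-core case is untested.

```shell
# Hypothetical helper: expand "LO-HI" into "LO,LO+1,...,HI".
# Assumes GNU seq (e.g. on RHEL8); other seq implementations may
# append a trailing separator.
expand_range() {
    lo=${1%-*}   # text before the dash
    hi=${1#*-}   # text after the dash
    seq -s, "$lo" "$hi"
}

# Intended use, mirroring the working comma-separated command:
# prterun -v --map-by pe-list="$(expand_range 1-2)" --display bind,map hostname
```

The helper handles a single LO-HI range only. A mixed list such as 0-3,8-11 would need to be split on commas first.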
Hopefully this is a bug, or there is another range syntax. If the latter, it would be useful to note it in the manual somewhere.