cpu-set/cpu-list/pe-list no longer understand CPU range syntax #12235

Closed
openpmix/prrte
#1909
@nikitakuklev

Description

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

OpenMPI v5.0.1
PRTE 3.0.3rc1 (2024-01-11)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Spack [latest git]

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

n/a

Please describe the system on which you are running

  • Operating system/version: RHEL8
  • Computer hardware: 2x64 EPYC Milan
  • Network type: IB (Mellanox)

Details of the problem

I am trying to limit the execution of a program to a specific subset of cores, so that several instances can run on a single node as part of the same cluster job.

With V4, this works: mpirun -np 64 --cpu-set 64-127 (i.e., run entirely on CPU 1)

With V5, the manual still states that "The list is comprised of comma-delimited ranges of CPUs to use for this job". However, something seems to have gone wrong at the PMIx layer: PRRTE cannot parse the core range syntax. The final error, produced by both mpirun and prterun (PRRTE converts cpu-set to pe-list), is:

mpirun --cpu-set=1-2 --display bind,map hostname
OR prterun -v --map-by pe-list=1-2 --display bind,map hostname  

--------------------------------------------------------------------------
The specified mapping directive is not recognized:

  Directive: pe-list=1-2

Please check for a typo or ensure that the directive is a supported
one.

Using commas works, but of course I don't want to specify 64 cores explicitly.

prterun -v --map-by pe-list=1,2 --display bind,map hostname

========================   JOB MAP   ========================
Data for JOB prterun-ilogin4-2019644@1 offset 0 Total slots allocated 128
    Mapping policy: PE-LIST:NOOVERSUBSCRIBE  Ranking policy: SLOT Binding policy: CORE
    Cpu set: 1,2  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: ilogin4  Num slots: 128  Max slots: 0    Num procs: 2
        Process jobid: prterun-ilogin4-2019644@1 App: 0 Process rank: 0 Bound: package[0][core:1-2]
        Process jobid: prterun-ilogin4-2019644@1 App: 0 Process rank: 1 Bound: package[0][core:1-2]

=============================================================
[<>:2019644] Rank 0 bound to package[0][core:1-2]
[<>:2019644] Rank 1 bound to package[0][core:1-2]
...

Hopefully this is a bug, or there is another range syntax. If the latter, it would be useful to note it in the manual somewhere.
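
Since the comma-delimited form is accepted, a possible stopgap is to expand the range into an explicit list before passing it to mpirun. Below is a minimal sketch of that idea; the helper name expand_cpu_ranges is hypothetical (it is not part of Open MPI or PRRTE), and it assumes only the simple "a-b" and "a,b" forms shown above.

```python
def expand_cpu_ranges(spec: str) -> str:
    """Expand a range spec such as '0-3,8' into the comma list '0,1,2,3,8'.

    Hypothetical helper for illustration only; not an Open MPI/PRRTE API.
    """
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"descending range: {part!r}")
            cpus.extend(range(lo, hi + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return ",".join(str(c) for c in cpus)


# Build the argument list for the V4-style invocation from above,
# using the expanded list instead of the 64-127 range.
cmd = ["mpirun", "-np", "64", "--cpu-set", expand_cpu_ranges("64-127")]
```

Here expand_cpu_ranges("1-2") yields "1,2", the form that the pe-list directive accepted in the run above. Whether a 64-entry list behaves the same as the V4 range has not been verified here.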
