
Install Request: OpenMPI 4.0.2 or later (stale shared memory segments bug) #337

Closed
@heatherkellyucl


Related to IN:04155266 on Legion, but this is a general issue, as our most recent OpenMPI installs are 3.1.4 and 3.1.5 (beta module). OpenMPI 3 releases after 3.1.1 have a bug where vader_segment.x shared memory files are left behind (only or mostly after an aborted run?). If they exist, a new run on those nodes fails with:

node-o08a-029: Unable to allocate shared memory for intra-node messaging.
node-o08a-029: Delete stale shared memory files in /dev/shm.

Note that /dev/shm is not full in this case.
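
As a stopgap until a fixed OpenMPI is installed, the leftover files can be listed per node before cleaning up. This is a minimal sketch, not a tested cleanup tool: the function name `find_stale_vader_segments` is hypothetical, and it assumes the leftover files sit directly in /dev/shm with names starting with `vader_segment.`. It only lists files owned by the given user and deletes nothing.

```python
import os
from pathlib import Path


def find_stale_vader_segments(shm_dir="/dev/shm", uid=None):
    """Return sorted paths of vader_segment.* files in shm_dir owned by uid.

    Assumption: leftover segments are regular files whose names start with
    "vader_segment." (the exact suffix is not relied on).
    """
    if uid is None:
        uid = os.getuid()  # default to the user running the script
    stale = []
    for path in Path(shm_dir).glob("vader_segment.*"):
        try:
            if path.is_file() and path.stat().st_uid == uid:
                stale.append(path)
        except FileNotFoundError:
            # The file vanished between listing and stat; skip it.
            continue
    return sorted(stale)


if __name__ == "__main__":
    # Print candidates only; review the list before removing anything.
    for path in find_stale_vader_segments():
        print(path)
```

A segment belonging to a job that is still running on the node would match too, so this is only meaningful when none of the user's MPI jobs are active there.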

OpenMPI 4.0.2 and later fix a number of vader issues and use PMIx 3 rather than 2, which has better hooks for job shutdown cleanup.

Note: 4.0.x deprecates the openib BTL in favour of UCX.
https://www.open-mpi.org/software/ompi/major-changes.php
https://www.open-mpi.org/faq/?category=openfabrics#run-ucx
https://www.open-mpi.org/faq/?category=building#build-p2p

The FAQ also suggests building with --without-verbs when using UCX.

See open-mpi/ompi#6322 and open-mpi/ompi#7220 for the bug.
