
MPI_Comm_split_type() fails or hangs when run with processes across node #9010

Closed

Description

@awlauria

This is a regression from the v3.0.x series to v4/master/v5.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    /* Split COMM_WORLD into one communicator per node
     * (OMPI_COMM_TYPE_NODE is an Open MPI-specific split type). */
    MPI_Comm comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NODE, 0, MPI_INFO_NULL, &comm);

    int lsize;
    MPI_Comm_size(comm, &lsize);
    fprintf(stderr, "local_size = %d\n", lsize);

    MPI_Finalize();
    return 0;
}
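
For reference, the reproducer was presumably built with the compiler wrapper from the same install, along these lines (split.c is an assumed source filename; the binary name ./split matches the run commands below):

$ ./exports/bin/mpicc -o split split.c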

v3.0.6 run from tarball:

$ ./exports/bin/mpirun -prefix `pwd`/exports -mca pml ob1 -np 4 -host hostA:2,hostB:2 ./split
local_size = 2
local_size = 2
local_size = 2
local_size = 2
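
The --hostfile runs below assume a hostfile equivalent to the -host hostA:2,hostB:2 spec above; a minimal sketch (hostA/hostB stand in for the actual node names):

hostA slots=2
hostB slots=2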

v4.x and master/v5 builds either hang or error out with pml/ob1, for example:

$ ./exports/bin/mpirun --prefix `pwd`/exports --np 4 --hostfile ./hostfile --mca pml ob1 ./split
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: hostB
  PID:        160658
  Message:    connect() to X:1025 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: hostB
  PID:        160657
  Message:    connect() to  X:1024 failed
  Error:      No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
*** An error occurred in Socket closed
*** reported by process [3261005825,3]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)
*** An error occurred in Socket closed
*** reported by process [3261005825,2]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and MPI will try to terminate your MPI job as well)

This broke sometime in the v4 timeframe, as v4.0.0 from the tarball seems to work.

It also fails with pml/ucx, at least with UCX v1.7. I would need to try a more recent UCX release to verify that it is still an issue.
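
Not a fix for the regression itself, but the "No route to host" errors above suggest the TCP BTL may be selecting an interface that is not routable between the two nodes. One way to rule that out is to restrict the BTL to a known-reachable interface via btl_tcp_if_include (eth0 here is only a placeholder for whichever interface actually connects hostA and hostB):

$ ./exports/bin/mpirun --prefix `pwd`/exports --np 4 --hostfile ./hostfile \
      --mca pml ob1 --mca btl_tcp_if_include eth0 ./split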
