This is a regression from the v3.0.x series to v4.x/master/v5. Minimal reproducer, which splits MPI_COMM_WORLD by node and prints the size of the resulting node-local communicator:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NODE, 0,
                        MPI_INFO_NULL, &comm);

    int lsize;
    MPI_Comm_size(comm, &lsize);
    fprintf(stderr, "local_size = %d\n", lsize);

    MPI_Finalize();
    return 0;
}
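For what it's worth, here is a slightly extended variant of the reproducer (my addition, not part of the original test) that also prints the host name and the world/local ranks, which makes it easy to check that the node-local split really groups two ranks per host:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int wrank, wsize;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    MPI_Comm comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NODE, 0,
                        MPI_INFO_NULL, &comm);

    int lrank, lsize;
    MPI_Comm_rank(comm, &lrank);
    MPI_Comm_size(comm, &lsize);

    /* One line per world rank, showing which host and node-local
       communicator it ended up in. With -np 4 across two hosts the
       expected output is local 0/2 and 1/2 on each host. */
    fprintf(stderr, "world %d/%d on %s: local %d/%d\n",
            wrank, wsize, host, lrank, lsize);

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}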
v3.0.6 run from tarball:
$ ./exports/bin/mpirun -prefix `pwd`/exports -mca pml ob1 -np 4 -host hostA:2,hostB:2 ./split
local_size = 2
local_size = 2
local_size = 2
local_size = 2
v4.x and master/v5 either hang or error out with pml/ob1, for example:
$ ./exports/bin/mpirun --prefix `pwd`/exports --np 4 --hostfile ./hostfile --mca pml ob1 ./split
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: hostB
PID: 160658
Message: connect() to X:1025 failed
Error: No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process. This
should not happen.
Your Open MPI job may now hang or fail.
Local host: hostB
PID: 160657
Message: connect() to X:1024 failed
Error: No route to host (113)
--------------------------------------------------------------------------
malloc debug: pml:ob1: mca_pml_ob1_match_completion_free: operation failed with code -12
*** An error occurred in Socket closed
*** reported by process [3261005825,3]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
*** An error occurred in Socket closed
*** reported by process [3261005825,2]
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and MPI will try to terminate your MPI job as well)
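As a debugging aid (my own sketch, not from the original report): a variant of the reproducer that switches MPI_COMM_WORLD to MPI_ERRORS_RETURN and checks the return code of MPI_Comm_split_type, which may show whether the failure surfaces through the split itself. Note the trace above reports the error on a NULL communicator, so the job may still abort before this check fires.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting. Assumption: the
       failure is delivered through MPI_COMM_WORLD's error handler; errors
       raised on a NULL communicator may bypass it. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm comm;
    int rc = MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NODE, 0,
                                 MPI_INFO_NULL, &comm);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Comm_split_type failed: %s\n", msg);
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    int lsize;
    MPI_Comm_size(comm, &lsize);
    fprintf(stderr, "local_size = %d\n", lsize);

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}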
This broke sometime during the v4.x series, since v4.0.0 built from the tarball seems to work.
It also fails with UCX, at least with UCX v1.7; I would need to try a more recent UCX release to verify that it is still an issue there.