Description
In a setup (using Open MPI 4.1.3) with >14,000 processes, we noticed an unusually long initialization time. While investigating this, we found that ~60 consecutive calls to MPI_Group_difference involving a group that contained all processes of the run took several minutes. I suspect that the implementation of ompi_group_dense_overlap (used by MPI_Group_difference) is suboptimal for such cases, because it appears to use an algorithm with O(n²) time complexity.
We were able to replicate similar functionality using a collective MPI_Allreduce, which was many times faster, even though MPI_Group_difference is a local operation.
A more sophisticated algorithm (for example, one that uses sorted lists of the processes in each group) should be able to improve the performance significantly.
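
To illustrate the idea, here is a minimal sketch in plain C, not Open MPI code. It assumes each group member can be represented by a unique integer identifier (Open MPI's internal group structures differ), and the helper name group_difference_sorted is hypothetical. It sorts a copy of the second group once and then binary-searches it for each member of the first group, which gives O((n + m) log m) instead of O(n·m) while keeping the result in the order of the first group, as MPI_Group_difference requires.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort/bsearch on int process identifiers. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper (NOT an Open MPI function).
 * Assumption: every process is identified by a unique int.
 *
 * Writes to `out` all members of g1 that are not in g2, in the order
 * they appear in g1. `out` must have room for n1 entries.
 * Returns the number of entries written, or -1 if allocation fails.
 */
static int group_difference_sorted(const int *g1, int n1,
                                   const int *g2, int n2, int *out)
{
    int *sorted = NULL;
    int count = 0;

    assert(n1 >= 0 && n2 >= 0);

    if (n2 > 0) {
        /* Sort a private copy so the caller's g2 stays untouched. */
        sorted = malloc((size_t)n2 * sizeof *sorted);
        if (sorted == NULL) {
            return -1;
        }
        memcpy(sorted, g2, (size_t)n2 * sizeof *sorted);
        qsort(sorted, (size_t)n2, sizeof *sorted, cmp_int);
    }

    for (int i = 0; i < n1; i++) {
        /* One O(log n2) lookup per member of g1. */
        const void *hit = NULL;
        if (n2 > 0) {
            hit = bsearch(&g1[i], sorted, (size_t)n2,
                          sizeof *sorted, cmp_int);
        }
        if (hit == NULL) {
            out[count++] = g1[i];
        }
    }

    free(sorted);
    return count;
}
```

For example, with g1 = {7, 3, 9, 1, 5} and g2 = {5, 8, 3}, the helper returns 3 and fills out with 7, 9, 1. A hash set would serve equally well in place of the sorted copy; sorting is used here only because it needs nothing beyond the C standard library.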