Call to MPI_Finalized hangs when called inside of MPI_Finalize #5084

AndrewGaspar · 2018-04-19T23:26:37Z

Background information

My program creates some keyvals with delete functions. Inside one of the delete functions, a call to MPI_Finalized is made. If an attribute associated with the keyval is deleted during the call to MPI_Finalize, the call to MPI_Finalized hangs because it attempts to acquire (what I assume is) the same lock used to check if MPI is finalized. ~~It's not clear from the spec that MPI_Finalized is disallowed when called in the course of an MPI_Finalize call.~~ See EDIT.

I verified the recursive lock issue in a debugger.
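To illustrate the failure mode described above, here is a minimal sketch in plain C. It is not Open MPI's code: every name in it (`lock_held`, `toy_try_lock`, `toy_finalized_query`, `locked_finalize`) is hypothetical, and it assumes only that the finalize path holds a non-recursive lock while it runs the delete callback. A try-lock stands in for the blocking lock so the example records the problem instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a non-recursive lock guarding init/finalize state. */
static atomic_flag lock_held = ATOMIC_FLAG_INIT;

/* Set when the "finalized" query finds the lock already taken. */
static bool query_would_block = false;

typedef void (*toy_delete_fn)(void);

/* Returns true if the lock was acquired, false if it was already held. */
bool toy_try_lock(void) { return !atomic_flag_test_and_set(&lock_held); }

void toy_unlock(void) { atomic_flag_clear(&lock_held); }

/* Models a "finalized" query that takes the lock before reading the state. */
void toy_finalized_query(void) {
    if (!toy_try_lock()) {
        /* A blocking, non-recursive lock would wait here forever. */
        query_would_block = true;
        return;
    }
    toy_unlock();
}

/* Models finalize holding the lock while it runs a delete callback. */
void locked_finalize(toy_delete_fn delete_fn) {
    assert(delete_fn != NULL);
    bool acquired = toy_try_lock();
    delete_fn(); /* re-enters the locked region on the same thread */
    if (acquired) {
        toy_unlock();
    }
}
```

Calling `toy_finalized_query()` on its own succeeds, but calling it through `locked_finalize(toy_finalized_query)` sets `query_would_block`, which is the point at which a real blocking lock would never return.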

This issue is low priority - if the delete function is being called, it's obvious that MPI is not yet finalized.

EDIT: It seems that the standard explicitly requires that MPI_Finalized return false in this situation, at least for MPI_COMM_SELF. This issue reproduces even for MPI_COMM_SELF (see the code below).

8.7.1 Allowing User Functions at Process Termination

There are times in which it would be convenient to have actions happen when an MPI process finishes. For example, a routine may do initializations that are useful until the MPI job (or that part of the job that being terminated in the case of dynamically created processes) is finished. This can be accomplished in MPI by attaching an attribute to MPI_COMM_SELF with a callback function. When MPI_FINALIZE is called, it will first execute the equivalent of an MPI_COMM_FREE on MPI_COMM_SELF. This will cause the delete callback function to be executed on all keys associated with MPI_COMM_SELF, in the reverse order that they were set on MPI_COMM_SELF. If no key has been attached to MPI_COMM_SELF, then no callback is invoked. The “freeing” of MPI_COMM_SELF occurs before any other parts of MPI are affected. Thus, for example, calling MPI_FINALIZED will return false in any of these callback functions. Once done with MPI_COMM_SELF, the order and rest of the actions taken by MPI_FINALIZE is not specified.
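The ordering rule in the quoted text (delete callbacks run in the reverse order in which the attributes were set) can be sketched with a small toy model. This is only an illustration of the rule and not how any MPI implementation stores attributes. The table type and the functions `attr_set`, `attr_free_all`, and `attr_record` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ATTRS 8

typedef void (*attr_delete_fn)(int keyval);

/* Hypothetical per-communicator attribute table, kept in set order. */
typedef struct {
    int keyvals[TOY_MAX_ATTRS];
    attr_delete_fn deleters[TOY_MAX_ATTRS];
    size_t count;
} attr_table;

/* Log of keyvals in the order their delete callbacks ran. */
static int attr_log[TOY_MAX_ATTRS];
static size_t attr_log_len = 0;

/* Example delete callback: records which keyval was deleted. */
void attr_record(int keyval) {
    assert(attr_log_len < TOY_MAX_ATTRS);
    attr_log[attr_log_len++] = keyval;
}

/* Appends an attribute; returns 0 on success, -1 if the table is full. */
int attr_set(attr_table *t, int keyval, attr_delete_fn fn) {
    if (t->count == TOY_MAX_ATTRS) {
        return -1;
    }
    t->keyvals[t->count] = keyval;
    t->deleters[t->count] = fn;
    t->count++;
    return 0;
}

/* Frees all attributes, most recently set first. */
void attr_free_all(attr_table *t) {
    while (t->count > 0) {
        t->count--;
        if (t->deleters[t->count] != NULL) {
            t->deleters[t->count](t->keyvals[t->count]);
        }
    }
}
```

Setting keyvals 10, 20, and 30 in that order and then calling `attr_free_all` runs the callbacks for 30, 20, and 10.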

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

> mpirun --version
mpirun (Open MPI) 3.0.0

Report bugs to http://www.open-mpi.org/community/help/

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

On Mac: Spack
On CentOS7: system provided OpenMPI

Please describe the system on which you are running

Operating system/version: MacOS High Sierra, CentOS7
Computer hardware: MacBook Pro 2017
Network type: N/A

Details of the problem

Here's a minimal repro:

mpi-hang.c:

#include <mpi.h>
#include <stdio.h>

int delete_function(MPI_Comm comm, int comm_keyval, void *attribute_val, void *extra_state) {
    int flag;
    MPI_Finalized(&flag); // hangs when called from MPI_Finalize
    return MPI_SUCCESS;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int keyval;
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_function, &keyval, NULL);

    MPI_Comm_set_attr(MPI_COMM_SELF, keyval, NULL);

    printf("%d: Calling MPI_Finalize...\n", rank);
    MPI_Finalize();
    printf("%d: MPI_Finalize completed\n", rank);

    return 0;
}

Compiled with:

> mpicc mpi-hang.c -o mpi-hang

Executed with:

> mpirun -np 4 ./mpi-hang

Expected results:

> mpirun -np 4 ./mpi-hang
0: Calling MPI_Finalize...
1: Calling MPI_Finalize...
2: Calling MPI_Finalize...
3: Calling MPI_Finalize...
0: MPI_Finalize completed
1: MPI_Finalize completed
2: MPI_Finalize completed
3: MPI_Finalize completed
>

Actual results:

> mpirun -np 4 ./mpi-hang
0: Calling MPI_Finalize...
1: Calling MPI_Finalize...
2: Calling MPI_Finalize...
3: Calling MPI_Finalize...

AndrewGaspar · 2018-04-19T23:48:54Z

I found some standardese that requires explicit behavior for MPI_Finalized when called inside a user function during MPI_Finalize. I've updated the original issue with this information.

jsquyres · 2018-04-24T15:02:27Z

You are correct; this is a bug. I have a PR coming shortly. Thanks!

AndrewGaspar · 2018-04-24T17:32:43Z

Thanks!

@AndrewGaspar

Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be invoked during an attribute destruction callback (e.g., during the destruction of keyvals on MPI_COMM_SELF during the very beginning of MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false".

Prior to this commit, we hung in FINALIZED if it were invoked during a COMM_SELF attribute destruction callback in FINALIZE. See open-mpi#5084.

This commit converts the MPI_INITIALIZED / MPI_FINALIZED infrastructure to use a single enum (ompi_mpi_state, set atomically) to represent the state of MPI:

- not initialized
- init started
- init completed
- finalize started
- finalize past COMM_SELF destruction
- finalize completed

The "finalize past COMM_SELF destruction" state is what allows us to return "false" from MPI_FINALIZED before COMM_SELF has been fully destroyed / all attribute callbacks have been invoked.

Since this state is checked at nearly every MPI API call (to see if we're outside of the INIT/FINALIZE epoch), care was taken to use atomics to *set* the ompi_mpi_state value in ompi_mpi_init() and ompi_mpi_finalize(), but performance-critical code paths can simply read the variable without needing to use a slow call to an opal_atomic_*() function.

Thanks to @AndrewGaspar for reporting the issue.

Signed-off-by: Jeff Squyres <[email protected]>
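The single-state design described in the commit message can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual Open MPI source. The enum constants and the `st_*` functions are hypothetical names, C11 `<stdatomic.h>` stands in for the `opal_atomic_*()` functions, and the two threshold comparisons are an inference from the description of the "finalize past COMM_SELF destruction" state. A relaxed load is used as the portable analogue of the plain read the commit message mentions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the six states listed in the commit message. */
typedef enum {
    ST_NOT_INITIALIZED = 0,
    ST_INIT_STARTED,
    ST_INIT_COMPLETED,
    ST_FINALIZE_STARTED,
    ST_FINALIZE_PAST_COMM_SELF_DESTRUCT,
    ST_FINALIZE_COMPLETED
} st_state_t;

static _Atomic int st_state = ST_NOT_INITIALIZED;

/* What the delete callback observed; starts true so a change is visible. */
static bool st_cb_saw_finalized = true;

typedef void (*st_delete_fn)(void);

/* Readers take no lock: they only compare the current state to a threshold. */
bool st_is_initialized(void) {
    return atomic_load_explicit(&st_state, memory_order_relaxed) >= ST_INIT_COMPLETED;
}

bool st_is_finalized(void) {
    return atomic_load_explicit(&st_state, memory_order_relaxed) >=
           ST_FINALIZE_PAST_COMM_SELF_DESTRUCT;
}

/* Example delete callback: queries the finalized state from inside finalize. */
void st_record_cb(void) { st_cb_saw_finalized = st_is_finalized(); }

/* Writers set the state atomically as each phase begins and ends. */
void st_init(void) {
    atomic_store(&st_state, ST_INIT_STARTED);
    atomic_store(&st_state, ST_INIT_COMPLETED);
}

void st_finalize(st_delete_fn delete_fn) {
    assert(delete_fn != NULL);
    atomic_store(&st_state, ST_FINALIZE_STARTED);
    delete_fn(); /* callbacks run while the state is still below the threshold */
    atomic_store(&st_state, ST_FINALIZE_PAST_COMM_SELF_DESTRUCT);
    atomic_store(&st_state, ST_FINALIZE_COMPLETED);
}
```

Because the query is a read of one variable, calling it from a delete callback cannot re-enter a lock, and it reports "not finalized" until the callbacks have run.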

@AndrewGaspar

Per MPI-3.1:8.7.1 p361:11-13, it's valid for MPI_FINALIZED to be invoked during an attribute destruction callback (e.g., during the destruction of keyvals on MPI_COMM_SELF during the very beginning of MPI_FINALIZE). In such cases, MPI_FINALIZED must return "false".

Prior to this commit, we hung in FINALIZED if it were invoked during a COMM_SELF attribute destruction callback in FINALIZE. See open-mpi#5084.

This commit converts the MPI_INITIALIZED / MPI_FINALIZED infrastructure to use a single enum (ompi_mpi_state, set atomically) to represent the state of MPI:

- not initialized
- init started
- init completed
- finalize started
- finalize past COMM_SELF destruction
- finalize completed

The "finalize past COMM_SELF destruction" state is what allows us to return "false" from MPI_FINALIZED before COMM_SELF has been fully destroyed / all attribute callbacks have been invoked.

Since this state is checked at nearly every MPI API call (to see if we're outside of the INIT/FINALIZE epoch), care was taken to use atomics to *set* the ompi_mpi_state value in ompi_mpi_init() and ompi_mpi_finalize(), but performance-critical code paths can simply read the variable without needing to use a slow call to an opal_atomic_*() function.

Thanks to @AndrewGaspar for reporting the issue.

Signed-off-by: Jeff Squyres <[email protected]>

(cherry picked from commit 35438ae)

jsquyres · 2018-06-12T15:19:22Z

Merged on all the release branches.

Fixed!

AndrewGaspar · 2018-06-12T21:11:08Z

I don't envy you the task. :)

jsquyres added the bug label Apr 24, 2018

jsquyres self-assigned this Apr 24, 2018

jsquyres mentioned this issue Apr 24, 2018

mpi/finalized: don't hang if called during MPI_FINALIZE #5092

Merged

jsquyres added Target: v2.x Target: v3.0.x Target: v3.1.x Target: main and removed Target: main labels Jun 3, 2018

jsquyres mentioned this issue Jun 3, 2018

v2.x: Fix MPI_FINALIZED_HANG #5217

Merged

This was referenced Jun 3, 2018

v3.0.x: mpi/finalized: revamp INITIALIZED/FINALIZED #5219

Merged

v3.1.x: mpi/finalized: revamp INITIALIZED/FINALIZED #5220

Merged

AndrewGaspar closed this as completed Jun 12, 2018
