Description
I brought this up with Edgar already; we're looking into it, but I wanted to file a ticket as well so that the issue isn't forgotten.
I ran across a "bad" error scenario in OMPI: when no topo modules are selected, we should fail gracefully (i.e., invoke an MPI exception). Instead, we abort on an assert failure: OBJ_RELEASE is called on what looks like random memory that does not have a valid magic ID.
You can force this error to occur if you change ompi/communicator/comm.c:1342 from
if (NULL == new_comm->c_topo_comm) {
to the following, which artificially forces the error path every time:
if (1 || NULL == new_comm->c_topo_comm) {
This causes OBJ_RELEASE(new_comm) to be invoked, and the abort follows shortly thereafter. The problem seems to be in the communicator destructor, which invokes:
if (NULL != comm->c_local_group) {
    ompi_group_decrement_proc_count (comm->c_local_group);
    OBJ_RELEASE ( comm->c_local_group );
In ompi_group_decrement_proc_count(), it essentially does this:
for (proc = 0; proc < group->grp_proc_count; proc++) {
    proc_pointer = ompi_group_peer_lookup(group,proc);
    OBJ_RELEASE(proc_pointer);
}
This OBJ_RELEASE of proc_pointer is what triggers the abort:
MPI_Cart_map_c: group/group_init.c:227: ompi_group_decrement_proc_count: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (proc_pointer))->obj_magic_id' failed.
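For anyone not familiar with the check, here is a minimal sketch of how a magic-ID test tells a constructed object apart from memory that was never constructed. Only the constant is taken from the assertion message above; sketch_object_t and the sketch_* functions are hypothetical stand-ins, not the real opal_object_t or OBJ_RELEASE, and the release here returns false where the real code would hit the assertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same constant that appears in the assertion message. */
#define SKETCH_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical object header; NOT the real opal_object_t layout. */
typedef struct {
    uint64_t obj_magic_id;
    int refcount;
} sketch_object_t;

/* Construction stamps the magic ID and takes the first reference. */
static void sketch_construct(sketch_object_t *obj) {
    assert(NULL != obj);
    obj->obj_magic_id = SKETCH_MAGIC_ID;
    obj->refcount = 1;
}

/* True only for memory that went through sketch_construct(). */
static bool sketch_magic_ok(const sketch_object_t *obj) {
    return NULL != obj && SKETCH_MAGIC_ID == obj->obj_magic_id;
}

/* Drops one reference. Returns false instead of aborting when the
 * magic ID does not match; the assertion in the ticket fires at the
 * equivalent point. */
static bool sketch_release(sketch_object_t *obj) {
    if (!sketch_magic_ok(obj)) {
        return false;
    }
    obj->refcount--;
    if (0 == obj->refcount) {
        obj->obj_magic_id = 0;  /* mark as destructed */
    }
    return true;
}
```

Under these assumptions, releasing a pointer that was looked up from a group whose proc entries were never filled in fails the same way a release of arbitrary memory would.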
Edgar thinks that the topo comm creation is already "special" and we're probably just missing a step or haven't protected the destructor properly. He's probably right -- we'll keep digging.
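As a starting point for discussion, one way to protect the destructor might look like the sketch below. This is not the actual OMPI code or a proposed patch: the sketch_* types and the procs_retained flag are hypothetical, and the assumption is that the communicator can record whether the proc reference counts were ever incremented before the error path ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified types; not the real ompi_proc_t / ompi_group_t. */
typedef struct {
    int refcount;
} sketch_proc_t;

typedef struct {
    int grp_proc_count;
    sketch_proc_t **grp_procs;  /* NULL until the group is populated */
} sketch_group_t;

typedef struct {
    sketch_group_t *c_local_group;
    bool procs_retained;  /* hypothetical: set once proc counts were incremented */
} sketch_comm_t;

/* Guarded destructor: only walks the proc list if setup got far enough
 * to retain the procs. Returns how many proc references were dropped. */
static int sketch_comm_destruct(sketch_comm_t *comm) {
    int released = 0;
    assert(NULL != comm);
    if (NULL != comm->c_local_group) {
        sketch_group_t *group = comm->c_local_group;
        if (comm->procs_retained && NULL != group->grp_procs) {
            for (int proc = 0; proc < group->grp_proc_count; proc++) {
                if (NULL != group->grp_procs[proc]) {
                    group->grp_procs[proc]->refcount--;
                    released++;
                }
            }
        }
        comm->c_local_group = NULL;  /* avoid a second pass over the group */
    }
    return released;
}
```

With a guard like this, a communicator that failed before its group was fully set up would skip the per-proc release entirely, while a fully constructed one would behave as before.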