topo communicator error segv #89

Open
@ompiteam

I brought this up with Edgar already; we're looking into it, but I wanted to file a ticket as well so that the issue isn't forgotten.

I ran across a "bad" error scenario in OMPI: when an error occurs because no topo modules are selected, we should fail gracefully (i.e., invoke an MPI exception). Instead, we abort on an assert failure, because we call OBJ_RELEASE on random memory that does not have a valid magic ID.

You can force this error to occur if you change ompi/communicator/comm.c:1342 from

   if (NULL == new_comm->c_topo_comm) {

to the following, which artificially forces the error path every time:

   if (1 || NULL == new_comm->c_topo_comm) {

This causes OBJ_RELEASE(new_comm) to be invoked, and the process aborts shortly thereafter. The problem seems to be in the communicator destructor, which invokes:

   if (NULL != comm->c_local_group) {
       ompi_group_decrement_proc_count (comm->c_local_group);
       OBJ_RELEASE ( comm->c_local_group );

In ompi_group_decrement_proc_count(), it essentially does this:

   for (proc = 0; proc < group->grp_proc_count; proc++) {
     proc_pointer = ompi_group_peer_lookup(group,proc);
     OBJ_RELEASE(proc_pointer);
   }

This OBJ_RELEASE of proc_pointer is what triggers the abort:

MPI_Cart_map_c: group/group_init.c:227: ompi_group_decrement_proc_count: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (proc_pointer))->obj_magic_id' failed.
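For anyone unfamiliar with the magic-ID check, here is a minimal standalone sketch of the idea behind that assertion. It is not OPAL's implementation: `toy_object_t`, `toy_construct`, `toy_is_live`, and `toy_release` are hypothetical names, and only the magic constant is taken from the assertion message above. The sketch returns an error code where the real code asserts, so that the failure can be observed without aborting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same constant that appears in the assertion message above. */
#define TOY_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for an object header; not the real opal_object_t layout. */
typedef struct toy_object {
    uint64_t magic_id;   /* set by the constructor, cleared on destruction */
    int      ref_count;
} toy_object_t;

/* Mark the memory as a live object holding one reference. */
static void toy_construct(toy_object_t *obj)
{
    assert(NULL != obj);
    obj->magic_id  = TOY_MAGIC_ID;
    obj->ref_count = 1;
}

/* Returns 1 if the memory looks like a constructed object, 0 otherwise. */
static int toy_is_live(const toy_object_t *obj)
{
    return NULL != obj && TOY_MAGIC_ID == obj->magic_id;
}

/* Drop one reference. Returns 0 on success, or -1 if the pointer does not
 * reference a constructed object (the case where the real code asserts). */
static int toy_release(toy_object_t *obj)
{
    if (!toy_is_live(obj)) {
        return -1;
    }
    if (0 == --obj->ref_count) {
        obj->magic_id = 0;   /* poison so that a later release is caught */
    }
    return 0;
}
```

In this sketch, releasing a pointer that was never constructed (or was already destroyed) is reported instead of silently corrupting a reference count, which matches the symptom seen in ompi_group_decrement_proc_count().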

Edgar thinks that the topo comm creation is already "special" and we're probably just missing a step or haven't protected the destructor properly. He's probably right -- we'll keep digging.
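One possible shape for "protecting the destructor" is sketched below. This is an illustration only, not the actual fix and not OMPI code: `toy_proc_t`, `toy_group_t`, the `procs_retained` flag, and both functions are hypothetical. The assumption is that the error path destroys a communicator whose group never had its proc reference counts incremented, so the decrement must be skipped in that case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a proc and a group; not the real OMPI types. */
typedef struct toy_proc {
    int ref_count;
} toy_proc_t;

typedef struct toy_group {
    toy_proc_t **procs;
    int          proc_count;
    int          procs_retained;  /* hypothetical flag: 1 once every proc was retained */
} toy_group_t;

/* Take one reference on every proc and record that we did so. */
static void toy_group_retain_procs(toy_group_t *group)
{
    assert(NULL != group);
    for (int i = 0; i < group->proc_count; i++) {
        group->procs[i]->ref_count++;
    }
    group->procs_retained = 1;
}

/* Destructor-side counterpart: drop the references only if they were taken.
 * Returns the number of procs that were released. */
static int toy_group_release_procs(toy_group_t *group)
{
    if (NULL == group || !group->procs_retained) {
        return 0;   /* partially constructed group: nothing to undo */
    }
    int released = 0;
    for (int i = 0; i < group->proc_count; i++) {
        group->procs[i]->ref_count--;
        released++;
    }
    group->procs_retained = 0;
    return released;
}
```

With a guard like this, tearing down a half-built object is a no-op rather than a walk over uninitialized proc pointers. Whether the real problem is a missing retain step or an unguarded release is still open.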
