MPI-2 dynamic openib/rdmacm work request queue flush error #95

Open
@ompiteam


When running the cxx dynamics test, it *sometimes* fails with the following message (seen on the v1.4 branch; I did not test the trunk extensively enough to know whether it happens there too):

[[56157,8],1][btl_openib_component.c:2951:handle_wc] from svbu-mpi042 to: svbu-mpi042 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 12750656 opcode 0  vendor error 249 qp_idx 0

There are no error messages before this. It always fails in the MPI::ARGV_NULL test, but I don't know if that means anything. The test fails this way in about 1 out of every 10 or 20 runs.

It's an odd error, because a FLUSHED event should only occur if some other error previously occurred that caused the flush.
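To make that expectation concrete, here is a minimal sketch of the check one might run over a batch of completions from a single QP: look for the first status that is neither success nor a flush. The `wc_status` enum and `struct wc` are local stand-ins, not the libibverbs types; only the numeric values are meant to mirror `enum ibv_wc_status` (5 matches the "status number 5" in the log above). `find_root_cause` is a hypothetical helper, not anything in the openib BTL.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins (assumption): values mirror libibverbs' enum ibv_wc_status. */
enum wc_status {
    WC_SUCCESS      = 0,
    WC_LOC_PROT_ERR = 4,  /* an example of a "real" error */
    WC_WR_FLUSH_ERR = 5   /* "status number 5" in the log */
};

/* Hypothetical, trimmed-down completion record. */
struct wc {
    unsigned long  wr_id;
    enum wc_status status;
};

/*
 * Return the index of the first completion whose status is a genuine
 * error (neither success nor flush), or -1 if there is none.
 * A batch containing flushes but returning -1 is the odd case
 * described above: a flush with no visible preceding error.
 */
long find_root_cause(const struct wc *wcs, size_t n)
{
    assert(wcs != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (wcs[i].status != WC_SUCCESS &&
            wcs[i].status != WC_WR_FLUSH_ERR) {
            return (long) i;
        }
    }
    return -1;
}
```

In this report the equivalent search comes up empty, which is why the flush looks suspicious: whatever moved the QP into the error state did not surface as a completion error on this side.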

FWIW, in my testing, I *once* got a segv in the "connect" spawned child process in rdmacm_component_finalize:

    for (item = opal_list_remove_first(&server_listener_list);
         NULL != item;
         item = opal_list_remove_first(&server_listener_list)) {
        rdmacm_contents_t *contents = (rdmacm_contents_t*) item;
        item2 = opal_list_remove_first(&(contents->ids));
        OBJ_RELEASE(item2);

gdb on the core dump showed that item2 was NULL. I'm not quite sure how that could happen! This only happened once in dozens of runs that I tried... but it *did* happen.
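One reading that would fit the core dump: opal_list_remove_first returns NULL when the list is empty, so a NULL item2 would mean contents->ids had no entries by the time finalize ran. That is only a guess at the cause. The sketch below uses a hypothetical stand-in list (`list_t`, `node_t`, `list_remove_first`, `release_first` are all made up for illustration, not the OPAL API) to show the guarded pattern that would avoid the crash, though it would not explain why the list was empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an OPAL list item and list. */
typedef struct node {
    struct node *next;
    int          refcount;
} node_t;

typedef struct {
    node_t *head;
} list_t;

/* Like opal_list_remove_first: returns NULL if the list is empty. */
node_t *list_remove_first(list_t *l)
{
    node_t *n = l->head;
    if (NULL != n) {
        l->head = n->next;
        n->next = NULL;
    }
    return n;
}

/*
 * Remove the first id and drop one reference, but only if there is one.
 * Returns 1 if an item was released, 0 if the list was already empty.
 */
int release_first(list_t *ids)
{
    assert(ids != NULL);
    node_t *item2 = list_remove_first(ids);
    if (NULL == item2) {
        return 0;          /* empty list: nothing to release */
    }
    item2->refcount--;     /* stand-in for OBJ_RELEASE(item2) */
    return 1;
}
```

A guard like this would only hide the symptom; the more interesting question is which path empties (or never populates) the ids list before finalize.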

It's quite possible that the rdmacm CPC is required to make this error occur.
