
v2.x: opal_mutex_unlock: Operation not permitted errors #1586

Closed

Description

@jsquyres

Seeing this error in multiple v2.x runs, in different contexts. It seems to indicate that we have an errant mutex unlock somewhere. The stack traces vary from run to run; for example, the stack trace corresponding to the first run is:

00:48:23 + taskset -c 16,17 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/bin/mpirun -np 2 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/thread_tests/thread-tests-1.1/latency_th 8
00:48:23 opal_mutex_unlock: Operation not permitted
00:48:23 [jenkins01:17237] *** Process received signal ***
00:48:23 [jenkins01:17237] Signal: Aborted (6)
00:48:23 [jenkins01:17237] Signal code:  (-6)
00:48:23 [jenkins01:17237] [ 0] /lib64/libpthread.so.0[0x3d6980f710]
00:48:23 [jenkins01:17237] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d69032925]
00:48:23 [jenkins01:17237] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d69034105]
00:48:23 [jenkins01:17237] [ 3] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/openmpi/mca_oob_ud.so(+0x3efb)[0x7ffff4542efb]
00:48:23 [jenkins01:17237] [ 4] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/openmpi/mca_oob_ud.so(+0x5f5b)[0x7ffff4544f5b]
00:48:23 [jenkins01:17237] [ 5] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-rte.so.20(+0x78e54)[0x7ffff79dae54]
00:48:23 [jenkins01:17237] [ 6] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-rte.so.20(orte_oob_base_set_addr+0x300)[0x7ffff79dab0d]
00:48:23 [jenkins01:17237] [ 7] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x53c)[0x7ffff76b6f2c]
00:48:23 [jenkins01:17237] [ 8] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-pal.so.20(+0x398c7)[0x7ffff765a8c7]
00:48:23 [jenkins01:17237] [ 9] /lib64/libpthread.so.0[0x3d698079d1]
00:48:23 [jenkins01:17237] [10] /lib64/libc.so.6(clone+0x6d)[0x3d690e8b6d]
00:48:23 [jenkins01:17237] *** End of error message ***
00:48:23 --------------------------------------------------------------------------
00:48:23 mpirun noticed that process rank 0 with PID 0 on node jenkins01 exited on signal 6 (Aborted).
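
For context, "Operation not permitted" is strerror(EPERM), which is what pthread_mutex_unlock() returns on an error-checking mutex when the calling thread does not actually hold the lock; presumably that is what the --enable-debug build is catching before it aborts. A minimal standalone sketch of that behavior (illustrative only, not OPAL's actual mutex code):

/*
 * Sketch: an error-checking pthread mutex rejects an errant unlock
 * with EPERM ("Operation not permitted") instead of silently
 * corrupting the lock state.  Uses only standard POSIX threads.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t m;
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    /* Unlock a mutex this thread never locked: the error-checking
       type reports the mistake rather than letting it pass. */
    int rc = pthread_mutex_unlock(&m);
    if (rc != 0) {
        fprintf(stderr, "opal_mutex_unlock-style error: %s\n",
                strerror(rc));   /* prints "Operation not permitted" */
    }

    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

So the abort itself is only the symptom; the real bug is whichever code path releases (or double-releases) a lock it does not own.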

The stack traces corresponding to some of the runs in the 2nd set are:

  • Version: v2.x-dev-1304-g1005306
  • Configure: "CFLAGS=-g -pipe" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --enable-mpi-cxx --disable-dlopen --enable-mpi-thread-multiple

These runs get pretty much the same output:

  • mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl sm,tcp,self dynamic/loop_spawn
  • mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl vader,tcp,self dynamic/loop_spawn
  • mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl tcp,self --mca mpi_leave_pinned 1 dynamic/loop_spawn
  • mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl tcp,self --mca mpi_leave_pinned_pipeline 1 dynamic/loop_spawn
[...lots of output...]
opal_mutex_unlock: Operation not permitted
[mpi012:05235] *** Process received signal ***
[mpi012:05235] Signal: Aborted (6)
[mpi012:05235] Signal code:  (-6)
[mpi012:05235] [ 0] /lib64/libpthread.so.0[0x3e20a0f710]
[mpi012:05235] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3e20632925]
[mpi012:05235] [ 2] /lib64/libc.so.6(abort+0x175)[0x3e20634105]
[mpi012:05235] [ 3] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57164)[0x2aaaab3f8164]
[mpi012:05235] [ 4] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57232)[0x2aaaab3f8232]
[mpi012:05235] [ 5] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_buffer+0xc7)[0x2aaaab3f85df]
[mpi012:05235] [ 6] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_sizet+0x9a)[0x2aaaab3f9176]
[mpi012:05235] [ 7] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_buffer+0xf3)[0x2aaaab3f860b]
[mpi012:05235] [ 8] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack+0x199)[0x2aaaab3f84fa]
[mpi012:05235] [ 9] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_sec_base_validate+0x15e)[0x2aaaab54e175]
[mpi012:05235] [10] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_peer_recv_connect_ack+0xa6d)[0x2aaaab0bcc97]
[mpi012:05235] [11] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_recv_handler+0xc5)[0x2aaaab0bf78d]
[mpi012:05235] [12] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf716b)[0x2aaaab49816b]
[mpi012:05235] [13] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf727a)[0x2aaaab49827a]
[mpi012:05235] [14] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf7547)[0x2aaaab498547]
[mpi012:05235] [15] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x298)[0x2aaaab498b9a]
[mpi012:05235] [16] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x4d01c)[0x2aaaab3ee01c]
[mpi012:05235] [17] /lib64/libpthread.so.0[0x3e20a079d1]
[mpi012:05235] [18] /lib64/libc.so.6(clone+0x6d)[0x3e206e8b6d]
[mpi012:05235] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi012 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Another set of runs generates a different mutex warning (perhaps these can shed a clue?):

  • Version: v2.x-dev-1304-g1005306 (same as above)
  • Configure: "CFLAGS=-g -pipe" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --enable-mpi-cxx --disable-dlopen --enable-mpi-thread-multiple (same as above)

Runs with similar output:

  • Run: mpirun --oversubscribe -np 32 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl sm,tcp,self test_acc3
  • Run: mpirun --oversubscribe -np 32 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl vader,tcp,self test_acc3
================ test_acc3 ========== Tue Apr 26 07:20:54 2016
opal_mutex_lock(): Resource deadlock avoided
[mpi003:29881] *** Process received signal ***
[mpi003:29881] Signal: Aborted (6)
[mpi003:29881] Signal code:  (-6)
[mpi003:29881] [ 0] /lib64/libpthread.so.0[0x37ca40f710]
[mpi003:29881] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x37ca032925]
[mpi003:29881] [ 2] /lib64/libc.so.6(abort+0x175)[0x37ca034105]
[mpi003:29881] [ 3] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x2093d3)[0x2aaaaacb63d3]
[mpi003:29881] [ 4] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20b21f)[0x2aaaaacb821f]
[mpi003:29881] [ 5] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_recv_request_progress_match+0x222)[0x2aaaaacba432]
[mpi003:29881] [ 6] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_recv_req_start+0x3b4)[0x2aaaaacbb132]
[mpi003:29881] [ 7] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_start+0x2c7)[0x2aaaaacc1503]
[mpi003:29881] [ 8] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_osc_pt2pt_irecv_w_cb+0x106)[0x2aaaaac96d67]
[mpi003:29881] [ 9] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x1e7606)[0x2aaaaac94606]
[mpi003:29881] [10] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_osc_pt2pt_progress_pending_acc+0x116)[0x2aaaaac9509b]
[mpi003:29881] [11] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x1e56bb)[0x2aaaaac926bb]
[mpi003:29881] [12] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x1e6f1b)[0x2aaaaac93f1b]
[mpi003:29881] [13] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20a6fc)[0x2aaaaacb76fc]
[mpi003:29881] [14] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20b325)[0x2aaaaacb8325]
[mpi003:29881] [15] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20b39b)[0x2aaaaacb839b]
[mpi003:29881] [16] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20bdcf)[0x2aaaaacb8dcf]
[mpi003:29881] [17] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_recv_frag_callback_fin+0x74)[0x2aaaaacb5571]
[mpi003:29881] [18] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xe8276)[0x2aaaab489276]
[mpi003:29881] [19] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf716b)[0x2aaaab49816b]
[mpi003:29881] [20] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf727a)[0x2aaaab49827a]
[mpi003:29881] [21] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf7547)[0x2aaaab498547]
[mpi003:29881] [22] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x298)[0x2aaaab498b9a]
[mpi003:29881] [23] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_progress+0x85)[0x2aaaab3e8ce2]
[mpi003:29881] [24] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x7477f)[0x2aaaaab2177f]
[mpi003:29881] [25] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_request_default_wait_all+0x1c6)[0x2aaaaab21de3]
[mpi003:29881] [26] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_coll_base_reduce_generic+0x48f)[0x2aaaaabbea2d]
[mpi003:29881] [27] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_coll_base_reduce_intra_binomial+0x187)[0x2aaaaabbf63d]
[mpi003:29881] [28] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_coll_tuned_reduce_intra_dec_fixed+0x244)[0x2aaaaabf453f]
[mpi003:29881] [29] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_coll_basic_reduce_scatter_block_intra+0x13e)[0x2aaaaabce550] 
[mpi003:29881] *** End of error message ***
[mpi018][[53932,1],18][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
[mpi018][[53932,1],29][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi003 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
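
The "Resource deadlock avoided" variant above is the companion failure mode: that string is strerror(EDEADLK), which pthread_mutex_lock() returns on an error-checking mutex when the calling thread already holds the lock. A similarly minimal sketch (again illustrative, not OPAL code):

/*
 * Sketch: with PTHREAD_MUTEX_ERRORCHECK, re-locking a mutex the
 * calling thread already holds returns EDEADLK ("Resource deadlock
 * avoided"), i.e. a recursive lock attempt on a non-recursive lock
 * rather than an errant unlock.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t m;
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);
    int rc = pthread_mutex_lock(&m);   /* second lock from the same thread */
    if (rc != 0) {
        fprintf(stderr, "opal_mutex_lock-style error: %s\n",
                strerror(rc));         /* prints "Resource deadlock avoided" */
    }

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    pthread_mutexattr_destroy(&attr);
    return 0;
}

If that is what is happening here, the trace above (ompi_request_default_wait_all driving opal_progress back into the ob1/osc pt2pt callbacks) would be consistent with a lock being re-acquired by the same thread via the progress loop.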

More Cisco runs (same version / configure):

  • mpirun --oversubscribe -np 32 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl sm,tcp,self datatype/bottom
opal_mutex_unlock: Operation not permitted
[mpi012:29541] *** Process received signal ***
[mpi012:29541] Signal: Aborted (6)
[mpi012:29541] Signal code:  (-6)
[mpi012:29541] [ 0] /lib64/libpthread.so.0[0x3e20a0f710]
[mpi012:29541] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3e20632925]
[mpi012:29541] [ 2] /lib64/libc.so.6(abort+0x175)[0x3e20634105]
[mpi012:29541] [ 3] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57164)[0x2aaaab3f8164]
[mpi012:29541] [ 4] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57232)[0x2aaaab3f8232]
[mpi012:29541] [ 5] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_buffer+0xc7)[0x2aaaab3f85df]
[mpi012:29541] [ 6] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack+0x199)[0x2aaaab3f84fa]
[mpi012:29541] [ 7] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_sec_base_validate+0x3ac)[0x2aaaab54e3c3]
[mpi012:29541] [ 8] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_peer_recv_connect_ack+0xa6d)[0x2aaaab0bcc97]
[mpi012:29541] [ 9] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_recv_handler+0xc5)[0x2aaaab0bf78d]
[mpi012:29541] [10] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf716b)[0x2aaaab49816b]
[mpi012:29541] [11] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf727a)[0x2aaaab49827a]
[mpi012:29541] [12] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf7547)[0x2aaaab498547]
[mpi012:29541] [13] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x298)[0x2aaaab498b9a]
[mpi012:29541] [14] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x4d01c)[0x2aaaab3ee01c]
[mpi012:29541] [15] /lib64/libpthread.so.0[0x3e20a079d1]
[mpi012:29541] [16] /lib64/libc.so.6(clone+0x6d)[0x3e206e8b6d]
[mpi012:29541] *** End of error message ***
[mpi012:29523] [[838,0],0] usock_peer_send_blocking: send() to socket 39 failed: Broken pipe (32)
[mpi012:29523] [[838,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 315
[mpi012:29523] [[838,0],0]-[[838,1],12] usock_peer_accept: usock_peer_send_connect_ack failed
[mpi012:29523] [[838,0],0] usock_peer_send_blocking: send() to socket 30 failed: Broken pipe (32)
[mpi012:29523] [[838,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 315
[mpi012:29523] [[838,0],0]-[[838,1],11] usock_peer_accept: usock_peer_send_connect_ack failed
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 0 on node mpi012 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
