We are seeing this error in multiple v2.x runs, in different contexts. Examples:
- README: update list of frameworks ompi-release#1096 (comment)
- https://mtt.open-mpi.org/index.php?do_redir=2298 (search for the error string; it may be in the middle of the page)

This seems to indicate that we have an errant mutex unlock somewhere. The stack traces vary. For example, the stack trace corresponding to the first example is:
```
00:48:23 + taskset -c 16,17 timeout -s SIGSEGV 10m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/bin/mpirun -np 2 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/thread_tests/thread-tests-1.1/latency_th 8
00:48:23 opal_mutex_unlock: Operation not permitted
00:48:23 [jenkins01:17237] *** Process received signal ***
00:48:23 [jenkins01:17237] Signal: Aborted (6)
00:48:23 [jenkins01:17237] Signal code: (-6)
00:48:23 [jenkins01:17237] [ 0] /lib64/libpthread.so.0[0x3d6980f710]
00:48:23 [jenkins01:17237] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3d69032925]
00:48:23 [jenkins01:17237] [ 2] /lib64/libc.so.6(abort+0x175)[0x3d69034105]
00:48:23 [jenkins01:17237] [ 3] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/openmpi/mca_oob_ud.so(+0x3efb)[0x7ffff4542efb]
00:48:23 [jenkins01:17237] [ 4] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/openmpi/mca_oob_ud.so(+0x5f5b)[0x7ffff4544f5b]
00:48:23 [jenkins01:17237] [ 5] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-rte.so.20(+0x78e54)[0x7ffff79dae54]
00:48:23 [jenkins01:17237] [ 6] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-rte.so.20(orte_oob_base_set_addr+0x300)[0x7ffff79dab0d]
00:48:23 [jenkins01:17237] [ 7] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x53c)[0x7ffff76b6f2c]
00:48:23 [jenkins01:17237] [ 8] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace-2/ompi_install1/lib/libopen-pal.so.20(+0x398c7)[0x7ffff765a8c7]
00:48:23 [jenkins01:17237] [ 9] /lib64/libpthread.so.0[0x3d698079d1]
00:48:23 [jenkins01:17237] [10] /lib64/libc.so.6(clone+0x6d)[0x3d690e8b6d]
00:48:23 [jenkins01:17237] *** End of error message ***
00:48:23 --------------------------------------------------------------------------
00:48:23 mpirun noticed that process rank 0 with PID 0 on node jenkins01 exited on signal 6 (Aborted).
```
The stack traces corresponding to some of the runs at the second link are similar:
- Version: v2.x-dev-1304-g1005306
- Configure: `"CFLAGS=-g -pipe" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --enable-mpi-cxx --disable-dlopen --enable-mpi-thread-multiple`

These runs all produce essentially the same output:
```
mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl sm,tcp,self dynamic/loop_spawn
mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl vader,tcp,self dynamic/loop_spawn
mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl tcp,self --mca mpi_leave_pinned 1 dynamic/loop_spawn
mpirun --oversubscribe -np 1 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl tcp,self --mca mpi_leave_pinned_pipeline 1 dynamic/loop_spawn
[...lots of output...]
opal_mutex_unlock: Operation not permitted
[mpi012:05235] *** Process received signal ***
[mpi012:05235] Signal: Aborted (6)
[mpi012:05235] Signal code: (-6)
[mpi012:05235] [ 0] /lib64/libpthread.so.0[0x3e20a0f710]
[mpi012:05235] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3e20632925]
[mpi012:05235] [ 2] /lib64/libc.so.6(abort+0x175)[0x3e20634105]
[mpi012:05235] [ 3] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57164)[0x2aaaab3f8164]
[mpi012:05235] [ 4] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57232)[0x2aaaab3f8232]
[mpi012:05235] [ 5] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_buffer+0xc7)[0x2aaaab3f85df]
[mpi012:05235] [ 6] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_sizet+0x9a)[0x2aaaab3f9176]
[mpi012:05235] [ 7] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_buffer+0xf3)[0x2aaaab3f860b]
[mpi012:05235] [ 8] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack+0x199)[0x2aaaab3f84fa]
[mpi012:05235] [ 9] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_sec_base_validate+0x15e)[0x2aaaab54e175]
[mpi012:05235] [10] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_peer_recv_connect_ack+0xa6d)[0x2aaaab0bcc97]
[mpi012:05235] [11] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_recv_handler+0xc5)[0x2aaaab0bf78d]
[mpi012:05235] [12] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf716b)[0x2aaaab49816b]
[mpi012:05235] [13] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf727a)[0x2aaaab49827a]
[mpi012:05235] [14] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf7547)[0x2aaaab498547]
[mpi012:05235] [15] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x298)[0x2aaaab498b9a]
[mpi012:05235] [16] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x4d01c)[0x2aaaab3ee01c]
[mpi012:05235] [17] /lib64/libpthread.so.0[0x3e20a079d1]
[mpi012:05235] [18] /lib64/libc.so.6(clone+0x6d)[0x3e206e8b6d]
[mpi012:05235] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi012 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
```
Another set of runs generates a mutex warning (perhaps these can shed some light):
- Version: v2.x-dev-1304-g1005306 (same as above)
- Configure: `"CFLAGS=-g -pipe" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --enable-mpi-cxx --disable-dlopen --enable-mpi-thread-multiple` (same as above)

Runs with similar output:
```
mpirun --oversubscribe -np 32 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl sm,tcp,self test_acc3
mpirun --oversubscribe -np 32 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl vader,tcp,self test_acc3
```
```
================ test_acc3 ========== Tue Apr 26 07:20:54 2016
opal_mutex_lock(): Resource deadlock avoided
[mpi003:29881] *** Process received signal ***
[mpi003:29881] Signal: Aborted (6)
[mpi003:29881] Signal code: (-6)
[mpi003:29881] [ 0] /lib64/libpthread.so.0[0x37ca40f710]
[mpi003:29881] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x37ca032925]
[mpi003:29881] [ 2] /lib64/libc.so.6(abort+0x175)[0x37ca034105]
[mpi003:29881] [ 3] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x2093d3)[0x2aaaaacb63d3]
[mpi003:29881] [ 4] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20b21f)[0x2aaaaacb821f]
[mpi003:29881] [ 5] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_recv_request_progress_match+0x222)[0x2aaaaacba432]
[mpi003:29881] [ 6] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_recv_req_start+0x3b4)[0x2aaaaacbb132]
[mpi003:29881] [ 7] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_start+0x2c7)[0x2aaaaacc1503]
[mpi003:29881] [ 8] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_osc_pt2pt_irecv_w_cb+0x106)[0x2aaaaac96d67]
[mpi003:29881] [ 9] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x1e7606)[0x2aaaaac94606]
[mpi003:29881] [10] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_osc_pt2pt_progress_pending_acc+0x116)[0x2aaaaac9509b]
[mpi003:29881] [11] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x1e56bb)[0x2aaaaac926bb]
[mpi003:29881] [12] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x1e6f1b)[0x2aaaaac93f1b]
[mpi003:29881] [13] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20a6fc)[0x2aaaaacb76fc]
[mpi003:29881] [14] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20b325)[0x2aaaaacb8325]
[mpi003:29881] [15] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20b39b)[0x2aaaaacb839b]
[mpi003:29881] [16] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x20bdcf)[0x2aaaaacb8dcf]
[mpi003:29881] [17] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_pml_ob1_recv_frag_callback_fin+0x74)[0x2aaaaacb5571]
[mpi003:29881] [18] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xe8276)[0x2aaaab489276]
[mpi003:29881] [19] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf716b)[0x2aaaab49816b]
[mpi003:29881] [20] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf727a)[0x2aaaab49827a]
[mpi003:29881] [21] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf7547)[0x2aaaab498547]
[mpi003:29881] [22] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x298)[0x2aaaab498b9a]
[mpi003:29881] [23] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_progress+0x85)[0x2aaaab3e8ce2]
[mpi003:29881] [24] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(+0x7477f)[0x2aaaaab2177f]
[mpi003:29881] [25] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_request_default_wait_all+0x1c6)[0x2aaaaab21de3]
[mpi003:29881] [26] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_coll_base_reduce_generic+0x48f)[0x2aaaaabbea2d]
[mpi003:29881] [27] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_coll_base_reduce_intra_binomial+0x187)[0x2aaaaabbf63d]
[mpi003:29881] [28] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(ompi_coll_tuned_reduce_intra_dec_fixed+0x244)[0x2aaaaabf453f]
[mpi003:29881] [29] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libmpi.so.20(mca_coll_basic_reduce_scatter_block_intra+0x13e)[0x2aaaaabce550]
[mpi003:29881] *** End of error message ***
[mpi018][[53932,1],18][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
[mpi018][[53932,1],29][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv
failed: Connection reset by peer (104)
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node mpi003 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
```
More Cisco runs (same version / configure):
```
mpirun --oversubscribe -np 32 --mca orte_startup_timeout 10000 --mca coll ^ml --mca btl sm,tcp,self datatype/bottom
opal_mutex_unlock: Operation not permitted
[mpi012:29541] *** Process received signal ***
[mpi012:29541] Signal: Aborted (6)
[mpi012:29541] Signal code: (-6)
[mpi012:29541] [ 0] /lib64/libpthread.so.0[0x3e20a0f710]
[mpi012:29541] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x3e20632925]
[mpi012:29541] [ 2] /lib64/libc.so.6(abort+0x175)[0x3e20634105]
[mpi012:29541] [ 3] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57164)[0x2aaaab3f8164]
[mpi012:29541] [ 4] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x57232)[0x2aaaab3f8232]
[mpi012:29541] [ 5] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack_buffer+0xc7)[0x2aaaab3f85df]
[mpi012:29541] [ 6] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_dss_unpack+0x199)[0x2aaaab3f84fa]
[mpi012:29541] [ 7] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_sec_base_validate+0x3ac)[0x2aaaab54e3c3]
[mpi012:29541] [ 8] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_peer_recv_connect_ack+0xa6d)[0x2aaaab0bcc97]
[mpi012:29541] [ 9] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-rte.so.20(mca_oob_usock_recv_handler+0xc5)[0x2aaaab0bf78d]
[mpi012:29541] [10] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf716b)[0x2aaaab49816b]
[mpi012:29541] [11] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf727a)[0x2aaaab49827a]
[mpi012:29541] [12] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0xf7547)[0x2aaaab498547]
[mpi012:29541] [13] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x298)[0x2aaaab498b9a]
[mpi012:29541] [14] /home/mpiteam/scratches/community/2016-04-25cron/1sKf/installs/cC6L/install/lib/libopen-pal.so.20(+0x4d01c)[0x2aaaab3ee01c]
[mpi012:29541] [15] /lib64/libpthread.so.0[0x3e20a079d1]
[mpi012:29541] [16] /lib64/libc.so.6(clone+0x6d)[0x3e206e8b6d]
[mpi012:29541] *** End of error message ***
[mpi012:29523] [[838,0],0] usock_peer_send_blocking: send() to socket 39 failed: Broken pipe (32)
[mpi012:29523] [[838,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 315
[mpi012:29523] [[838,0],0]-[[838,1],12] usock_peer_accept: usock_peer_send_connect_ack failed
[mpi012:29523] [[838,0],0] usock_peer_send_blocking: send() to socket 30 failed: Broken pipe (32)
[mpi012:29523] [[838,0],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 315
[mpi012:29523] [[838,0],0]-[[838,1],11] usock_peer_accept: usock_peer_send_connect_ack failed
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 0 on node mpi012 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
```