Description
Background information
Package: Open MPI mtt@hpc-arm-01 Distribution
Open MPI: 2.1.2a1
Open MPI repo revision: v2.1.1-55-g4d82554
Open MPI release date: Unreleased developer copy
Open RTE: 2.1.2a1
Open RTE repo revision: v2.1.1-55-g4d82554
Open RTE release date: Unreleased developer copy
OPAL: 2.1.2a1
OPAL repo revision: v2.1.1-55-g4d82554
OPAL release date: Unreleased developer copy
- Operating system/version: RedHat 7.2
- Computer hardware: aarch64 (ARM 64-bit)
- Network type: InfiniBand
Details of the problem
The ctxalloc test hangs during MPI_Init.
Similar hangs are observed in many other tests.
ctxalloc can be found here
mpirun -np 192 -mca btl_openib_warn_default_gid_prefix 0 \
--bind-to core -mca pml ucx \
-x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 \
-x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 \
-mca mpi_add_procs_cutoff 100000 --map-by node \
ctxalloc 2 1500 100
Stack trace:
Thread 1 (Thread 0x3ffb800a480 (LWP 27563)):
#0 0x000003ffb7ccfa74 in nanosleep () from /usr/lib64/libpthread.so.0
#1 0x000003ffb7911e0c in _opal_lifo_release_cpu () at ../opal/class/opal_lifo.h:195
#2 0x000003ffb7911e40 in opal_lifo_pop_atomic (lifo=0x620520) at ../opal/class/opal_lifo.h:210
#3 0x000003ffb7911fd0 in opal_free_list_get_st (flist=0x620520) at ../opal/class/opal_free_list.h:213
#4 0x000003ffb7911ff4 in opal_free_list_get (flist=0x620520) at ../opal/class/opal_free_list.h:225
#5 0x000003ffb7912264 in opal_rb_tree_init (tree=0x6204e0, comp=0x3ffb4671ae4 <mca_mpool_rb_hugepage_compare>) at class/opal_rb_tree.c:86
#6 0x000003ffb4671d44 in mca_mpool_hugepage_module_init (mpool=0x620410, huge_page=0x61ef30) at mpool_hugepage_module.c:107
#7 0x000003ffb4672a10 in mca_mpool_hugepage_open () at mpool_hugepage_component.c:166
#8 0x000003ffb7948ab4 in open_components (framework=0x3ffb7a34788 <opal_mpool_base_framework>) at mca_base_components_open.c:117
#9 0x000003ffb79489e4 in mca_base_framework_components_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_components_open.c:65
#10 0x000003ffb79bd2a0 in mca_mpool_base_open (flags=MCA_BASE_OPEN_DEFAULT) at base/mpool_base_frame.c:89
#11 0x000003ffb7957d1c in mca_base_framework_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:174
#12 0x000003ffb7d59bb8 in ompi_mpi_init (argc=4, argv=0x3ffffffdc38, requested=0, provided=0x3ffffffda2c) at runtime/ompi_mpi_init.c:589
#13 0x000003ffb7d990e0 in PMPI_Init (argc=0x3ffffffdaac, argv=0x3ffffffdaa0) at pinit.c:66
#14 0x0000000000400c5c in main (argc=4, argv=0x3ffffffdc38) at ctxalloc.c:20
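
For readers unfamiliar with the code path in frames #1 and #2, the following is a simplified sketch of a lock-free LIFO pop that retries and periodically sleeps. It is not the Open MPI source: all names (sketch_lifo_t, sketch_lifo_pop, and so on) are hypothetical, it uses C11 compare-and-swap rather than whatever atomics the aarch64 build selects, the retry count and sleep interval are assumed values, and it ignores the ABA problem. It only illustrates the general pattern, in which the loop returns only when the list is empty or the atomic update succeeds, so a thread whose update never succeeds would keep cycling through the sleep call, which would look like the trace above.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical types; not the Open MPI definitions. */
typedef struct sketch_item {
    struct sketch_item *next;
} sketch_item_t;

typedef struct {
    _Atomic(sketch_item_t *) head;  /* equals &ghost when the list is empty */
    sketch_item_t ghost;            /* sentinel marking the end of the list */
} sketch_lifo_t;

static void sketch_lifo_init(sketch_lifo_t *lifo)
{
    assert(lifo != NULL);
    lifo->ghost.next = &lifo->ghost;
    atomic_init(&lifo->head, &lifo->ghost);
}

static void sketch_lifo_push(sketch_lifo_t *lifo, sketch_item_t *item)
{
    sketch_item_t *old = atomic_load(&lifo->head);
    do {
        item->next = old;  /* on CAS failure, old is refreshed with the current head */
    } while (!atomic_compare_exchange_weak(&lifo->head, &old, item));
}

/* Sleep briefly so that other threads can make progress (assumed interval). */
static void sketch_release_cpu(void)
{
    struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 };
    nanosleep(&interval, NULL);
}

static sketch_item_t *sketch_lifo_pop(sketch_lifo_t *lifo)
{
    int attempt = 0;
    sketch_item_t *item = atomic_load(&lifo->head);

    for (;;) {
        if (item == &lifo->ghost) {
            return NULL;  /* list is empty */
        }
        if (atomic_compare_exchange_weak(&lifo->head, &item, item->next)) {
            return item;  /* head swapped to the next element */
        }
        /* The update failed and item now holds the current head.
         * After a few failed attempts, give up the CPU and retry. */
        if (++attempt == 5) {
            sketch_release_cpu();
            attempt = 0;
        }
    }
}
```

In a single-threaded run, pushing two items and popping three times returns the items in reverse order and then NULL. The sketch does not reproduce the hang; it only shows where an unbounded retry loop of this shape could spin.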