Hang in mca_mpool_hugepage_module_init() on ARM64 #3697

Open
@yosefe

Background information

                 Package: Open MPI mtt@hpc-arm-01 Distribution
                Open MPI: 2.1.2a1
  Open MPI repo revision: v2.1.1-55-g4d82554
   Open MPI release date: Unreleased developer copy
                Open RTE: 2.1.2a1
  Open RTE repo revision: v2.1.1-55-g4d82554
   Open RTE release date: Unreleased developer copy
                    OPAL: 2.1.2a1
      OPAL repo revision: v2.1.1-55-g4d82554
       OPAL release date: Unreleased developer copy
  • Operating system/version: RedHat 7.2
  • Computer hardware: aarch64 (ARM 64-bit)
  • Network type: InfiniBand

Details of the problem

The ctxalloc test hangs during MPI_Init.
A similar hang is observed in many other tests.
The ctxalloc source can be found here.

mpirun -np 192 -mca btl_openib_warn_default_gid_prefix 0 \
  --bind-to core -mca pml ucx \
  -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 \
  -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 \
   -mca mpi_add_procs_cutoff 100000 --map-by node \
ctxalloc 2 1500 100

Stack trace:

Thread 1 (Thread 0x3ffb800a480 (LWP 27563)):
#0  0x000003ffb7ccfa74 in nanosleep () from /usr/lib64/libpthread.so.0
#1  0x000003ffb7911e0c in _opal_lifo_release_cpu () at ../opal/class/opal_lifo.h:195
#2  0x000003ffb7911e40 in opal_lifo_pop_atomic (lifo=0x620520) at ../opal/class/opal_lifo.h:210
#3  0x000003ffb7911fd0 in opal_free_list_get_st (flist=0x620520) at ../opal/class/opal_free_list.h:213
#4  0x000003ffb7911ff4 in opal_free_list_get (flist=0x620520) at ../opal/class/opal_free_list.h:225
#5  0x000003ffb7912264 in opal_rb_tree_init (tree=0x6204e0, comp=0x3ffb4671ae4 <mca_mpool_rb_hugepage_compare>) at class/opal_rb_tree.c:86
#6  0x000003ffb4671d44 in mca_mpool_hugepage_module_init (mpool=0x620410, huge_page=0x61ef30) at mpool_hugepage_module.c:107
#7  0x000003ffb4672a10 in mca_mpool_hugepage_open () at mpool_hugepage_component.c:166
#8  0x000003ffb7948ab4 in open_components (framework=0x3ffb7a34788 <opal_mpool_base_framework>) at mca_base_components_open.c:117
#9  0x000003ffb79489e4 in mca_base_framework_components_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_components_open.c:65
#10 0x000003ffb79bd2a0 in mca_mpool_base_open (flags=MCA_BASE_OPEN_DEFAULT) at base/mpool_base_frame.c:89
#11 0x000003ffb7957d1c in mca_base_framework_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:174
#12 0x000003ffb7d59bb8 in ompi_mpi_init (argc=4, argv=0x3ffffffdc38, requested=0, provided=0x3ffffffda2c) at runtime/ompi_mpi_init.c:589
#13 0x000003ffb7d990e0 in PMPI_Init (argc=0x3ffffffdaac, argv=0x3ffffffdaa0) at pinit.c:66
#14 0x0000000000400c5c in main (argc=4, argv=0x3ffffffdc38) at ctxalloc.c:20
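
Frames #0 to #2 suggest that the pop loop in opal_lifo_pop_atomic is retrying and periodically sleeping instead of returning. As a rough illustration of that general pattern (this is not the Open MPI source), a minimal C11 sketch of a lock-free LIFO pop with back-off might look like the following. All `sketch_*` names, the retry threshold of 5, and the 100 ns pause are assumptions made up for the example, and the sketch ignores the ABA problem.

```c
#define _POSIX_C_SOURCE 199309L  /* for nanosleep() */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical item and stack types; not the Open MPI definitions. */
typedef struct sketch_item {
    struct sketch_item *next;
} sketch_item_t;

typedef struct {
    _Atomic(sketch_item_t *) head;  /* NULL means the stack is empty */
} sketch_lifo_t;

static void sketch_lifo_init(sketch_lifo_t *lifo)
{
    assert(lifo != NULL);
    atomic_init(&lifo->head, NULL);
}

/* Sleep briefly so that a competing thread can make progress. */
static void sketch_release_cpu(void)
{
    const struct timespec pause = { .tv_sec = 0, .tv_nsec = 100 };
    nanosleep(&pause, NULL);
}

static void sketch_lifo_push(sketch_lifo_t *lifo, sketch_item_t *item)
{
    assert(lifo != NULL && item != NULL);
    sketch_item_t *old = atomic_load(&lifo->head);
    do {
        item->next = old;  /* link to the head we expect to replace */
    } while (!atomic_compare_exchange_weak(&lifo->head, &old, item));
}

static sketch_item_t *sketch_lifo_pop(sketch_lifo_t *lifo)
{
    assert(lifo != NULL);
    int attempt = 0;
    sketch_item_t *item = atomic_load(&lifo->head);

    while (item != NULL) {
        /* Try to swing the head to the next item. The weak form may
         * fail spuriously on LL/SC architectures such as ARM64. */
        if (atomic_compare_exchange_weak(&lifo->head, &item, item->next)) {
            item->next = NULL;
            return item;
        }
        /* On failure, item has been reloaded with the current head.
         * Back off after every fifth failed attempt. */
        if (++attempt == 5) {
            sketch_release_cpu();
            attempt = 0;
        }
    }
    return NULL;  /* empty stack: return immediately, no sleeping */
}
```

In a sketch like this, a pop on an empty stack returns NULL without ever reaching the back-off path. A thread that stays in the sleep-and-retry path, as the trace shows, would therefore suggest that the atomic update keeps failing or that the head never reads as empty.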
