ADIOS2 v2.7.1: Spack Build on Summit Breaks at scale #2846

@ax3l

After #2836 was fixed, we observe new problems on Summit with MPI at scale. The problems do not show up on a few dozen nodes but appear on runs with 224 nodes or more.

The corresponding OLCF ticket is OLCFHELP-3545.

We see the problem both with the system-provided gcc-9.3.0 module adios2/2.7.1 and with a self-built module without SST.

Error message:

 0: ./warpx_sp() [0x104cd82c]
    amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at ??:?

 1: ./warpx_sp() [0x104d0070]
    amrex::BLBackTrace::handler(int) at ??:?

 2: linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x2000000504d8]
    ?? ??:0

 3: /lib64/power9/libc.so.6(gsignal+0xd8) [0x20000a733618]

 4: /lib64/power9/libc.so.6(abort+0x164) [0x20000a713a2c]

 5: /lib64/power9/libc.so.6(+0x36f70) [0x20000a726f70]

 6: /lib64/power9/libc.so.6(__assert_fail+0x64) [0x20000a727014]

 7: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(_ZN4CCMI9Protocols7Barrier26MultiLeaderBarrierFactoryTINS1_19MultiLeaderBarrierTIN7LibColl10Interfaces15NativeInterfaceELNS4_15topologyIndex_tE0EEENS_17ConnectionManager13SimpleConnMgrEE8cb_asyncEPvSC_PKvjjmPPNS4_13PipeWorkQueueEPPFvSC_SC_16libcoll_result_tEPSC_+0x39c) [0x20000f350c7c]
    ?? ??:0

 8: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(_ZN7LibColl18NativeInterfaceP2PILb1ELb0EE20dispatch_mcast_shortEPvS2_PKvmS4_mjP11pami_recv_t+0x80) [0x20000f3bd3d0]
    ?? ??:0

 9: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE1EE15dispatch_packedEPvSP_mSP_SP_+0x4c) [0x20000ed0255c]
    ?? ??:0

10: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libpami.so.3(PAMI_Context_advancev+0x6a0) [0x20000ecf9ec0]
    ?? ??:0

11: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(LIBCOLL_Advance_pami+0x34) [0x20000f2efb14]
    ?? ??:0

12: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(LIBCOLL_Advance+0x18) [0x20000f2da128]
    ?? ??:0

13: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/spectrum_mpi/mca_coll_ibm.so(start_libcoll_blocking_collective+0x120) [0x20000f1efe50]
    ?? ??:0

14: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/spectrum_mpi/mca_coll_ibm.so(mca_coll_ibm_barrier+0x70) [0x20000f1f51d0]
    ?? ??:0

15: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/libmpi_ibm.so.3(MPI_Barrier+0x104) [0x20000a100b34]
    ?? ??:0

16: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core_mpi.so.2(_ZNK6adios26helper11CommImplMPI7BarrierERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x24) [0x20000a947e34]
    ?? ??:0

17: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZNK6adios26helper4Comm7BarrierERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x2c) [0x20000acd9f4c]
    ?? ??:0

18: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios212transportman12TransportMan13MkDirsBarrierERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS8_EERKS2_ISt3mapIS8_S8_St4lessIS8_ESaISt4pairIKS8_S8_EEESaISK_EEb+0x10c) [0x20000af8357c]
    ?? ??:0

19: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4Writer14InitTransportsEv+0x25c) [0x20000ad4132c]
    ?? ??:0

20: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4Writer4InitEv+0x78) [0x20000ad41ee8]
    ?? ??:0

21: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4WriterC2ERNS0_2IOERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeENS_6helper4CommE+0x308) [0x20000ad42228]
    ?? ??:0

22: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core2IO10MakeEngineINS0_6engine9BP4WriterEEESt10shared_ptrINS0_6EngineEERS1_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeENS_6helper4CommE+0xa0) [0x20000ac32be0]
    ?? ??:0

23: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core_mpi.so.2(_ZNSt17_Function_handlerIFSt10shared_ptrIN6adios24core6EngineEERNS2_2IOERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS1_4ModeENS1_6helper4CommEEPSI_E9_M_invokeERKSt9_Any_dataS6_SE_OSF_OSH_+0x74) [0x20000a9458a4]
    ?? ??:0

24: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core2IO4OpenERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeENS_6helper4CommE+0x540) [0x20000ac2c870]
    ?? ??:0

25: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core2IO4OpenERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeE+0x78) [0x20000ac2d418]
    ?? ??:0

26: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/libadios2_cxx11.so.2(_ZN6adios22IO4OpenERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeE+0x168) [0x200009f41498]
    ?? ??:0

27: ./warpx_sp() [0x1067ac80]
    openPMD::detail::BufferedActions::getEngine() at ??:?

28: ./warpx_sp() [0x10689e14]
    openPMD::detail::BufferedActions::configure_IO(openPMD::ADIOS2IOHandlerImpl&) at ??:?

29: ./warpx_sp() [0x1068bc14]
    openPMD::detail::BufferedActions::BufferedActions(openPMD::ADIOS2IOHandlerImpl&, openPMD::InvalidatableFile) at ??:?

30: ./warpx_sp() [0x1068c07c]
    openPMD::ADIOS2IOHandlerImpl::getFileData(openPMD::InvalidatableFile, openPMD::ADIOS2IOHandlerImpl::IfFileNotOpen) at ??:?

31: ./warpx_sp() [0x1068cad4]
    openPMD::ADIOS2IOHandlerImpl::createFile(openPMD::Writable*, openPMD::Parameter<(openPMD::Operation)0> const&) at ??:?

32: ./warpx_sp() [0x10652d40]
    openPMD::AbstractIOHandlerImpl::flush() at ??:?

33: ./warpx_sp() [0x1068679c]
    openPMD::ADIOS2IOHandlerImpl::flush() at ??:?

34: ./warpx_sp() [0x10686944]
    openPMD::ADIOS2IOHandler::flush() at ??:?

35: ./warpx_sp() [0x105e8384]
    openPMD::SeriesInterface::flushFileBased(std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >) at ??:?

36: ./warpx_sp() [0x105e9320]
    openPMD::SeriesInterface::flush_impl(std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, openPMD::FlushLevel, bool) at ??:?

37: ./warpx_sp() [0x105e98f8]
    openPMD::SeriesInterface::advance(openPMD::AdvanceMode, openPMD::internal::AttributableData&, std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, openPMD::Iteration&) at ??:?

38: ./warpx_sp() [0x105a4ca8]
    openPMD::Iteration::beginStep() at ??:?

39: ./warpx_sp() [0x105efad0]
    openPMD::WriteIterations::operator[](unsigned long&&) at ??:?

40: ./warpx_sp() [0x1013f834]
    WarpXOpenPMDPlot::WriteOpenPMDFieldsAll(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amrex::Vector<amrex::MultiFab, std::allocator<amrex::MultiFab> > const&, amrex::Vector<amrex::Geometry, std::allocator<amrex::Geometry> >&, int, double, bool, amrex::Geometry const&) const at ??:?

41: ./warpx_sp() [0x101be97c]
    FlushFormatOpenPMD::WriteToFile(amrex::Vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, amrex::Vector<amrex::MultiFab, std::allocator<amrex::MultiFab> > const&, amrex::Vector<amrex::Geometry, std::allocator<amrex::Geometry> >&, amrex::Vector<int, std::allocator<int> >, double, amrex::Vector<ParticleDiag, std::allocator<ParticleDiag> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool, bool, int, amrex::Geometry const&, bool) const at ??:?

42: ./warpx_sp() [0x100f21c8]
    FullDiagnostics::Flush(int) at ??:?

43: ./warpx_sp() [0x100effa4]
    Diagnostics::FilterComputePackFlush(int, bool) at ??:?

44: ./warpx_sp() [0x100f7ed0]
    MultiDiagnostics::FilterComputePackFlush(int, bool) at ??:?

45: ./warpx_sp() [0x102a4a94]
    WarpX::InitData() at ??:?

46: ./warpx_sp() [0x10035d38]
    main at ??:?

47: /lib64/power9/libc.so.6(+0x24078) [0x20000a714078]

48: /lib64/power9/libc.so.6(__libc_start_main+0xb4) [0x20000a714264]
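
To read a trace like the one above at a glance, it can help to collapse consecutive frames by the binary they belong to. The following is a minimal sketch and not part of the original report: the helper name `summarize_backtrace` is hypothetical, and the sample lines are a subset of the frames above with their install paths abbreviated to `...` for brevity.

```python
import os
import re
from itertools import groupby

# Matches frame lines such as "15: /path/libmpi_ibm.so.3(MPI_Barrier+0x104) [0x...]".
# Symbol lines ("    ?? ??:0", "    main at ??:?") carry no frame number and are skipped.
FRAME_RE = re.compile(r"^\s*(\d+):\s+(\S+?)\(")


def summarize_backtrace(text):
    """Return [(binary_name, first_frame, last_frame), ...] in trace order."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), os.path.basename(match.group(2))))

    summary = []
    # groupby only merges adjacent frames, so the call order is preserved.
    for name, group in groupby(frames, key=lambda frame: frame[1]):
        numbers = [number for number, _ in group]
        summary.append((name, numbers[0], numbers[-1]))
    return summary


# Assumption: abbreviated excerpt of the trace above ("..." replaces the long prefixes).
SAMPLE = """\
14: .../lib/spectrum_mpi/mca_coll_ibm.so(mca_coll_ibm_barrier+0x70) [0x20000f1f51d0]
    ?? ??:0
15: .../lib/libmpi_ibm.so.3(MPI_Barrier+0x104) [0x20000a100b34]
    ?? ??:0
19: .../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4Writer14InitTransportsEv+0x25c) [0x20000ad4132c]
    ?? ??:0
20: .../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4Writer4InitEv+0x78) [0x20000ad41ee8]
    ?? ??:0
27: ./warpx_sp() [0x1067ac80]
    openPMD::detail::BufferedActions::getEngine() at ??:?
"""

for name, first, last in summarize_backtrace(SAMPLE):
    print(f"frames {first:>2}-{last:<2} {name}")
```

Applied to the full trace, this kind of grouping makes the layering easier to see: the assertion fires inside the Spectrum MPI collectives reached from `MPI_Barrier`, which is called from the ADIOS2 BP4 writer setup, which in turn is entered from openPMD and WarpX.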

cc @mpbelhorn @chuckatkins @pnorbert
