Closed
Description
After #2836 was fixed, we observe new problems with MPI on Summit at scale. They do not show up on a few dozen nodes but appear on runs of 224 nodes and above.
The corresponding OLCF ticket is OLCFHELP-3545.
We see the problem with both the system-provided adios2/2.7.1 module (gcc-9.3.0) and a self-built module without SST.
Error message:
0: ./warpx_sp() [0x104cd82c]
amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at ??:?
1: ./warpx_sp() [0x104d0070]
amrex::BLBackTrace::handler(int) at ??:?
2: linux-vdso64.so.1(__kernel_sigtramp_rt64+0) [0x2000000504d8]
?? ??:0
3: /lib64/power9/libc.so.6(gsignal+0xd8) [0x20000a733618]
4: /lib64/power9/libc.so.6(abort+0x164) [0x20000a713a2c]
5: /lib64/power9/libc.so.6(+0x36f70) [0x20000a726f70]
6: /lib64/power9/libc.so.6(__assert_fail+0x64) [0x20000a727014]
7: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(_ZN4CCMI9Protocols7Barrier26MultiLeaderBarrierFactoryTINS1_19MultiLeaderBarrierTIN7LibColl10Interfaces15NativeInterfaceELNS4_15topologyIndex_tE0EEENS_17ConnectionManager13SimpleConnMgrEE8cb_asyncEPvSC_PKvjjmPPNS4_13PipeWorkQueueEPPFvSC_SC_16libcoll_result_tEPSC_+0x39c) [0x20000f350c7c]
?? ??:0
8: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(_ZN7LibColl18NativeInterfaceP2PILb1ELb0EE20dispatch_mcast_shortEPvS2_PKvmS4_mjP11pami_recv_t+0x80) [0x20000f3bd3d0]
?? ??:0
9: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE1EE15dispatch_packedEPvSP_mSP_SP_+0x4c) [0x20000ed0255c]
?? ??:0
10: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libpami.so.3(PAMI_Context_advancev+0x6a0) [0x20000ecf9ec0]
?? ??:0
11: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(LIBCOLL_Advance_pami+0x34) [0x20000f2efb14]
?? ??:0
12: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/pami_port/libcollectives.so.3(LIBCOLL_Advance+0x18) [0x20000f2da128]
?? ??:0
13: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/spectrum_mpi/mca_coll_ibm.so(start_libcoll_blocking_collective+0x120) [0x20000f1efe50]
?? ??:0
14: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/spectrum_mpi/mca_coll_ibm.so(mca_coll_ibm_barrier+0x70) [0x20000f1f51d0]
?? ??:0
15: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-2s7kpbzydf6val7k2d3e6cz3zdhtcwlw/container/../lib/libmpi_ibm.so.3(MPI_Barrier+0x104) [0x20000a100b34]
?? ??:0
16: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core_mpi.so.2(_ZNK6adios26helper11CommImplMPI7BarrierERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x24) [0x20000a947e34]
?? ??:0
17: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZNK6adios26helper4Comm7BarrierERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x2c) [0x20000acd9f4c]
?? ??:0
18: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios212transportman12TransportMan13MkDirsBarrierERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS8_EERKS2_ISt3mapIS8_S8_St4lessIS8_ESaISt4pairIKS8_S8_EEESaISK_EEb+0x10c) [0x20000af8357c]
?? ??:0
19: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4Writer14InitTransportsEv+0x25c) [0x20000ad4132c]
?? ??:0
20: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4Writer4InitEv+0x78) [0x20000ad41ee8]
?? ??:0
21: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core6engine9BP4WriterC2ERNS0_2IOERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeENS_6helper4CommE+0x308) [0x20000ad42228]
?? ??:0
22: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core2IO10MakeEngineINS0_6engine9BP4WriterEEESt10shared_ptrINS0_6EngineEERS1_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeENS_6helper4CommE+0xa0) [0x20000ac32be0]
?? ??:0
23: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core_mpi.so.2(_ZNSt17_Function_handlerIFSt10shared_ptrIN6adios24core6EngineEERNS2_2IOERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS1_4ModeENS1_6helper4CommEEPSI_E9_M_invokeERKSt9_Any_dataS6_SE_OSF_OSH_+0x74) [0x20000a9458a4]
?? ??:0
24: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core2IO4OpenERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeENS_6helper4CommE+0x540) [0x20000ac2c870]
?? ??:0
25: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/../lib64/libadios2_core.so.2(_ZN6adios24core2IO4OpenERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeE+0x78) [0x20000ac2d418]
?? ??:0
26: /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/adios2-2.7.1-yivst3ulepk672qvfiduywxkg4rk2qwn/lib64/libadios2_cxx11.so.2(_ZN6adios22IO4OpenERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4ModeE+0x168) [0x200009f41498]
?? ??:0
27: ./warpx_sp() [0x1067ac80]
openPMD::detail::BufferedActions::getEngine() at ??:?
28: ./warpx_sp() [0x10689e14]
openPMD::detail::BufferedActions::configure_IO(openPMD::ADIOS2IOHandlerImpl&) at ??:?
29: ./warpx_sp() [0x1068bc14]
openPMD::detail::BufferedActions::BufferedActions(openPMD::ADIOS2IOHandlerImpl&, openPMD::InvalidatableFile) at ??:?
30: ./warpx_sp() [0x1068c07c]
openPMD::ADIOS2IOHandlerImpl::getFileData(openPMD::InvalidatableFile, openPMD::ADIOS2IOHandlerImpl::IfFileNotOpen) at ??:?
31: ./warpx_sp() [0x1068cad4]
openPMD::ADIOS2IOHandlerImpl::createFile(openPMD::Writable*, openPMD::Parameter<(openPMD::Operation)0> const&) at ??:?
32: ./warpx_sp() [0x10652d40]
openPMD::AbstractIOHandlerImpl::flush() at ??:?
33: ./warpx_sp() [0x1068679c]
openPMD::ADIOS2IOHandlerImpl::flush() at ??:?
34: ./warpx_sp() [0x10686944]
openPMD::ADIOS2IOHandler::flush() at ??:?
35: ./warpx_sp() [0x105e8384]
openPMD::SeriesInterface::flushFileBased(std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >) at ??:?
36: ./warpx_sp() [0x105e9320]
openPMD::SeriesInterface::flush_impl(std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, openPMD::FlushLevel, bool) at ??:?
37: ./warpx_sp() [0x105e98f8]
openPMD::SeriesInterface::advance(openPMD::AdvanceMode, openPMD::internal::AttributableData&, std::_Rb_tree_iterator<std::pair<unsigned long const, openPMD::Iteration> >, openPMD::Iteration&) at ??:?
38: ./warpx_sp() [0x105a4ca8]
openPMD::Iteration::beginStep() at ??:?
39: ./warpx_sp() [0x105efad0]
openPMD::WriteIterations::operator[](unsigned long&&) at ??:?
40: ./warpx_sp() [0x1013f834]
WarpXOpenPMDPlot::WriteOpenPMDFieldsAll(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amrex::Vector<amrex::MultiFab, std::allocator<amrex::MultiFab> > const&, amrex::Vector<amrex::Geometry, std::allocator<amrex::Geometry> >&, int, double, bool, amrex::Geometry const&) const at ??:?
41: ./warpx_sp() [0x101be97c]
FlushFormatOpenPMD::WriteToFile(amrex::Vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, amrex::Vector<amrex::MultiFab, std::allocator<amrex::MultiFab> > const&, amrex::Vector<amrex::Geometry, std::allocator<amrex::Geometry> >&, amrex::Vector<int, std::allocator<int> >, double, amrex::Vector<ParticleDiag, std::allocator<ParticleDiag> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool, bool, int, amrex::Geometry const&, bool) const at ??:?
42: ./warpx_sp() [0x100f21c8]
FullDiagnostics::Flush(int) at ??:?
43: ./warpx_sp() [0x100effa4]
Diagnostics::FilterComputePackFlush(int, bool) at ??:?
44: ./warpx_sp() [0x100f7ed0]
MultiDiagnostics::FilterComputePackFlush(int, bool) at ??:?
45: ./warpx_sp() [0x102a4a94]
WarpX::InitData() at ??:?
46: ./warpx_sp() [0x10035d38]
main at ??:?
47: /lib64/power9/libc.so.6(+0x24078) [0x20000a714078]
48: /lib64/power9/libc.so.6(__libc_start_main+0xb4) [0x20000a714264]
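To make the call path in the backtrace above easier to read, here is a small sketch that collapses consecutive frames by the binary or library they come from. It only assumes the frame layout seen in this trace (`N: path(symbol+offset) [0xaddr]`); the names `FRAME`, `frame_modules`, `collapse`, and `SAMPLE` are made up for this illustration, and the one long path in `SAMPLE` is shortened by hand.

```python
import re
from itertools import groupby

# Matches numbered frame lines such as
#   "4: /lib64/power9/libc.so.6(abort+0x164) [0x20000a713a2c]"
# Symbol-only lines ("... at ??:?") and "?? ??:0" lines do not match.
FRAME = re.compile(r"^\s*(\d+):\s+(\S+?)\((.*?)\)\s+\[0x[0-9a-f]+\]")


def frame_modules(trace):
    """Return a list of (frame_number, module_basename) for each frame line."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match:
            path = match.group(2)
            frames.append((int(match.group(1)), path.rsplit("/", 1)[-1]))
    return frames


def collapse(frames):
    """Collapse consecutive frames of one module into (module, first, last)."""
    runs = []
    for module, group in groupby(frames, key=lambda frame: frame[1]):
        numbers = [number for number, _ in group]
        runs.append((module, numbers[0], numbers[-1]))
    return runs


# A few frames copied from the trace; the libmpi_ibm path is shortened by hand.
SAMPLE = """\
3: /lib64/power9/libc.so.6(gsignal+0xd8) [0x20000a733618]
4: /lib64/power9/libc.so.6(abort+0x164) [0x20000a713a2c]
15: /sw/.../libmpi_ibm.so.3(MPI_Barrier+0x104) [0x20000a100b34]
?? ??:0
27: ./warpx_sp() [0x1067ac80]
openPMD::detail::BufferedActions::getEngine() at ??:?
"""

if __name__ == "__main__":
    for module, first, last in collapse(frame_modules(SAMPLE)):
        print(f"{first:>2}-{last:<2} {module}")
```

Run on the full trace, this kind of grouping should show the abort inside `libc`, reached from an assertion in `libcollectives` during `MPI_Barrier`, which ADIOS2 calls from `BP4Writer::InitTransports` via `MkDirsBarrier` when openPMD opens the file.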