Description
Describe the bug
ADIOS2 restart (opening a BP file read-only) in PIXIE3D exhausts memory and aborts with the error:
std::bad_alloc
adios2_open_new_comm
Memory per MPI rank at a fresh start is ~0.6 GB. Memory per MPI rank at restart (after ~4000 time steps on a 64x128x64 mesh) is > 4 GB, independent of the number of MPI ranks. With 32 MPI ranks, this exhausts the node memory (128 GB) and the code aborts.
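As a rough check of the numbers above, the aggregate footprint on one node can be estimated by multiplying the per-rank figure by the rank count. This is only a back-of-the-envelope sketch: `node_total_gb` is a hypothetical helper, the restart figure uses 4 GB as a lower bound (the observed value is larger), and it assumes all 32 ranks run on a single 128 GB node.

```python
# Back-of-the-envelope estimate of total memory on one node.
# Assumption: all ranks share a single 128 GB node.
NODE_MEMORY_GB = 128.0


def node_total_gb(ranks: int, per_rank_gb: float) -> float:
    """Aggregate memory used by `ranks` MPI ranks on one node."""
    return ranks * per_rank_gb


fresh_total = node_total_gb(32, 0.6)    # fresh start, ~0.6 GB per rank
restart_total = node_total_gb(32, 4.0)  # restart, at least 4 GB per rank

for label, total in (("fresh start", fresh_total), ("restart", restart_total)):
    status = "exhausts node" if total >= NODE_MEMORY_GB else "fits"
    print(f"{label}: {total:.1f} GB of {NODE_MEMORY_GB:.0f} GB ({status})")
```

Since 4 GB is only a lower bound on the restart footprint per rank, the real total for 32 ranks is above the node capacity, which is consistent with the `std::bad_alloc` abort.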
Memory with fresh start with 16 ranks:

Memory at restart (from BP file with ~4000 time slices) with 16 ranks:

Memory at restart per MPI rank seems independent of the number of MPI ranks. For example, with 8 ranks:

To Reproduce
I am at the latest master-branch commit, ff8326c. It is very hard to provide a "simple" example, as both PIXIE3D and a (very large) PIXIE3D BP restart file are required. The problem seems to be related to the number of time steps in the restart file, so providing a smaller restart file would not be very helpful. I am happy to provide access to both upon request.
Expected behavior
Memory per MPI rank at restart should be inversely proportional to the number of MPI ranks, to avoid overwhelming node memory.
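To make the expectation concrete, here is a small sketch contrasting the expected scaling with what is observed. The total restart footprint `TOTAL_RESTART_GB` is a purely illustrative value (not a measurement), and `expected_per_rank_gb` is a hypothetical helper; the observed figure of about 4 GB per rank is taken as a lower bound from the report above.

```python
# Expected vs. observed per-rank memory at restart.
# TOTAL_RESTART_GB is an illustrative assumption, not a measured value.
TOTAL_RESTART_GB = 32.0
OBSERVED_PER_RANK_GB = 4.0  # lower bound; roughly constant across rank counts


def expected_per_rank_gb(total_gb: float, ranks: int) -> float:
    """Per-rank memory if a fixed total footprint is split evenly."""
    return total_gb / ranks


for ranks in (8, 16, 32):
    expected = expected_per_rank_gb(TOTAL_RESTART_GB, ranks)
    print(f"{ranks:2d} ranks: expected {expected:.1f} GB/rank, "
          f"observed >= {OBSERVED_PER_RANK_GB:.1f} GB/rank")
```

Under the expected behavior, the aggregate footprint stays fixed as ranks are added; with the observed behavior, it grows linearly with the rank count.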
Desktop (please complete the following information):
- OS/Platform: Linux ba170.localdomain 3.10.0-1160.45.1.1chaos.ch6.x86_64
- Build: gcc 9.4.0, openmpi 3.1.6, shared libs
Additional context
The problem seems to be related to the number of time steps stored in the restart file: PIXIE3D has no issue restarting from BP files with the same underlying 3D mesh but fewer time steps.
Following up
Was the issue fixed? Please report back.