Closed
Description
Abstract out the BP5Writer metadata aggregation code into somewhere we can use it in other BP5 engines and potentially experiment with more enhancements.
Notes: The overall "metadata" actually has 3 components generated on every rank: attributes, meta-metadata and metadata. BP5 works fine if all of these components are stored and delivered to the reader. However, there are optimizations possible.
- The meta-metadata consists of a list of meta-metadata elements, each of which is an (ID, body) pair. The body will be the same across ranks and timesteps when the same variables are written across ranks and timesteps. The ID is a hash of the body, so we only need to deliver unique instances of these pairs. The elimination of duplicates happens on writer rank 0 in SST, and in the middle of the two-step metadata aggregation in BP5Writer (which helps avoid the 32-bit aggregation limit in MPI). We could instead aggregate the IDs first and then decide which rank sends each body as part of aggregation, which would provide more reduction than either BP5Writer or SST.
- ADIOS semantics say that attributes defined by any rank must be available to the reader, and some applications define the same attributes on every rank, resulting in N identical attribute blocks. We could detect this situation by handling them somewhat like meta-metadata: hash each block and then aggregate only the unique ones. This would work for many common situations (but would be defeated by applications doing things like including rank numbers in their attribute names). Is it easier just to educate users that they shouldn't define thousands of attributes on thousands of ranks?
- The actual metadata blocks must all eventually be delivered to the reader, but for large metadata and large numbers of ranks, the 32-bit limitation in MPI gather might be a problem. We could fall back to piece-wise aggregation in this circumstance.
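The "aggregate the IDs first, then decide who sends the bodies" idea from the first bullet can be sketched without MPI as below. This is only an illustration under assumptions: `MetaMetaElement`, `AssignSenders` and `BodiesToSend` are hypothetical names, not existing ADIOS2 code, the IDs are assumed to have already been gathered from every rank, and "lowest rank holding the ID sends the body" is just one possible policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one meta-metadata element: an (ID, body) pair.
struct MetaMetaElement
{
    std::uint64_t id; // hash of the body, assumed to be computed elsewhere
    std::string body;
};

// Phase 1: given the IDs gathered from every rank, pick one sender per
// unique ID. Assumed policy: the lowest rank that holds the ID sends it.
std::map<std::uint64_t, std::size_t>
AssignSenders(const std::vector<std::vector<std::uint64_t>> &idsPerRank)
{
    std::map<std::uint64_t, std::size_t> senderOf;
    for (std::size_t rank = 0; rank < idsPerRank.size(); ++rank)
    {
        for (std::uint64_t id : idsPerRank[rank])
        {
            // emplace does nothing if the ID already has a (lower) sender
            senderOf.emplace(id, rank);
        }
    }
    return senderOf;
}

// Phase 2: each rank keeps only the bodies it was chosen to send, so every
// unique body enters the body aggregation exactly once.
std::vector<MetaMetaElement>
BodiesToSend(std::size_t rank, const std::vector<MetaMetaElement> &local,
             const std::map<std::uint64_t, std::size_t> &senderOf)
{
    std::vector<MetaMetaElement> out;
    std::set<std::uint64_t> alreadyQueued; // guards against local duplicates
    for (const MetaMetaElement &e : local)
    {
        auto it = senderOf.find(e.id);
        if (it != senderOf.end() && it->second == rank &&
            alreadyQueued.insert(e.id).second)
        {
            out.push_back(e);
        }
    }
    return out;
}
```

The same two-phase scheme would apply to attribute blocks if each block were hashed and the hash used as its ID. In a real engine, phase 1 would need a gather and broadcast of the IDs, which are small and fixed-size compared with the bodies.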
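Final note: The MPI 32-bit limit is an element count in a gather (the count arguments are C `int`), not a byte count. One can aggregate 8 or 16 times the amount of data by using an 8-byte element type such as MPI_DOUBLE or a 16-byte complex type such as MPI_C_DOUBLE_COMPLEX (MPI_COMPLEX itself is normally the 8-byte Fortran single-precision complex), provided the things aggregated are suitably sized, i.e. padded to a multiple of the element size.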
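The arithmetic behind this note can be sketched as follows. The helper names `PaddedElementCount` and `FitsInIntCount` are hypothetical, and the sketch assumes the count is a signed 32-bit `int`, as in the classic (non-large-count) MPI gather interfaces. It does not call MPI. It only checks whether a payload of a given byte size fits once it is expressed in wider elements.

```cpp
#include <climits>
#include <cstdint>

// Number of elements of elemSize bytes needed to hold `bytes` bytes,
// rounding up so the payload is padded to a whole number of elements.
constexpr std::uint64_t PaddedElementCount(std::uint64_t bytes,
                                           std::uint64_t elemSize)
{
    return (bytes + elemSize - 1) / elemSize;
}

// True if the padded element count fits in an int count argument.
constexpr bool FitsInIntCount(std::uint64_t bytes, std::uint64_t elemSize)
{
    return PaddedElementCount(bytes, elemSize) <=
           static_cast<std::uint64_t>(INT_MAX);
}
```

For example, a 3 GiB payload does not fit when counted in bytes but does fit when counted in 8-byte elements. A 20 GiB payload needs 16-byte elements, and anything beyond roughly 32 GiB would still require the piece-wise fallback.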
franzpoeschel