Description
My application produces a ~25 TB ADIOS2 (2.10.2) checkpoint in a test run with 512 GPU nodes.
With the same settings (ADIOS2 at defaults), checkpoint writing is much slower on LUMI-G than on Frontier. Both systems share a similar architecture with AMD MI250X GPUs, so compute performance is more or less identical, but the key difference regarding I/O seems to be the Lustre OST count: `lfs df -h .` on scratch reports 1349 OSTs on Frontier and only 32 on LUMI-G.
ADIOS2 by default tries to write 1 subfile per node (so 512 subfiles for 512 nodes), which likely overloads LUMI-G's 32 OSTs. Changing the stripe size from the default 1 MB to 1 GB or 2 GB didn't help much.
I then tried to reduce the number of subfiles written by ADIOS2 by setting SubStreams to 32, but the checkpoint still comes out with 512 subfiles rather than 32.
Could I get some advice on:
- Why does my XML below fail to produce 32 subfiles?
- Are there recommended ADIOS2 configurations for multi-TB outputs on systems with few OSTs like LUMI-G?
- Any Lustre best practices (like striping) we should pair with the ADIOS2 settings?
I did contact LUMI-G support, but understandably they can't offer application-specific tuning, so I'll need to determine the optimal settings myself.
<?xml version="1.0"?>
<!-- XGC Config XML file for ADIOS2 -->

<adios-config>

    <!--==========================================
        Configuration for the Particle Writer
    ==========================================-->
    <!--
    <io name="particles">
        <engine type="BPFile">
            <parameter key="collectivemetadata" value="off"/>
            <parameter key="Profile" value="Off"/>
        </engine>
    </io>

    <io name="restart">
        <engine type="BPFile">
            <parameter key="collectivemetadata" value="off"/>
            <parameter key="Profile" value="Off"/>
            <parameter key="SubStreams" value="1024"/>
        </engine>
    </io>

    <io name="restartf0">
        <engine type="BPFile">
            <parameter key="collectivemetadata" value="off"/>
            <parameter key="Profile" value="Off"/>
        </engine>
    </io>
    -->

    <io name="restart">
        <engine type="BPFile">
            <parameter key="collectivemetadata" value="off"/>
            <parameter key="Profile" value="Off"/>
            <parameter key="SubStreams" value="32"/>
        </engine>
    </io>

</adios-config>
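
One variant that may be worth testing, given here only as an unverified sketch. It assumes that in ADIOS2 2.10.x the `BPFile` engine type resolves to BP5, and that BP5 takes its aggregation settings from the documented `NumAggregators` and `NumSubFiles` parameters, while `SubStreams` is the BP3/BP4 name and may simply be ignored. The value 32 is chosen only to match the OST count mentioned above, not as a tuned recommendation, and the `collectivemetadata` and `Profile` keys are left out because I am not sure BP5 recognizes them.

```xml
<?xml version="1.0"?>
<adios-config>

    <io name="restart">
        <!-- Assumption: naming BP5 explicitly avoids relying on what BPFile maps to in 2.10.x -->
        <engine type="BP5">
            <!-- Assumption: BP5 aggregation keys; 32 is illustrative (one per OST on LUMI-G) -->
            <parameter key="NumAggregators" value="32"/>
            <parameter key="NumSubFiles" value="32"/>
        </engine>
    </io>

</adios-config>
```

If this assumption is right, the `data.*` subfile count inside the output `.bp` directory should drop from 512 to 32, which would be a quick way to confirm whether the parameter names are the issue.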