@@ -6,6 +6,17 @@ User-Level Fault Mitigation (ULFM)
6
6
This chapter documents the features and options specific to the **User
7
7
Level Failure Mitigation (ULFM)** Open MPI implementation.
8
8
9
+ TL;DR
10
+ -----
11
+ This is an extremely terse summary of how to use ULFM:
12
+
13
+ .. code-block::
+
+    ./configure --with-ft=ulfm [...options...]
+    make [-j N] all install
+    mpicc my-ft-program.c -o my-ft-program
+    mpiexec -n 4 --with-ft ulfm my-ft-program
19
+
9
20
Features
10
21
--------
11
22
@@ -100,11 +111,12 @@ Available from: https://journals.sagepub.com/doi/10.1177/1094342013488238.
100
111
Building ULFM support in Open MPI
101
112
---------------------------------
102
113
103
- In Open MPI |ompi_ver |, ULFM support is **enabled by default ** |mdash |
104
- when you build Open MPI, unless you specify ``--without-ft ``, ULFM
105
- support will automatically be built.
114
+ In Open MPI |ompi_ver|, ULFM support is **built in by default**: when
+ you build Open MPI, ULFM support is automatically available unless you
+ specify ``--without-ft`` (but it remains inactive unless enabled at
+ runtime).
106
118
107
- Optionally, you can specify ``--with-ft `` to ensure that ULFM support
119
+ Optionally, you can specify ``--with-ft=ulfm`` to ensure that ULFM support
108
120
is definitely built.
109
121
110
122
Support notes
@@ -114,89 +126,6 @@ Support notes
114
126
that if you are going to use ULFM, you should disable building
115
127
OpenSHMEM with ``--disable-oshmem``.
116
128
117
- * SLURM is tested and supported with fault tolerance.
118
-
119
- .. important :: Do not use ``srun``, or your application gets killed
120
- by the scheduler upon the first failure. Instead,
121
- use ``mpirun `` in an ``salloc/sbatch `` allocation.
122
-
123
- * LSF is untested with fault tolerance.
124
-
125
- * PBS/Torque is tested and supported with fault tolerance.
126
-
127
- .. important :: Be sure to use ``mpirun`` in a ``qsub`` allocation.
128
-
129
- Modified, Untested and Disabled Components
130
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
131
-
132
- Frameworks and components which are not listed in the following list
133
- are unmodified and support fault tolerance. Listed frameworks may be
134
- **modified ** (and work after a failure), **untested ** (and work before
135
- a failure, but may malfunction after a failure), or **disabled ** (they
136
- cause unspecified behavior all around when FT is enabled).
137
-
138
- All runtime disabled components are listed in the ``ft-mpi `` aggregate
139
- MCA param file
140
- ``$installdir/share/openmpi/amca-param-sets/ft-mpi ``. You can tune the
141
- runtime behavior with ULFM by either setting or unsetting variables in
142
- this file (or by overiding the variable on the command line (e.g.,
143
- ``--mca btl ofi,self ``). Note that if fault tolerance is disabled at
144
- runtime, these components will load normally (this may change observed
145
- performance when comparing with and without fault tolerance).
146
-
147
- * ``pml ``: MPI point-to-point management layer
148
-
149
- * ``monitoring ``, ``v ``: **untested ** (they have not been modified
150
- to handle faults)
151
- * ``cm ``, ``crcpw ``, ``ucx ``: **disabled **
152
-
153
- * ``btl ``: Point-to-point Byte Transfer Layer
154
-
155
- * ``ofi ``, ``portals4 ``, ``smcuda ``, ``usnic ``, ``sm(+knem) ``:
156
- **untested ** (they may work properly, please report)
157
-
158
- * ``mtl ``: Matching transport layer Used for MPI point-to-point messages on
159
- some types of networks
160
-
161
- * All ``mtl `` components are **disabled **
162
-
163
- * ``coll ``: MPI collective algorithms
164
-
165
- * ``cuda ``, ``inter ``, ``sync ``, ``sm ``: **untested ** (they have not
166
- been modified to handle faults, but we expect correct post-fault
167
- behavior)
168
- * ``hcoll ``, ``portals4 `` **disabled ** (they have not been modified
169
- to handle faults, and we expect unspecified post-fault behavior)
170
-
171
- * ``osc ``: MPI one-sided communications
172
-
173
- * All ``osc `` components are **untested ** (they have not been
174
- modified to handle faults, and we expect unspecified post-fault
175
- behavior)
176
-
177
- * ``io ``: MPI I/O and dependent components
178
-
179
- * ``fs ``: File system functions for MPI I/O
180
- * ``fbtl ``: File byte transfer layer: abstraction for individual
181
- read/write operations for OMPIO
182
- * ``fcoll ``: Collective read and write operations for MPI I/O
183
- * ``sharedfp ``: Shared file pointer operations for MPI I/O
184
- * All components in these frameworks are unmodified, **untested **
185
- (we expect clean post-failure abort)
186
-
187
- * ``vprotocol ``: Checkpoint/Restart components
188
-
189
- * These components have not been modified to handle faults, and are
190
- **untested **.
191
-
192
- * ``threads ``, ``wait-sync ``: Multithreaded wait-synchronization
193
- object
194
-
195
- * ``argotbots ``, ``qthreads ``: **disabled ** (these components have
196
- not been modified to handle faults; we expect post-failure
197
- deadlock)
198
-
199
-
200
129
Running ULFM Open MPI
201
130
---------------------
202
131
@@ -215,12 +144,14 @@ Running your application
215
144
216
145
You can launch your application with fault tolerance by simply using
217
146
the normal Open MPI ``mpiexec`` launcher, with the
218
- ``--with-ft ulfm `` CLI option:
147
+ ``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):
219
148
220
149
.. code-block::

   shell$ mpirun --with-ft ulfm ...
223
152
153
+ .. important:: By default, fault tolerance is not active; it must be
+    enabled at runtime with ``--with-ft ulfm``.
154
+
224
155
Running under a batch scheduler
225
156
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
226
157
@@ -231,6 +162,18 @@ process fails. In order to avoid this problem, it is preferred that
231
162
you use ``mpiexec`` within an allocation (e.g., ``salloc``,
``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).
233
164
165
+ * SLURM is tested and supported with fault tolerance.
+
+   .. important:: Do not use ``srun``, or your application gets killed
+      by the scheduler upon the first failure. Instead,
+      use ``mpirun`` in an ``salloc``/``sbatch`` allocation.
+
+ * LSF is untested with fault tolerance.
+
+ * PBS/Torque is tested and supported with fault tolerance.
+
+   .. important:: Be sure to use ``mpirun`` in a ``qsub`` allocation.
176
+
234
177
Run-time tuning knobs
235
178
^^^^^^^^^^^^^^^^^^^^^
236
179
@@ -240,12 +183,21 @@ most cases. You can change the default settings with ``--mca
240
183
mpi_ft_foo <value>`` for Open MPI options, and with ``--prtemca
errmgr_detector_bar <value>`` for PRTE options.
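
As a hedged illustration of the two option families (not a recommended
configuration), the following command line passes one Open MPI level option
with ``--mca`` and one PRTE level option with ``--prtemca``. The option names
and values are the ones documented in this section; ``my-ft-program`` is a
placeholder application name.

.. code-block::

   shell$ mpiexec -n 4 --with-ft ulfm \
              --mca mpi_ft_verbose 1 \
              --prtemca errmgr_detector_priority 1005 \
              my-ft-program
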
242
185
186
+ .. important:: The main control for enabling/disabling fault tolerance
+    at runtime is the ``--with-ft ulfm`` (or its synonym
+    ``--with-ft mpi``) ``mpiexec`` CLI option. This option
+    sets up multiple subsystems of Open MPI to enable fault
+    tolerance. The options described below are best used to
+    override the default behavior after the ``--with-ft ulfm``
+    option is used.
193
+
243
194
PRTE level options
244
195
~~~~~~~~~~~~~~~~~~
245
196
246
- * ``prrte_enable_recovery <true|false> (default: false) `` controls
197
+ * ``prrte_enable_ft <true|false> (default: false)`` controls
247
198
automatic cleanup of apps with failed processes within
248
- mpirun. Enabling this option also enables ``mpi_ft_enable ``.
199
+   mpirun. This option is automatically set to ``true`` when using
+   ``--with-ft ulfm``.
249
201
* ``errmgr_detector_priority <int> (default: 1005)`` selects the
  PRRTE-based failure detector. Only available when
  ``prte_enable_recovery`` is ``true``. You can set this to ``0`` when
@@ -263,17 +215,33 @@ PRTE level options
263
215
Open MPI level options
264
216
~~~~~~~~~~~~~~~~~~~~~~
265
217
266
- * ``mpi_ft_enable <true|false> (default: same as
267
- prrte_enable_recovery) `` permits turning on/off fault tolerance at
268
- runtime. When false, failure detection is disabled; Interfaces
269
- defined by the fault tolerance extensions are substituted with dummy
270
- non-fault tolerant implementations (e.g., ``MPIX_Comm_agree `` is
271
- implemented with ``MPI_Allreduce ``); All other controls below become
272
- irrelevant.
218
+ Using ``mpiexec --with-ft ulfm`` applies default values to several Open
+ MPI parameters. These defaults are obtained from the ``ft-mpi``
+ aggregate MCA param file
+ ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
+ runtime behavior with ULFM by either setting or unsetting variables in
+ this file, or by overriding the variable on the command line (e.g.,
+ ``--mca btl ofi,self``).
225
+
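
The following sketch shows what such an override could look like. It assumes
the usual ``name = value`` syntax of MCA parameter files; the
``mpi_ft_verbose`` setting is only an illustrative choice and is not claimed
to be part of the shipped ``ft-mpi`` file.

.. code-block::

   # Hypothetical line added to
   # $installdir/share/openmpi/amca-param-sets/ft-mpi
   mpi_ft_verbose = 1

The same value can be set for a single run on the command line with
``--mca mpi_ft_verbose 1``.
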
226
+ .. important:: Note that if fault tolerance is disabled at runtime
+    (that is, when not using ``--with-ft ulfm``), the
+    ``ft-mpi`` MCA param file is not loaded, thus
+    components that are unsafe for fault tolerance will
+    load normally (this may change observed performance
+    when comparing with and without fault tolerance).
232
+
233
+ * ``mpi_ft_enable <true|false> (default: false)``
+   permits turning on/off fault tolerance at runtime. This option is
+   automatically set to ``true`` from the aggregate MCA param file
+   ``ft-mpi`` loaded when using ``--with-ft ulfm``. When false, failure
+   detection is disabled; interfaces defined by the fault tolerance extensions
+   are substituted with dummy non-fault tolerant implementations (e.g.,
+   ``MPIX_Comm_agree`` is implemented with ``MPI_Allreduce``); all other
+   controls below become irrelevant.
273
241
* ``mpi_ft_verbose <int> (default: 0)`` increases the output of the
274
242
fault tolerance activities. A value of 1 will report detected
275
243
failures.
276
- * ``mpi_ft_detector <true|false> (default: false) ``, **EXPERIMENTAL **
244
+ * ``mpi_ft_detector <true|false> (default: false)``, **DEPRECATED**
277
245
controls the activation of the Open MPI level failure detector. When
278
246
this detector is turned off, all failure detection is delegated to
279
247
PRTE (see above). The Open MPI level fault detector is
@@ -291,22 +259,99 @@ Open MPI level options
291
259
  latency (typically 1us increase). You may want to **enable this
  option if you experience false positive** processes incorrectly
  reported as failed with the Open MPI failure detector.
+   This option is only relevant when ``mpi_ft_detector`` is ``true``.
294
263
* ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
  period. Recommended value is 1/3 of the timeout. *Values lower than
  100us may impart a noticeable effect on latency (typically a 3us
  increase).*
+   This option is only relevant when ``mpi_ft_detector`` is ``true``.
298
268
* ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
  timeout (i.e. failure detection speed). Recommended value is 3 times
  the heartbeat period.
+   This option is only relevant when ``mpi_ft_detector`` is ``true``.
301
272
302
273
Known Limitations in ULFM
303
- ^^^^^^^^^^^^^^^^^^^^^^^^^
274
+ -------------------------
304
275
305
276
* InfiniBand support is provided through the UCT BTL; fault tolerant
306
277
operation over the UCX PML is not yet supported for production runs.
307
278
* TOPO, FILE, RMA are not fault tolerant. They are expected to work
308
279
properly before the occurrence of the first failure.
309
280
281
+ Modified, Untested and Disabled Components
282
+ ------------------------------------------
283
+
284
+ Frameworks and components that are not listed below
+ are unmodified and support fault tolerance. Listed frameworks may be
+ **modified** (and work after a failure), **untested** (and work before
+ a failure, but may malfunction after a failure), or **disabled** (they
+ cause unspecified behavior all around when FT is enabled).
+
+ All runtime disabled components are listed in the ``ft-mpi`` aggregate
+ MCA param file
+ ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
+ runtime behavior with ULFM by either setting or unsetting variables in
+ this file, or by overriding the variable on the command line (e.g.,
+ ``--mca btl ofi,self``).
+
+ .. important:: Note that if fault tolerance is disabled at runtime,
+    the ``ft-mpi`` MCA param file is not loaded, thus
+    components that are unsafe for fault tolerance will
+    load normally (this may change observed performance
+    when comparing with and without fault tolerance).
302
+
303
+ * ``pml``: MPI point-to-point management layer
+
+   * ``monitoring``, ``v``: **untested** (they have not been modified
+     to handle faults)
+   * ``cm``, ``crcpw``, ``ucx``: **disabled**
+
+ * ``btl``: Point-to-point Byte Transfer Layer
+
+   * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``:
+     **untested** (they may work properly, please report)
+
+ * ``mtl``: Matching transport layer, used for MPI point-to-point messages on
+   some types of networks
+
+   * All ``mtl`` components are **disabled**
+
+ * ``coll``: MPI collective algorithms
+
+   * ``cuda``, ``inter``, ``sync``, ``sm``: **untested** (they have not
+     been modified to handle faults, but we expect correct post-fault
+     behavior)
+   * ``hcoll``, ``portals4``: **disabled** (they have not been modified
+     to handle faults, and we expect unspecified post-fault behavior)
+
+ * ``osc``: MPI one-sided communications
+
+   * All ``osc`` components are **untested** (they have not been
+     modified to handle faults, and we expect unspecified post-fault
+     behavior)
+
+ * ``io``: MPI I/O and dependent components
+
+   * ``fs``: File system functions for MPI I/O
+   * ``fbtl``: File byte transfer layer: abstraction for individual
+     read/write operations for OMPIO
+   * ``fcoll``: Collective read and write operations for MPI I/O
+   * ``sharedfp``: Shared file pointer operations for MPI I/O
+   * All components in these frameworks are unmodified, **untested**
+     (we expect clean post-failure abort)
+
+ * ``vprotocol``: Checkpoint/Restart components
+
+   * These components have not been modified to handle faults, and are
+     **untested**.
+
+ * ``threads``, ``wait-sync``: Multithreaded wait-synchronization
+   object
+
+   * ``argobots``, ``qthreads``: **disabled** (these components have
+     not been modified to handle faults; we expect post-failure
+     deadlock)
354
+
310
355
Changelog
311
356
---------
312
357