@@ -6,6 +6,18 @@ User-Level Fault Mitigation (ULFM)
6
6
This chapter documents the features and options specific to the **User
Level Failure Mitigation (ULFM)** Open MPI implementation.
8
8
9
+ Quick Start
10
+ -----------
11
+
12
+ This is an extremely terse summary of how to use ULFM:
13
+
14
+ .. code-block::
+
+    shell$ ./configure --with-ft=ulfm [...options...]
+    shell$ make [-j N] all install
+    shell$ mpicc my-ft-program.c -o my-ft-program
+    shell$ mpiexec -n 4 --with-ft ulfm my-ft-program
20
+
9
21
Features
10
22
--------
11
23
@@ -100,11 +112,12 @@ Available from: https://journals.sagepub.com/doi/10.1177/1094342013488238.
100
112
Building ULFM support in Open MPI
101
113
---------------------------------
102
114
103
- In Open MPI |ompi_ver |, ULFM support is **enabled by default ** |mdash |
104
- when you build Open MPI, unless you specify ``--without-ft ``, ULFM
105
- support will automatically be built.
115
+ In Open MPI |ompi_ver|, ULFM support is **built in by default** |mdash|
+ that is, when you build Open MPI, unless you specify ``--without-ft``, ULFM
+ support is automatically available (but is inactive unless enabled at
+ runtime).
106
119
107
- Optionally, you can specify ``--with-ft `` to ensure that ULFM support
120
+ Optionally, you can specify ``--with-ft=ulfm`` to ensure that ULFM support
108
121
is definitely built.
109
122
110
123
Support notes
@@ -114,89 +127,6 @@ Support notes
114
127
that if you are going to use ULFM, you should disable building
115
128
OpenSHMEM with ``--disable-oshmem``.
116
129
117
- * SLURM is tested and supported with fault tolerance.
118
-
119
- .. important :: Do not use ``srun``, or your application gets killed
120
- by the scheduler upon the first failure. Instead,
121
- use ``mpirun `` in an ``salloc/sbatch `` allocation.
122
-
123
- * LSF is untested with fault tolerance.
124
-
125
- * PBS/Torque is tested and supported with fault tolerance.
126
-
127
- .. important :: Be sure to use ``mpirun`` in a ``qsub`` allocation.
128
-
129
- Modified, Untested and Disabled Components
130
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
131
-
132
- Frameworks and components which are not listed in the following list
133
- are unmodified and support fault tolerance. Listed frameworks may be
134
- **modified ** (and work after a failure), **untested ** (and work before
135
- a failure, but may malfunction after a failure), or **disabled ** (they
136
- cause unspecified behavior all around when FT is enabled).
137
-
138
- All runtime disabled components are listed in the ``ft-mpi `` aggregate
139
- MCA param file
140
- ``$installdir/share/openmpi/amca-param-sets/ft-mpi ``. You can tune the
141
- runtime behavior with ULFM by either setting or unsetting variables in
142
- this file (or by overiding the variable on the command line (e.g.,
143
- ``--mca btl ofi,self ``). Note that if fault tolerance is disabled at
144
- runtime, these components will load normally (this may change observed
145
- performance when comparing with and without fault tolerance).
146
-
147
- * ``pml ``: MPI point-to-point management layer
148
-
149
- * ``monitoring ``, ``v ``: **untested ** (they have not been modified
150
- to handle faults)
151
- * ``cm ``, ``crcpw ``, ``ucx ``: **disabled **
152
-
153
- * ``btl ``: Point-to-point Byte Transfer Layer
154
-
155
- * ``ofi ``, ``portals4 ``, ``smcuda ``, ``usnic ``, ``sm(+knem) ``:
156
- **untested ** (they may work properly, please report)
157
-
158
- * ``mtl ``: Matching transport layer Used for MPI point-to-point messages on
159
- some types of networks
160
-
161
- * All ``mtl `` components are **disabled **
162
-
163
- * ``coll ``: MPI collective algorithms
164
-
165
- * ``cuda ``, ``inter ``, ``sync ``, ``sm ``: **untested ** (they have not
166
- been modified to handle faults, but we expect correct post-fault
167
- behavior)
168
- * ``hcoll ``, ``portals4 `` **disabled ** (they have not been modified
169
- to handle faults, and we expect unspecified post-fault behavior)
170
-
171
- * ``osc ``: MPI one-sided communications
172
-
173
- * All ``osc `` components are **untested ** (they have not been
174
- modified to handle faults, and we expect unspecified post-fault
175
- behavior)
176
-
177
- * ``io ``: MPI I/O and dependent components
178
-
179
- * ``fs ``: File system functions for MPI I/O
180
- * ``fbtl ``: File byte transfer layer: abstraction for individual
181
- read/write operations for OMPIO
182
- * ``fcoll ``: Collective read and write operations for MPI I/O
183
- * ``sharedfp ``: Shared file pointer operations for MPI I/O
184
- * All components in these frameworks are unmodified, **untested **
185
- (we expect clean post-failure abort)
186
-
187
- * ``vprotocol ``: Checkpoint/Restart components
188
-
189
- * These components have not been modified to handle faults, and are
190
- **untested **.
191
-
192
- * ``threads ``, ``wait-sync ``: Multithreaded wait-synchronization
193
- object
194
-
195
- * ``argotbots ``, ``qthreads ``: **disabled ** (these components have
196
- not been modified to handle faults; we expect post-failure
197
- deadlock)
198
-
199
-
200
130
Running ULFM Open MPI
201
131
---------------------
202
132
@@ -215,12 +145,14 @@ Running your application
215
145
216
146
You can launch your application with fault tolerance by simply using
217
147
the normal Open MPI ``mpiexec`` launcher, with the
218
- ``--with-ft ulfm `` CLI option:
148
+ ``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):
219
149
220
150
.. code-block ::
221
151
222
152
shell$ mpirun --with-ft ulfm ...
223
153
154
+ .. important:: By default, fault tolerance is not active; it must be
+    enabled at runtime with ``--with-ft ulfm``.
155
+
224
156
Running under a batch scheduler
225
157
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
226
158
@@ -231,6 +163,18 @@ process fails. In order to avoid this problem, it is preferred that
231
163
you use ``mpiexec`` within an allocation (e.g., ``salloc``,
``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).
233
165
166
+ * SLURM is tested and supported with fault tolerance.
167
+
168
+   .. important:: Do not use ``srun``, or your application will be killed
+      by the scheduler upon the first failure. Instead,
+      use ``mpirun`` in an ``salloc/sbatch`` allocation.
171
+
172
+ * LSF is untested with fault tolerance.
173
+
174
+ * PBS/Torque is tested and supported with fault tolerance.
175
+
176
+   .. important:: Be sure to use ``mpirun`` in a ``qsub`` allocation.
177
+
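As a minimal sketch of the allocation-based launch described above, assuming a
Slurm cluster (the node count, process count, and the ``my-ft-program`` name
are illustrative, not requirements):

.. code-block:: sh

   shell$ salloc -N 2
   shell$ mpirun --with-ft ulfm -n 4 ./my-ft-program

Because ``mpirun`` runs inside the allocation rather than being direct-launched
with ``srun``, the scheduler should not kill the whole job when the first
process fails.
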
234
178
Run-time tuning knobs
235
179
^^^^^^^^^^^^^^^^^^^^^
236
180
@@ -240,12 +184,21 @@ most cases. You can change the default settings with ``--mca
240
184
mpi_ft_foo <value>`` for Open MPI options, and with ``--prtemca
errmgr_detector_bar <value>`` for PRTE options.
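
For instance, a hedged sketch that enables fault tolerance and raises the
fault tolerance verbosity so that detected failures are reported
(``mpi_ft_verbose`` is one of the Open MPI level options listed in this
chapter; the process count and program name are illustrative):

.. code-block:: sh

   shell$ mpirun --with-ft ulfm --mca mpi_ft_verbose 1 -n 4 ./my-ft-program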
242
186
187
+ .. important:: The main control for enabling/disabling fault tolerance
+    at runtime is the ``--with-ft ulfm`` (or its synonym
+    ``--with-ft mpi``) ``mpiexec`` CLI option. This option
+    sets up multiple subsystems of Open MPI to enable fault
+    tolerance. The options described below are best used to
+    override the default behavior after the ``--with-ft ulfm``
+    option is used.
194
+
243
195
PRTE level options
244
196
~~~~~~~~~~~~~~~~~~
245
197
246
- * ``prrte_enable_recovery <true|false> (default: false) `` controls
198
+ * ``prrte_enable_ft <true|false> (default: false)`` controls
247
199
automatic cleanup of apps with failed processes within
248
- mpirun. Enabling this option also enables ``mpi_ft_enable ``.
200
+   mpirun. This option is automatically set to ``true`` when using
+   ``--with-ft ulfm``.
249
202
* ``errmgr_detector_priority <int> (default: 1005)`` selects the
  PRRTE-based failure detector. Only available when
  ``prte_enable_recovery`` is ``true``. You can set this to ``0`` when
@@ -263,17 +216,33 @@ PRTE level options
263
216
Open MPI level options
264
217
~~~~~~~~~~~~~~~~~~~~~~
265
218
266
- * ``mpi_ft_enable <true|false> (default: same as
267
- prrte_enable_recovery) `` permits turning on/off fault tolerance at
268
- runtime. When false, failure detection is disabled; Interfaces
269
- defined by the fault tolerance extensions are substituted with dummy
270
- non-fault tolerant implementations (e.g., ``MPIX_Comm_agree `` is
271
- implemented with ``MPI_Allreduce ``); All other controls below become
272
- irrelevant.
219
+ When you use ``mpiexec --with-ft ulfm``, default values are applied to
+ several Open MPI parameters. These defaults are obtained from the ``ft-mpi``
+ aggregate MCA param file
+ ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
+ runtime behavior with ULFM by either setting or unsetting variables in
+ this file, or by overriding the variable on the command line (e.g.,
+ ``--mca btl ofi,self``).
226
+
227
+ .. important:: Note that if fault tolerance is disabled at runtime
+    (that is, when not using ``--with-ft ulfm``), the
+    ``ft-mpi`` MCA param file is not loaded, thus
+    components that are unsafe for fault tolerance will
+    load normally (this may change observed performance
+    when comparing with and without fault tolerance).
233
+
234
+ * ``mpi_ft_enable <true|false> (default: false)``
+   permits turning on/off fault tolerance at runtime. This option is
+   automatically set to ``true`` from the aggregate MCA param file
+   ``ft-mpi`` loaded when using ``--with-ft ulfm``. When false, failure
+   detection is disabled; interfaces defined by the fault tolerance extensions
+   are substituted with dummy non-fault tolerant implementations (e.g.,
+   ``MPIX_Comm_agree`` is implemented with ``MPI_Allreduce``); all other
+   controls below become irrelevant.
273
242
* ``mpi_ft_verbose <int> (default: 0)`` increases the output of the
274
243
fault tolerance activities. A value of 1 will report detected
275
244
failures.
276
- * ``mpi_ft_detector <true|false> (default: false) ``, **EXPERIMENTAL **
245
+ * ``mpi_ft_detector <true|false> (default: false)``, **DEPRECATED**
277
246
controls the activation of the Open MPI level failure detector. When
278
247
this detector is turned off, all failure detection is delegated to
279
248
PRTE (see above). The Open MPI level fault detector is
@@ -291,22 +260,99 @@ Open MPI level options
291
260
  latency (typically 1us increase). * You may want to **enable this
  option if you experience false positive** processes incorrectly
  reported as failed with the Open MPI failure detector.
+   This option is only relevant when ``mpi_ft_detector`` is ``true``.
294
264
* ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
  period. Recommended value is 1/3 of the timeout. *Values lower than
  100us may impart a noticeable effect on latency (typically a 3us
  increase).*
+   This option is only relevant when ``mpi_ft_detector`` is ``true``.
298
269
* ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
  timeout (i.e. failure detection speed). Recommended value is 3 times
  the heartbeat period.
+   This option is only relevant when ``mpi_ft_detector`` is ``true``.
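
As an illustration of the recommended 1:3 ratio between the heartbeat period
and the timeout, the following sketch sets a 2 second period and a 6 second
timeout. The values are arbitrary examples rather than tuned recommendations,
the process count and program name are illustrative, and the (deprecated)
Open MPI level detector has to be explicitly enabled for these two options to
have any effect:

.. code-block:: sh

   shell$ mpirun --with-ft ulfm --mca mpi_ft_detector true \
                 --mca mpi_ft_detector_period 2 \
                 --mca mpi_ft_detector_timeout 6 \
                 -n 4 ./my-ft-program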
301
273
302
274
Known Limitations in ULFM
303
- ^^^^^^^^^^^^^^^^^^^^^^^^^
275
+ -------------------------
304
276
305
277
* InfiniBand support is provided through the UCT BTL; fault tolerant
306
278
operation over the UCX PML is not yet supported for production runs.
307
279
* TOPO, FILE, RMA are not fault tolerant. They are expected to work
308
280
properly before the occurrence of the first failure.
309
281
282
+ Modified, Untested and Disabled Components
283
+ ------------------------------------------
284
+
285
+ Frameworks and components that do not appear in the following list
+ are unmodified and support fault tolerance. Listed frameworks may be
+ **modified** (and work after a failure), **untested** (and work before
+ a failure, but may malfunction after a failure), or **disabled** (they
+ cause unspecified behavior all around when FT is enabled).
290
+
291
+ All runtime disabled components are listed in the ``ft-mpi`` aggregate
+ MCA param file
+ ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
+ runtime behavior with ULFM by either setting or unsetting variables in
+ this file, or by overriding the variable on the command line (e.g.,
+ ``--mca btl ofi,self``).
297
+
298
+ .. important:: Note that if fault tolerance is disabled at runtime,
+    the ``ft-mpi`` MCA param file is not loaded, thus
+    components that are unsafe for fault tolerance will
+    load normally (this may change observed performance
+    when comparing with and without fault tolerance).
303
+
304
+ * ``pml``: MPI point-to-point management layer
+
+   * ``monitoring``, ``v``: **untested** (they have not been modified
+     to handle faults)
+   * ``cm``, ``crcpw``, ``ucx``: **disabled**
+
+ * ``btl``: Point-to-point Byte Transfer Layer
+
+   * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``:
+     **untested** (they may work properly, please report)
+
+ * ``mtl``: Matching transport layer, used for MPI point-to-point messages on
+   some types of networks
+
+   * All ``mtl`` components are **disabled**
+
+ * ``coll``: MPI collective algorithms
+
+   * ``cuda``, ``inter``, ``sync``, ``sm``: **untested** (they have not
+     been modified to handle faults, but we expect correct post-fault
+     behavior)
+   * ``hcoll``, ``portals4``: **disabled** (they have not been modified
+     to handle faults, and we expect unspecified post-fault behavior)
+
+ * ``osc``: MPI one-sided communications
+
+   * All ``osc`` components are **untested** (they have not been
+     modified to handle faults, and we expect unspecified post-fault
+     behavior)
+
+ * ``io``: MPI I/O and dependent components
+
+   * ``fs``: File system functions for MPI I/O
+   * ``fbtl``: File byte transfer layer: abstraction for individual
+     read/write operations for OMPIO
+   * ``fcoll``: Collective read and write operations for MPI I/O
+   * ``sharedfp``: Shared file pointer operations for MPI I/O
+   * All components in these frameworks are unmodified, **untested**
+     (we expect clean post-failure abort)
+
+ * ``vprotocol``: Checkpoint/Restart components
+
+   * These components have not been modified to handle faults, and are
+     **untested**.
+
+ * ``threads``, ``wait-sync``: Multithreaded wait-synchronization
+   object
+
+   * ``argobots``, ``qthreads``: **disabled** (these components have
+     not been modified to handle faults; we expect post-failure
+     deadlock)
355
+
310
356
Changelog
311
357
---------
312
358