@@ -6,6 +6,18 @@ User-Level Fault Mitigation (ULFM)
6
6
This chapter documents the features and options specific to the **User
7
7
Level Failure Mitigation (ULFM)** Open MPI implementation.
8
8
9
+ Quick Start
10
+ -----------
11
+
12
+ This is an extremely terse summary of how to use ULFM:
13
+
14
+ .. code-block::
15
+
16
+ shell$ ./configure --with-ft=ulfm [...options...]
17
+ shell$ make [-j N] all install
18
+ shell$ mpicc my-ft-program.c -o my-ft-program
19
+ shell$ mpirun -n 4 --with-ft ulfm my-ft-program
20
+
9
21
Features
10
22
--------
11
23
@@ -100,102 +112,17 @@ Available from: https://journals.sagepub.com/doi/10.1177/1094342013488238.
100
112
Building ULFM support in Open MPI
101
113
---------------------------------
102
114
103
- In Open MPI |ompi_ver |, ULFM support is **enabled by default ** |mdash |
104
- when you build Open MPI, unless you specify ``--without-ft ``, ULFM
105
- support will automatically be built.
115
+ In Open MPI |ompi_ver|, ULFM support is **built-in by default** |mdash|
116
+ that is, when you build Open MPI, unless you specify ``--without-ft``, ULFM
117
+ support is automatically available (but is inactive unless enabled at
118
+ runtime).
106
119
107
- Optionally, you can specify ``--with-ft `` to ensure that ULFM support
120
+ Optionally, you can specify ``--with-ft=ulfm`` to ensure that ULFM support
108
121
is definitely built.
109
122
110
- Support notes
111
- ^^^^^^^^^^^^^
112
-
113
- * ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
114
- that if you are going to use ULFM, you should disable building
115
- OpenSHMEM with ``--disable-oshmem ``.
116
-
117
- * SLURM is tested and supported with fault tolerance.
118
-
119
- .. important :: Do not use ``srun``, or your application gets killed
120
- by the scheduler upon the first failure. Instead,
121
- use ``mpirun `` in an ``salloc/sbatch `` allocation.
122
-
123
- * LSF is untested with fault tolerance.
124
-
125
- * PBS/Torque is tested and supported with fault tolerance.
126
-
127
- .. important :: Be sure to use ``mpirun`` in a ``qsub`` allocation.
128
-
129
- Modified, Untested and Disabled Components
130
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
131
-
132
- Frameworks and components which are not listed in the following list
133
- are unmodified and support fault tolerance. Listed frameworks may be
134
- **modified ** (and work after a failure), **untested ** (and work before
135
- a failure, but may malfunction after a failure), or **disabled ** (they
136
- cause unspecified behavior all around when FT is enabled).
137
-
138
- All runtime disabled components are listed in the ``ft-mpi `` aggregate
139
- MCA param file
140
- ``$installdir/share/openmpi/amca-param-sets/ft-mpi ``. You can tune the
141
- runtime behavior with ULFM by either setting or unsetting variables in
142
- this file (or by overriding the variable on the command line (e.g.,
143
- ``--mca btl ofi,self ``). Note that if fault tolerance is disabled at
144
- runtime, these components will load normally (this may change observed
145
- performance when comparing with and without fault tolerance).
146
-
147
- * ``pml ``: MPI point-to-point management layer
148
-
149
- * ``monitoring ``, ``v ``: **untested ** (they have not been modified
150
- to handle faults)
151
- * ``cm ``, ``crcpw ``, ``ucx ``: **disabled **
152
-
153
- * ``btl ``: Point-to-point Byte Transfer Layer
154
-
155
- * ``ofi ``, ``portals4 ``, ``smcuda ``, ``usnic ``, ``sm(+knem) ``:
156
- **untested ** (they may work properly, please report)
157
-
158
- * ``mtl ``: Matching transport layer Used for MPI point-to-point messages on
159
- some types of networks
160
-
161
- * All ``mtl `` components are **disabled **
162
-
163
- * ``coll ``: MPI collective algorithms
164
-
165
- * ``cuda ``, ``inter ``, ``sync ``, ``sm ``: **untested ** (they have not
166
- been modified to handle faults, but we expect correct post-fault
167
- behavior)
168
- * ``hcoll ``, ``portals4 `` **disabled ** (they have not been modified
169
- to handle faults, and we expect unspecified post-fault behavior)
170
-
171
- * ``osc ``: MPI one-sided communications
172
-
173
- * All ``osc `` components are **untested ** (they have not been
174
- modified to handle faults, and we expect unspecified post-fault
175
- behavior)
176
-
177
- * ``io ``: MPI I/O and dependent components
178
-
179
- * ``fs ``: File system functions for MPI I/O
180
- * ``fbtl ``: File byte transfer layer: abstraction for individual
181
- read/write operations for OMPIO
182
- * ``fcoll ``: Collective read and write operations for MPI I/O
183
- * ``sharedfp ``: Shared file pointer operations for MPI I/O
184
- * All components in these frameworks are unmodified, **untested **
185
- (we expect clean post-failure abort)
186
-
187
- * ``vprotocol ``: Checkpoint/Restart components
188
-
189
- * These components have not been modified to handle faults, and are
190
- **untested **.
191
-
192
- * ``threads ``, ``wait-sync ``: Multithreaded wait-synchronization
193
- object
194
-
195
- * ``argotbots ``, ``qthreads ``: **disabled ** (these components have
196
- not been modified to handle faults; we expect post-failure
197
- deadlock)
198
-
123
+ .. note:: ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
124
+ that if you are going to use ULFM, you should disable building OpenSHMEM
125
+ with ``--disable-oshmem``.
199
126
200
127
Running ULFM Open MPI
201
128
---------------------
@@ -214,22 +141,40 @@ Running your application
214
141
^^^^^^^^^^^^^^^^^^^^^^^^
215
142
216
143
You can launch your application with fault tolerance by simply using
217
- the normal Open MPI ``mpiexec `` launcher, with the
218
- ``--with-ft ulfm `` CLI option:
144
+ the normal Open MPI ``mpirun`` launcher, with the
145
+ ``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):
219
146
220
147
.. code-block::
221
148
222
149
shell$ mpirun --with-ft ulfm ...
223
150
151
+ .. important:: By default, fault tolerance is not active at run time.
152
+ It must be enabled via ``--with-ft ulfm``.
153
+
224
154
Running under a batch scheduler
225
155
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
226
156
227
157
ULFM can operate under a job/batch scheduler, and is tested routinely
228
158
with ALPS, PBS, and Slurm. One difficulty comes from the fact that
229
- many job schedulers will "cleanup" the application as soon as any
230
- process fails. In order to avoid this problem, it is preferred that
231
- you use ``mpiexec `` within an allocation (e.g., ``salloc ``,
232
- ``sbatch ``, ``qsub ``) rather than a direct launch (e.g., ``srun ``).
159
+ many job schedulers handle failures by triggering an immediate "cleanup"
160
+ of the application as soon as any process fails. In addition, failure
161
+ detection subsystems integrated into PRTE are not active in direct launch
162
+ scenarios and may not have a launcher-specific alternative. This may cause
163
+ the application to not detect failures and hang. In order to avoid these
164
+ problems, it is preferred that you use ``mpirun`` within an allocation
165
+ (e.g., ``salloc``, ``sbatch``, ``qsub``) rather than a direct launch.
166
+
167
+ * SLURM is tested and supported with fault tolerance.
168
+
169
+ .. important:: Use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.
170
+ Direct launch with ``srun`` is not supported.
171
+
172
+ * PBS/Torque is tested and supported with fault tolerance.
173
+
174
+ .. important:: Use ``mpirun`` in a ``qsub`` allocation. Direct launch
175
+ with ``aprun`` is not supported.
176
+
177
+ * LSF is untested with fault tolerance.
233
178
234
179
Run-time tuning knobs
235
180
^^^^^^^^^^^^^^^^^^^^^
@@ -240,12 +185,20 @@ most cases. You can change the default settings with ``--mca
240
185
mpi_ft_foo <value>`` for Open MPI options, and with ``--prtemca
241
186
errmgr_detector_bar <value>`` for PRTE options.
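As an illustration of the ``--mca`` override mechanism described above, the following sketch assembles an ``mpirun`` command line that enables ULFM and sets ``mpi_ft_verbose`` to 1. The helper name ``ft_cmd``, the process count, and the program name are placeholders invented for this example and are not part of Open MPI. The function only prints the command, so it can be inspected before being run.

```shell
#!/bin/sh
# ft_cmd is a hypothetical helper (not part of Open MPI): it prints an
# mpirun command line with ULFM enabled plus any extra options given.
ft_cmd() {
    nprocs=$1
    prog=$2
    shift 2
    # --with-ft ulfm enables fault tolerance at run time; the remaining
    # arguments are passed through unchanged (e.g., --mca overrides).
    echo "mpirun -n $nprocs --with-ft ulfm $* $prog"
}

# Ask Open MPI to report detected failures (mpi_ft_verbose 1).
ft_cmd 4 ./my-ft-program --mca mpi_ft_verbose 1
```

PRTE-level options would be passed the same way, using ``--prtemca`` in place of ``--mca``.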
242
187
188
+ .. important:: The main control for enabling/disabling fault tolerance
189
+ at runtime is the ``--with-ft ulfm`` (or its synonym ``--with-ft mpi``)
190
+ ``mpirun`` CLI option. This option sets up multiple subsystems in
191
+ Open MPI to enable fault tolerance. The options described below are
192
+ best used to override the default behavior after the ``--with-ft ulfm``
193
+ option is used.
194
+
243
195
PRTE level options
244
196
~~~~~~~~~~~~~~~~~~
245
197
246
- * ``prrte_enable_recovery <true|false> (default: false) `` controls
198
+ * ``prrte_enable_ft <true|false> (default: false)`` controls
247
199
automatic cleanup of apps with failed processes within
248
- mpirun. Enabling this option also enables ``mpi_ft_enable ``.
200
+ ``mpirun``. This option is automatically set to ``true`` when using
201
+ ``--with-ft ulfm``.
249
202
* ``errmgr_detector_priority <int> (default: 1005)`` selects the
250
203
PRRTE-based failure detector. Only available when
251
204
``prte_enable_recovery`` is ``true``. You can set this to ``0`` when
@@ -263,17 +216,32 @@ PRTE level options
263
216
Open MPI level options
264
217
~~~~~~~~~~~~~~~~~~~~~~
265
218
266
- * ``mpi_ft_enable <true|false> (default: same as
267
- prrte_enable_recovery) `` permits turning on/off fault tolerance at
268
- runtime. When false, failure detection is disabled; Interfaces
269
- defined by the fault tolerance extensions are substituted with dummy
270
- non-fault tolerant implementations (e.g., ``MPIX_Comm_agree `` is
271
- implemented with ``MPI_Allreduce ``); All other controls below become
272
- irrelevant.
219
+ Default values are applied to some Open MPI parameters when using
220
+ ``mpirun --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
221
+ aggregate MCA param file
222
+ ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
223
+ runtime behavior of ULFM by either setting or unsetting variables in
224
+ this file, or by overriding the variable on the command line (e.g.,
225
+ ``--mca btl ofi,self``).
226
+
227
+ .. important:: Note that if fault tolerance is disabled at runtime
228
+ (that is, when not using ``--with-ft ulfm``), the ``ft-mpi`` AMCA
229
+ param file is not loaded, thus components that are unsafe for fault
230
+ tolerance will load normally (this may change observed performance
231
+ when comparing with and without fault tolerance).
232
+
233
+ * ``mpi_ft_enable <true|false> (default: false)``
234
+ permits turning on/off fault tolerance at runtime. This option is
235
+ automatically set to ``true`` from the aggregate MCA param file
236
+ ``ft-mpi`` loaded when using ``--with-ft ulfm``. When false, failure
237
+ detection is disabled; interfaces defined by the fault tolerance extensions
238
+ are substituted with dummy non-fault tolerant implementations (e.g.,
239
+ ``MPIX_Comm_agree`` is implemented with ``MPI_Allreduce``); all other
240
+ controls below become irrelevant.
273
241
* ``mpi_ft_verbose <int> (default: 0)`` increases the output of the
274
242
fault tolerance activities. A value of 1 will report detected
275
243
failures.
276
- * ``mpi_ft_detector <true|false> (default: false) ``, **EXPERIMENTAL **
244
+ * ``mpi_ft_detector <true|false> (default: false)``, **DEPRECATED**
277
245
controls the activation of the Open MPI level failure detector. When
278
246
this detector is turned off, all failure detection is delegated to
279
247
PRTE (see above). The Open MPI level fault detector is
@@ -284,29 +252,113 @@ Open MPI level options
284
252
``MPI_COMM_WORLD`` exclusively. Processes connected from
285
253
``MPI_COMM_CONNECT``/``ACCEPT`` and ``MPI_COMM_SPAWN`` may
286
254
occasionally not be detected when they fail.
255
+
256
+ .. caution:: This component is deprecated. Failure detection is now
257
+ performed at the PRTE level. See the section above on controlling
258
+ PRTE behavior for information about how to tune the failure detector.
259
+
287
260
* ``mpi_ft_detector_thread <true|false> (default: false)`` controls
288
261
the use of a thread to emit and receive failure detector's
289
262
heartbeats. *Setting this value to "true" will also set
290
263
MPI_THREAD_MULTIPLE support, which has a noticeable effect on
291
264
latency (typically 1us increase).* You may want to **enable this
292
265
option if you experience false positive** processes incorrectly
293
266
reported as failed with the Open MPI failure detector.
267
+
268
+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
269
+
294
270
* ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
295
271
period. Recommended value is 1/3 of the timeout. _Values lower than
296
272
100us may impart a noticeable effect on latency (typically a 3us
297
273
increase)._
274
+
275
+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
276
+
298
277
* ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
299
278
timeout (i.e. failure detection speed). Recommended value is 3 times
300
279
the heartbeat period.
301
280
281
+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
282
+
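The recommendation above, that the heartbeat period be one third of the timeout, can be sketched as a small shell helper. This is only an illustration: ``detector_flags`` is a hypothetical name that is not part of Open MPI, and the resulting options apply only to the deprecated Open MPI level detector (that is, when ``mpi_ft_detector`` is ``true``).

```shell
#!/bin/sh
# detector_flags is a hypothetical helper (not part of Open MPI): given a
# heartbeat timeout in seconds, it prints --mca options that set the
# timeout and a heartbeat period equal to one third of it.
detector_flags() {
    timeout=$1
    # awk handles the floating-point division; %g trims trailing zeros.
    period=$(awk -v t="$timeout" 'BEGIN { printf "%g", t / 3 }')
    echo "--mca mpi_ft_detector true" \
         "--mca mpi_ft_detector_timeout $timeout" \
         "--mca mpi_ft_detector_period $period"
}

# Example: a 3 second timeout gives a 1 second heartbeat period.
detector_flags 3
```

The printed options could then be appended to an ``mpirun --with-ft ulfm`` command line.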
302
283
Known Limitations in ULFM
303
- ^^^^^^^^^^^^^^^^^^^^^^^^^
284
+ -------------------------
304
285
305
286
* InfiniBand support is provided through the UCT BTL; fault tolerant
306
287
operation over the UCX PML is not yet supported for production runs.
307
288
* TOPO, FILE, RMA are not fault tolerant. They are expected to work
308
289
properly before the occurrence of the first failure.
309
290
291
+ Modified, Untested and Disabled Components
292
+ ------------------------------------------
293
+
294
+ Frameworks and components are listed below and categorized into one of
295
+ three classifications:
296
+
297
+ 1. **Modified:** This framework/component has been specifically modified
298
+ such that it will continue to work after a failure.
299
+ 2. **Untested:** This framework/component has not been modified and/or
300
+ tested with fault tolerance scenarios, and *may* malfunction
301
+ after a failure.
302
+ 3. **Disabled:** This framework/component will cause unspecified behavior when
303
+ fault tolerance is enabled. As a consequence, it will be disabled when the
304
+ ``--with-ft ulfm`` option is used (see above for details about implicit
305
+ parameters loaded from the ``ft-mpi`` aggregate param file).
306
+
307
+ Any framework or component not listed below is categorized as **Unmodified**,
308
+ meaning that it is unmodified for fault tolerance, but will continue to work
309
+ correctly after a failure.
310
+
311
+ * ``pml``: MPI point-to-point management layer
312
+
313
+ * ``monitoring``, ``v``: **untested** (they have not been modified to handle
314
+ faults)
315
+ * ``cm``, ``crcpw``, ``ucx``: **disabled**
316
+
317
+ * ``btl``: Point-to-point Byte Transfer Layer
318
+
319
+ * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``: **untested**
320
+ (they may work properly, please report)
321
+
322
+ * ``mtl``: Matching transport layer, used for MPI point-to-point messages on
323
+ some types of networks
324
+
325
+ * All ``mtl`` components are **disabled**
326
+
327
+ * ``coll``: MPI collective algorithms
328
+
329
+ * ``cuda``, ``inter``, ``sync``, ``sm``: **untested** (they have not
330
+ been modified to handle faults, but we expect correct post-fault
331
+ behavior)
332
+ * ``hcoll``, ``portals4``: **disabled** (they have not been modified
333
+ to handle faults, and we expect unspecified post-fault behavior)
334
+
335
+ * ``osc``: MPI one-sided communications
336
+
337
+ * All ``osc`` components are **untested** (they have not been
338
+ modified to handle faults, and we expect unspecified post-fault
339
+ behavior)
340
+
341
+ * ``io``: MPI I/O and dependent components
342
+
343
+ * ``fs``: File system functions for MPI I/O
344
+ * ``fbtl``: File byte transfer layer: abstraction for individual
345
+ read/write operations for OMPIO
346
+ * ``fcoll``: Collective read and write operations for MPI I/O
347
+ * ``sharedfp``: Shared file pointer operations for MPI I/O
348
+ * All components in these frameworks are unmodified, **untested**
349
+ (we expect clean post-failure abort)
350
+
351
+ * ``vprotocol``: Checkpoint/Restart components
352
+
353
+ * All ``vprotocol`` components are **untested**
354
+
355
+ * ``threads``, ``wait-sync``: Multithreaded wait-synchronization
356
+ object
357
+
358
+ * ``argobots``, ``qthreads``: **disabled** (these components have
359
+ not been modified to handle faults; we expect post-failure
360
+ deadlock)
361
+
310
362
Changelog
311
363
---------
312
364