@@ -16,7 +16,7 @@ This is an extremely terse summary of how to use ULFM:
    shell$ ./configure --with-ft=ulfm [...options...]
    shell$ make [-j N] all install
    shell$ mpicc my-ft-program.c -o my-ft-program
-   shell$ mpiexec -n 4 --with-ft ulfm my-ft-program
+   shell$ mpirun -n 4 --with-ft ulfm my-ft-program

 Features
 --------
@@ -144,14 +144,15 @@ Running your application
 ^^^^^^^^^^^^^^^^^^^^^^^^

 You can launch your application with fault tolerance by simply using
-the normal Open MPI ``mpiexec`` launcher, with the
+the normal Open MPI ``mpirun`` launcher, with the
 ``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):

 .. code-block::

    shell$ mpirun --with-ft ulfm ...

-.. important:: by default, fault tolerance is not active.
+.. important:: By default, fault tolerance is not active at run time.
+   It must be enabled via ``--with-ft ulfm``.

 Running under a batch scheduler
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -160,14 +161,14 @@ ULFM can operate under a job/batch scheduler, and is tested routinely
 with ALPS, PBS, and Slurm. One difficulty comes from the fact that
 many job schedulers will "clean up" the application as soon as any
 process fails. In order to avoid this problem, it is preferred that
-you use ``mpiexec`` within an allocation (e.g., ``salloc``,
+you use ``mpirun`` within an allocation (e.g., ``salloc``,
 ``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).
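
As a minimal sketch of the preferred pattern under Slurm, the following writes a small batch script that launches through ``mpirun`` inside the allocation. The script name ``ft-job.sh`` and the node and task counts are illustrative assumptions; ``my-ft-program`` is the example binary from the summary at the top of this page.

```shell
#!/bin/sh
# Write a hypothetical Slurm batch script (file name and resource
# counts are assumptions chosen for illustration).
cat > ft-job.sh <<'EOF'
#!/bin/sh
#SBATCH --nodes=2
#SBATCH --ntasks=4
# Launch with mpirun inside the allocation rather than a direct launch,
# which is the preferred way to keep the scheduler from cleaning up the
# job on the first process failure.
mpirun -n 4 --with-ft ulfm ./my-ft-program
EOF

# Print the submission command rather than running it.
echo "sbatch ft-job.sh"
```

The same ``mpirun`` line should also work from an interactive ``salloc`` session; only the way the allocation is obtained differs.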

 * SLURM is tested and supported with fault tolerance.

   .. important:: Do not use ``srun``, or your application gets killed
      by the scheduler upon the first failure. Instead,
-     use ``mpirun`` in an ``salloc/sbatch`` allocation.
+     use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.

 * LSF is untested with fault tolerance.

@@ -186,8 +187,8 @@ errmgr_detector_bar <value>`` for PRTE options.

 .. important:: The main control for enabling/disabling fault tolerance
    at runtime is the ``--with-ft ulfm`` (or its synonym
-   ``--with-ft mpi``) ``mpiexec`` CLI option. This option
-   setup multiple subsystems of Open MPI to enable fault
+   ``--with-ft mpi``) ``mpirun`` CLI option. This option
+   sets up multiple subsystems in Open MPI to enable fault
    tolerance. The options described below are best used to
    override the default behavior after the ``--with-ft ulfm``
    option is used.
@@ -216,17 +217,17 @@ PRTE level options
 Open MPI level options
 ~~~~~~~~~~~~~~~~~~~~~~

-Some default values are applied to some Open MPI parameters when using
-``mpiexec --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
+Default values are applied to some Open MPI parameters when using
+``mpirun --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
 aggregate MCA param file
 ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
-runtime behavior with ULFM by either setting or unsetting variables in
+runtime behavior of ULFM by either setting or unsetting variables in
 this file, or by overriding the variable on the command line (e.g.,
 ``--mca btl ofi,self``).

 .. important:: Note that if fault tolerance is disabled at runtime
-   that is, when not using ``--with-ft ulfm``), the
-   ``ft-mpi`` MCA param file is not loaded, thus
+   (that is, when not using ``--with-ft ulfm``), the
+   ``ft-mpi`` AMCA param file is not loaded, thus
    components that are unsafe for fault tolerance will
    load normally (this may change observed performance
    when comparing with and without fault tolerance).
@@ -260,16 +261,16 @@ this file, or by overiding the variable on the command line (e.g.,
   latency (typically 1us increase). You may want to **enable this
   option if you experience false positives** (processes incorrectly
   reported as failed with the Open MPI failure detector).
-  This option is only relevant when `mpi_ft_detector` is `true`.
+  This option is only relevant when ``mpi_ft_detector`` is ``true``.
 * ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
   period. Recommended value is 1/3 of the timeout. *Values lower than
   100us may impart a noticeable effect on latency (typically a 3us
   increase).*
-  This option is only relevant when `mpi_ft_detector` is `true`.
+  This option is only relevant when ``mpi_ft_detector`` is ``true``.
 * ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
   timeout (i.e. failure detection speed). Recommended value is 3 times
   the heartbeat period.
-  This option is only relevant when `mpi_ft_detector` is `true`.
+  This option is only relevant when ``mpi_ft_detector`` is ``true``.
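
To illustrate the recommended 1/3 relationship between the heartbeat period and the timeout, here is a small sketch that derives the period from a chosen timeout and prints the resulting command line. The helper name ``ft_detector_args`` and the 3-second timeout are hypothetical choices for illustration; passing the parameters via ``--mca`` assumes the same command-line override mechanism shown for ``--mca btl ofi,self``.

```shell
#!/bin/sh
# Sketch: derive the heartbeat period from a timeout (period = timeout / 3)
# and print the mpirun command line instead of executing it.

# ft_detector_args TIMEOUT_SECONDS
# Prints the --mca arguments for the failure detector (hypothetical helper).
ft_detector_args() {
    timeout="$1"
    # awk performs the floating-point division that POSIX sh lacks.
    period=$(awk -v t="$timeout" 'BEGIN { printf "%g", t / 3 }')
    printf '%s' "--mca mpi_ft_detector true --mca mpi_ft_detector_timeout $timeout --mca mpi_ft_detector_period $period"
}

# Example values only: 4 processes, 3-second timeout.
echo "mpirun -n 4 --with-ft ulfm $(ft_detector_args 3) ./my-ft-program"
```

Deriving one value from the other keeps the two settings consistent when experimenting with faster or slower failure detection.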

 Known Limitations in ULFM
 -------------------------
@@ -282,24 +283,20 @@ Known Limitations in ULFM
 Modified, Untested and Disabled Components
 ------------------------------------------

-Frameworks and components which are not listed in the following list
-are unmodified and support fault tolerance. Listed frameworks may be
-**modified** (and work after a failure), **untested** (and work before
-a failure, but may malfunction after a failure), or **disabled** (they
-cause unspecified behavior all around when FT is enabled).
+Frameworks and components are listed below and categorized into one of
+three classifications:

-All runtime disabled components are listed in the ``ft-mpi`` aggregate
-MCA param file
-``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
-runtime behavior with ULFM by either setting or unsetting variables in
-this file (or by overiding the variable on the command line (e.g.,
-``--mca btl ofi,self``).
+1. **Modified:** This framework/component has been specifically modified
+   such that it will continue to work after a failure.
+2. **Untested:** This framework/component has not been modified and/or
+   tested with fault tolerance scenarios, and *may* malfunction
+   after a failure.
+3. **Disabled:** This framework/component will cause unspecified behavior when
+   fault tolerance is enabled.

-.. important:: Note that if fault tolerance is disabled at runtime,
-   the ``ft-mpi`` MCA param file is not loaded, thus
-   components that are unsafe for fault tolerance will
-   load normally (this may change observed performance
-   when comparing with and without fault tolerance).
+Any framework or component not listed below is categorized as **Unmodified**,
+meaning that it is unmodified for fault tolerance, but will continue to work
+correctly after a failure.

 * ``pml``: MPI point-to-point management layer

@@ -343,8 +340,7 @@ this file (or by overiding the variable on the command line (e.g.,

 * ``vprotocol``: Checkpoint/Restart components

-  * These components have not been modified to handle faults, and are
-    **untested**.
+  * All ``vprotocol`` components are **untested**

 * ``threads``, ``wait-sync``: Multithreaded wait-synchronization
   object