Commit 9053e5e

ulfm readme: review comments from Jeff
Signed-off-by: Aurelien Bouteiller <[email protected]>
1 parent d8bde8f commit 9053e5e

1 file changed: +29 -33 lines changed

docs/features/ulfm.rst

Lines changed: 29 additions & 33 deletions
@@ -16,7 +16,7 @@ This is an extremely terse summary of how to use ULFM:
 shell$ ./configure --with-ft=ulfm [...options...]
 shell$ make [-j N] all install
 shell$ mpicc my-ft-program.c -o my-ft-program
-shell$ mpiexec -n 4 --with-ft ulfm my-ft-program
+shell$ mpirun -n 4 --with-ft ulfm my-ft-program
 
 Features
 --------
@@ -144,14 +144,15 @@ Running your application
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
 You can launch your application with fault tolerance by simply using
-the normal Open MPI ``mpiexec`` launcher, with the
+the normal Open MPI ``mpirun`` launcher, with the
 ``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):
 
 .. code-block::
 
 shell$ mpirun --with-ft ulfm ...
 
-.. important:: by default, fault tolerance is not active.
+.. important:: By default, fault tolerance is not active at run time.
+It must be enabled via `--with-ft ulfm`.
 
 Running under a batch scheduler
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -160,14 +161,14 @@ ULFM can operate under a job/batch scheduler, and is tested routinely
 with ALPS, PBS, and Slurm. One difficulty comes from the fact that
 many job schedulers will "cleanup" the application as soon as any
 process fails. In order to avoid this problem, it is preferred that
-you use ``mpiexec`` within an allocation (e.g., ``salloc``,
+you use ``mpirun`` within an allocation (e.g., ``salloc``,
 ``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).
 
 * SLURM is tested and supported with fault tolerance.
 
 .. important:: Do not use ``srun``, or your application gets killed
 by the scheduler upon the first failure. Instead,
-use ``mpirun`` in an ``salloc/sbatch`` allocation.
+use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.
 
 * LSF is untested with fault tolerance.
 
@@ -186,8 +187,8 @@ errmgr_detector_bar <value>`` for PRTE options.
 
 .. important:: The main control for enabling/disabling fault tolerance
 at runtime is the ``--with-ft ulfm`` (or its synomym
-``--with-ft mpi``) ``mpiexec`` CLI option. This option
-setup multiple subsystems of Open MPI to enable fault
+``--with-ft mpi``) ``mpirun`` CLI option. This option
+sets up multiple subsystems in Open MPI to enable fault
 tolerance. The options described below are best used to
 overide the default behavior after the ``--with-ft ulfm``
 opion is used.
@@ -216,17 +217,17 @@ PRTE level options
 Open MPI level options
 ~~~~~~~~~~~~~~~~~~~~~~
 
-Some default values are applied to some Open MPI parameters when using
-``mpiexec --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
+Default values are applied to some Open MPI parameters when using
+``mpirun --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
 aggregate MCA param file
 ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
-runtime behavior with ULFM by either setting or unsetting variables in
+runtime behavior of ULFM by either setting or unsetting variables in
 this file, or by overiding the variable on the command line (e.g.,
 ``--mca btl ofi,self``).
 
 .. important:: Note that if fault tolerance is disabled at runtime,
-that is, when not using ``--with-ft ulfm``), the
-``ft-mpi`` MCA param file is not loaded, thus
+(that is, when not using ``--with-ft ulfm``), the
+``ft-mpi`` AMCA param file is not loaded, thus
 components that are unsafe for fault tolerance will
 load normally (this may change observed performance
 when comparing with and without fault tolerance).
@@ -260,16 +261,16 @@ this file, or by overiding the variable on the command line (e.g.,
 latency (typically 1us increase).* You may want to **enable this
 option if you experience false positive** processes incorrectly
 reported as failed with the Open MPI failure detector.
-This option is only relevant when `mpi_ft_detector` is `true`.
+This option is only relevant when ``mpi_ft_detector`` is ``true``.
 * ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
 period. Recommended value is 1/3 of the timeout. _Values lower than
 100us may impart a noticeable effect on latency (typically a 3us
 increase)._
-This option is only relevant when `mpi_ft_detector` is `true`.
+This option is only relevant when ``mpi_ft_detector`` is ``true``.
 * ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
 timeout (i.e. failure detection speed). Recommended value is 3 times
 the heartbeat period.
-This option is only relevant when `mpi_ft_detector` is `true`.
+This option is only relevant when ``mpi_ft_detector`` is ``true``.
 
 Known Limitations in ULFM
 -------------------------
@@ -282,24 +283,20 @@ Known Limitations in ULFM
 Modified, Untested and Disabled Components
 ------------------------------------------
 
-Frameworks and components which are not listed in the following list
-are unmodified and support fault tolerance. Listed frameworks may be
-**modified** (and work after a failure), **untested** (and work before
-a failure, but may malfunction after a failure), or **disabled** (they
-cause unspecified behavior all around when FT is enabled).
+Frameworks and components are listed below and categorized into one of
+three classifications:
 
-All runtime disabled components are listed in the ``ft-mpi`` aggregate
-MCA param file
-``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
-runtime behavior with ULFM by either setting or unsetting variables in
-this file (or by overiding the variable on the command line (e.g.,
-``--mca btl ofi,self``).
+1. **Modified:** This framework/component has been specifically modified
+such that it will continue to work after a failure.
+2. **Untested:** This framework/component has not been modified and/or
+tested with fault tolerance scenarios, and _may_ malfunction
+after a failure.
+3. **Disabled:** This framework/component will cause unspecified behavior when
+fault tolerance is enabled.
 
-.. important:: Note that if fault tolerance is disabled at runtime,
-the ``ft-mpi`` MCA param file is not loaded, thus
-components that are unsafe for fault tolerance will
-load normally (this may change observed performance
-when comparing with and without fault tolerance).
+Any framework or component not listed below are categorized as **Unmodified**,
+meaning that it is unmodified for fault tolerance, but will continue to work
+correctly after a failure.
 
 * ``pml``: MPI point-to-point management layer
 
@@ -343,8 +340,7 @@ this file (or by overiding the variable on the command line (e.g.,
 
 * ``vprotocol``: Checkpoint/Restart components
 
-* These components have not been modified to handle faults, and are
-**untested**.
+* All ``vprotocol`` components are **untested**
 
 * ``threads``, ``wait-sync``: Multithreaded wait-synchronization
 object
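
As a rough illustration of the options touched by this diff, the sketch below prints (without executing) an ``mpirun`` command line that enables ULFM and sets the heartbeat period to one third of a chosen timeout, following the recommendation in the documented text. The helper name `build_ft_cmd`, the example timeout of 9 seconds, and passing the detector parameters through `--mca` are assumptions for illustration only; the diff itself only shows `--mca btl ofi,self` as a command-line override.

```shell
#!/bin/sh
# Dry-run sketch: build and print an mpirun command line; nothing is launched.
# build_ft_cmd is a hypothetical helper, not part of Open MPI.
build_ft_cmd() {
    timeout="$1"   # heartbeat timeout in seconds (mpi_ft_detector_timeout)
    nprocs="$2"    # number of processes passed to -n
    program="$3"   # application binary

    # The documentation recommends a period of 1/3 of the timeout.
    period=$(awk -v t="$timeout" 'BEGIN { printf "%g", t / 3 }')

    # Assumption: the detector parameters are overridden with --mca,
    # in the same way the text overrides btl.
    echo "mpirun -n $nprocs --with-ft ulfm" \
         "--mca mpi_ft_detector true" \
         "--mca mpi_ft_detector_period $period" \
         "--mca mpi_ft_detector_timeout $timeout" \
         "$program"
}

build_ft_cmd 9 4 ./my-ft-program
```

Under Slurm, the printed command would be run inside an ``salloc`` or ``sbatch`` allocation rather than through ``srun``, as the text above advises.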
