Commit d97ae55

Update the ULFM Readme
Comment indentation
Signed-off-by: Aurelien Bouteiller <[email protected]>

ulfm readme: review comments from Jeff
Signed-off-by: Aurelien Bouteiller <[email protected]>

ulfm readme: Indentation of the 'important' and lists
Signed-off-by: Aurelien Bouteiller <[email protected]>

1 parent 96fadd9 commit d97ae55

2 files changed: +166 -114 lines changed

docs/features/ulfm.rst

Lines changed: 162 additions & 110 deletions
@@ -6,6 +6,18 @@ User-Level Fault Mitigation (ULFM)
 This chapter documents the features and options specific to the **User
 Level Failure Mitigation (ULFM)** Open MPI implementation.
 
+Quick Start
+-----------
+
+This is an extremely terse summary of how to use ULFM:
+
+.. code-block::
+
+   shell$ ./configure --with-ft=ulfm [...options...]
+   shell$ make [-j N] all install
+   shell$ mpicc my-ft-program.c -o my-ft-program
+   shell$ mpirun -n 4 --with-ft ulfm my-ft-program
+
 Features
 --------
 
@@ -100,102 +112,17 @@ Available from: https://journals.sagepub.com/doi/10.1177/1094342013488238.
 Building ULFM support in Open MPI
 ---------------------------------
 
-In Open MPI |ompi_ver|, ULFM support is **enabled by default** |mdash|
-when you build Open MPI, unless you specify ``--without-ft``, ULFM
-support will automatically be built.
+In Open MPI |ompi_ver|, ULFM support is **built-in by default** |mdash|
+that is, when you build Open MPI, unless you specify ``--without-ft``, ULFM
+support is automatically available (but is inactive unless enabled at
+runtime).
 
-Optionally, you can specify ``--with-ft`` to ensure that ULFM support
+Optionally, you can specify ``--with-ft ulfm`` to ensure that ULFM support
 is definitely built.
 
-Support notes
-^^^^^^^^^^^^^
-
-* ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
-  that if you are going to use ULFM, you should disable building
-  OpenSHMEM with ``--disable-oshmem``.
-
-* SLURM is tested and supported with fault tolerance.
-
-  .. important:: Do not use ``srun``, or your application gets killed
-     by the scheduler upon the first failure. Instead,
-     use ``mpirun`` in an ``salloc/sbatch`` allocation.
-
-* LSF is untested with fault tolerance.
-
-* PBS/Torque is tested and supported with fault tolerance.
-
-  .. important:: Be sure to use ``mpirun`` in a ``qsub`` allocation.
-
-Modified, Untested and Disabled Components
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Frameworks and components which are not listed in the following list
-are unmodified and support fault tolerance. Listed frameworks may be
-**modified** (and work after a failure), **untested** (and work before
-a failure, but may malfunction after a failure), or **disabled** (they
-cause unspecified behavior all around when FT is enabled).
-
-All runtime disabled components are listed in the ``ft-mpi`` aggregate
-MCA param file
-``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
-runtime behavior with ULFM by either setting or unsetting variables in
-this file (or by overriding the variable on the command line (e.g.,
-``--mca btl ofi,self``). Note that if fault tolerance is disabled at
-runtime, these components will load normally (this may change observed
-performance when comparing with and without fault tolerance).
-
-* ``pml``: MPI point-to-point management layer
-
-  * ``monitoring``, ``v``: **untested** (they have not been modified
-    to handle faults)
-  * ``cm``, ``crcpw``, ``ucx``: **disabled**
-
-* ``btl``: Point-to-point Byte Transfer Layer
-
-  * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``:
-    **untested** (they may work properly, please report)
-
-* ``mtl``: Matching transport layer Used for MPI point-to-point messages on
-  some types of networks
-
-  * All ``mtl`` components are **disabled**
-
-* ``coll``: MPI collective algorithms
-
-  * ``cuda``, ``inter``, ``sync``, ``sm``: **untested** (they have not
-    been modified to handle faults, but we expect correct post-fault
-    behavior)
-  * ``hcoll``, ``portals4`` **disabled** (they have not been modified
-    to handle faults, and we expect unspecified post-fault behavior)
-
-* ``osc``: MPI one-sided communications
-
-  * All ``osc`` components are **untested** (they have not been
-    modified to handle faults, and we expect unspecified post-fault
-    behavior)
-
-* ``io``: MPI I/O and dependent components
-
-  * ``fs``: File system functions for MPI I/O
-  * ``fbtl``: File byte transfer layer: abstraction for individual
-    read/write operations for OMPIO
-  * ``fcoll``: Collective read and write operations for MPI I/O
-  * ``sharedfp``: Shared file pointer operations for MPI I/O
-  * All components in these frameworks are unmodified, **untested**
-    (we expect clean post-failure abort)
-
-* ``vprotocol``: Checkpoint/Restart components
-
-  * These components have not been modified to handle faults, and are
-    **untested**.
-
-* ``threads``, ``wait-sync``: Multithreaded wait-synchronization
-  object
-
-  * ``argotbots``, ``qthreads``: **disabled** (these components have
-    not been modified to handle faults; we expect post-failure
-    deadlock)
-
+.. note:: ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
+   that if you are going to use ULFM, you should disable building OpenSHMEM
+   with ``--disable-oshmem``.
 
 Running ULFM Open MPI
 ---------------------
@@ -214,22 +141,40 @@ Running your application
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
 You can launch your application with fault tolerance by simply using
-the normal Open MPI ``mpiexec`` launcher, with the
-``--with-ft ulfm`` CLI option:
+the normal Open MPI ``mpirun`` launcher, with the
+``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):
 
 .. code-block::
 
    shell$ mpirun --with-ft ulfm ...
 
+.. important:: By default, fault tolerance is not active at run time.
+   It must be enabled via ``--with-ft ulfm``.
+
 Running under a batch scheduler
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 ULFM can operate under a job/batch scheduler, and is tested routinely
 with ALPS, PBS, and Slurm. One difficulty comes from the fact that
-many job schedulers will "cleanup" the application as soon as any
-process fails. In order to avoid this problem, it is preferred that
-you use ``mpiexec`` within an allocation (e.g., ``salloc``,
-``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).
+many job schedulers handle failures by triggering an immediate "cleanup"
+of the application as soon as any process fails. In addition, failure
+detection subsystems integrated into PRTE are not active in direct launch
+scenarios and may not have a launcher specific alternative. This may cause
+the application to not detect failures and lock. In order to avoid these
+problems, it is preferred that you use ``mpirun`` within an allocation
+(e.g., ``salloc``, ``sbatch``, ``qsub``) rather than a direct launch.
+
+* SLURM is tested and supported with fault tolerance.
+
+  .. important:: Use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.
+     Direct launch with ``srun`` is not supported.
+
+* PBS/Torque is tested and supported with fault tolerance.
+
+  .. important:: Use ``mpirun`` in a ``qsub`` allocation. Direct launch
+     with ``aprun`` is not supported.
+
+* LSF is untested with fault tolerance.
 
 Run-time tuning knobs
 ^^^^^^^^^^^^^^^^^^^^^
@@ -240,12 +185,20 @@ most cases. You can change the default settings with ``--mca
 mpi_ft_foo <value>`` for Open MPI options, and with ``--prtemca
 errmgr_detector_bar <value>`` for PRTE options.
 
+.. important:: The main control for enabling/disabling fault tolerance
+   at runtime is the ``--with-ft ulfm`` (or its synonym ``--with-ft mpi``)
+   ``mpirun`` CLI option. This option sets up multiple subsystems in
+   Open MPI to enable fault tolerance. The options described below are
+   best used to override the default behavior after the ``--with-ft ulfm``
+   option is used.
+
 PRTE level options
 ~~~~~~~~~~~~~~~~~~
 
-* ``prrte_enable_recovery <true|false> (default: false)`` controls
+* ``prrte_enable_ft <true|false> (default: false)`` controls
   automatic cleanup of apps with failed processes within
-  mpirun. Enabling this option also enables ``mpi_ft_enable``.
+  ``mpirun``. This option is automatically set to ``true`` when using
+  ``--with-ft ulfm``.
 * ``errmgr_detector_priority <int> (default 1005``) selects the
   PRRTE-based failure detector. Only available when
   ``prte_enable_recovery`` is ``true``. You can set this to ``0`` when
@@ -263,17 +216,32 @@ PRTE level options
 Open MPI level options
 ~~~~~~~~~~~~~~~~~~~~~~
 
-* ``mpi_ft_enable <true|false> (default: same as
-  prrte_enable_recovery)`` permits turning on/off fault tolerance at
-  runtime. When false, failure detection is disabled; Interfaces
-  defined by the fault tolerance extensions are substituted with dummy
-  non-fault tolerant implementations (e.g., ``MPIX_Comm_agree`` is
-  implemented with ``MPI_Allreduce``); All other controls below become
-  irrelevant.
+Default values are applied to some Open MPI parameters when using
+``mpirun --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
+aggregate MCA param file
+``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
+runtime behavior of ULFM by either setting or unsetting variables in
+this file, or by overriding the variable on the command line (e.g.,
+``--mca btl ofi,self``).
+
+.. important:: Note that if fault tolerance is disabled at runtime,
+   (that is, when not using ``--with-ft ulfm``), the ``ft-mpi`` AMCA
+   param file is not loaded, thus components that are unsafe for fault
+   tolerance will load normally (this may change observed performance
+   when comparing with and without fault tolerance).
+
+* ``mpi_ft_enable <true|false> (default: false)``
+  permits turning on/off fault tolerance at runtime. This option is
+  automatically set to ``true`` from the aggregate MCA param file
+  ``ft-mpi`` loaded when using ``--with-ft ulfm``. When false, failure
+  detection is disabled; Interfaces defined by the fault tolerance extensions
+  are substituted with dummy non-fault tolerant implementations (e.g.,
+  ``MPIX_Comm_agree`` is implemented with ``MPI_Allreduce``); All other
+  controls below become irrelevant.
 * ``mpi_ft_verbose <int> (default: 0)`` increases the output of the
   fault tolerance activities. A value of 1 will report detected
   failures.
-* ``mpi_ft_detector <true|false> (default: false)``, **EXPERIMENTAL**
+* ``mpi_ft_detector <true|false> (default: false)``, **DEPRECATED**
   controls the activation of the Open MPI level failure detector. When
   this detector is turned off, all failure detection is delegated to
   PRTE (see above). The Open MPI level fault detector is
@@ -284,29 +252,113 @@ Open MPI level options
   ``MPI_COMM_WORLD`` exclusively. Processes connected from
   ``MPI_COMM_CONNECT``/``ACCEPT`` and ``MPI_COMM_SPAWN`` may
   occasionally not be detected when they fail.
+
+  .. caution:: This component is deprecated. Failure detection is now
+     performed at the PRTE level. See the section above on controlling
+     PRTE behavior for information about how to tune the failure detector.
+
 * ``mpi_ft_detector_thread <true|false> (default: false)`` controls
   the use of a thread to emit and receive failure detector's
   heartbeats. *Setting this value to "true" will also set
   MPI_THREAD_MULTIPLE support, which has a noticeable effect on
   latency (typically 1us increase).* You may want to **enable this
   option if you experience false positive** processes incorrectly
   reported as failed with the Open MPI failure detector.
+
+  .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
 * ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
   period. Recommended value is 1/3 of the timeout. _Values lower than
   100us may impart a noticeable effect on latency (typically a 3us
   increase)._
+
+  .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
 * ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
   timeout (i.e. failure detection speed). Recommended value is 3 times
   the heartbeat period.
 
+  .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
 Known Limitations in ULFM
-^^^^^^^^^^^^^^^^^^^^^^^^^
+-------------------------
 
 * InfiniBand support is provided through the UCT BTL; fault tolerant
   operation over the UCX PML is not yet supported for production runs.
 * TOPO, FILE, RMA are not fault tolerant. They are expected to work
   properly before the occurrence of the first failure.
 
+Modified, Untested and Disabled Components
+------------------------------------------
+
+Frameworks and components are listed below and categorized into one of
+three classifications:
+
+1. **Modified:** This framework/component has been specifically modified
+   such that it will continue to work after a failure.
+2. **Untested:** This framework/component has not been modified and/or
+   tested with fault tolerance scenarios, and _may_ malfunction
+   after a failure.
+3. **Disabled:** This framework/component will cause unspecified behavior when
+   fault tolerance is enabled. As a consequence, it will be disabled when the
+   ``--with-ft ulfm`` option is used (see above for details about implicit
+   parameters loaded from the ``ft-mpi`` aggregate param file).
+
+Any framework or component not listed below is categorized as **Unmodified**,
+meaning that it is unmodified for fault tolerance, but will continue to work
+correctly after a failure.
+
+* ``pml``: MPI point-to-point management layer
+
+  * ``monitoring``, ``v``: **untested** (they have not been modified to handle
+    faults)
+  * ``cm``, ``crcpw``, ``ucx``: **disabled**
+
+* ``btl``: Point-to-point Byte Transfer Layer
+
+  * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``: **untested**
+    (they may work properly, please report)
+
+* ``mtl``: Matching transport layer Used for MPI point-to-point messages on
+  some types of networks
+
+  * All ``mtl`` components are **disabled**
+
+* ``coll``: MPI collective algorithms
+
+  * ``cuda``, ``inter``, ``sync``, ``sm``: **untested** (they have not
+    been modified to handle faults, but we expect correct post-fault
+    behavior)
+  * ``hcoll``, ``portals4`` **disabled** (they have not been modified
+    to handle faults, and we expect unspecified post-fault behavior)
+
+* ``osc``: MPI one-sided communications
+
+  * All ``osc`` components are **untested** (they have not been
+    modified to handle faults, and we expect unspecified post-fault
+    behavior)
+
+* ``io``: MPI I/O and dependent components
+
+  * ``fs``: File system functions for MPI I/O
+  * ``fbtl``: File byte transfer layer: abstraction for individual
+    read/write operations for OMPIO
+  * ``fcoll``: Collective read and write operations for MPI I/O
+  * ``sharedfp``: Shared file pointer operations for MPI I/O
+  * All components in these frameworks are unmodified, **untested**
+    (we expect clean post-failure abort)
+
+* ``vprotocol``: Checkpoint/Restart components
+
+  * All ``vprotocol`` components are **untested**
+
+* ``threads``, ``wait-sync``: Multithreaded wait-synchronization
+  object
+
+  * ``argotbots``, ``qthreads``: **disabled** (these components have
+    not been modified to handle faults; we expect post-failure
+    deadlock)
+
 Changelog
 ---------
 
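
As a rough illustration of the run-time options documented in the diff above, the commands below launch a fault-tolerant run from inside a Slurm allocation and turn on failure reporting. This is only a sketch: the node count, the process count, and the program name `my-ft-program` are placeholders, and `salloc -N 2` is an ordinary Slurm invocation assumed here for illustration, not something this commit specifies.

```shell
# Obtain an allocation first; the README states that direct launch
# with srun is not supported. (Node count is a placeholder.)
shell$ salloc -N 2

# Enable ULFM at run time with --with-ft ulfm, and set mpi_ft_verbose
# to 1 so that detected failures are reported.
shell$ mpirun -n 4 --with-ft ulfm --mca mpi_ft_verbose 1 my-ft-program
```

Without `--with-ft ulfm`, the README says fault tolerance stays inactive, so the `mpi_ft_*` settings would have no useful effect.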
ompi/runtime/ompi_mpi_params.c

Lines changed: 4 additions & 4 deletions
@@ -406,10 +406,10 @@ int ompi_mpi_register_params(void)
 
 #if OPAL_ENABLE_FT_MPI
     /* Before loading any other part of the MPI library, we need to load
-     * * the ft-mpi tune file to override default component selection when
-     * * FT is desired ON; this does override openmpi-params.conf, but not
-     * * command line or env.
-     * */
+     * the ft-mpi tune file to override default component selection when
+     * FT is desired ON; this does override openmpi-params.conf, but not
+     * command line or env.
+     */
     if( ompi_ftmpi_enabled ) {
         mca_base_var_load_extra_files("ft-mpi", false);
     }
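
The comment in this hunk says the `ft-mpi` tune file overrides `openmpi-params.conf` but not the command line or the environment. A minimal sketch of what that precedence looks like from the user's side follows. It assumes Open MPI's usual `OMPI_MCA_<name>` environment variable convention, which this commit does not show, and reuses the `--mca btl ofi,self` override quoted in the README; the process count and program name are placeholders.

```shell
# Defaults from the ft-mpi file are loaded because --with-ft ulfm is given.
shell$ mpirun -n 4 --with-ft ulfm my-ft-program

# A command-line --mca setting is not overridden by the ft-mpi file.
shell$ mpirun -n 4 --with-ft ulfm --mca btl ofi,self my-ft-program

# An environment setting is likewise not overridden by the ft-mpi file
# (OMPI_MCA_ prefix assumed, see the note above).
shell$ OMPI_MCA_mpi_ft_verbose=1 mpirun -n 4 --with-ft ulfm my-ft-program
```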
