@@ -120,12 +120,9 @@ runtime).
Optionally, you can specify ``--with-ft ulfm`` to ensure that ULFM support
is definitely built.

- Support notes
- ^^^^^^^^^^^^^
-
- * ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
- that if you are going to use ULFM, you should disable building
- OpenSHMEM with ``--disable-oshmem``.
+ .. note:: ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
+ that if you are going to use ULFM, you should disable building OpenSHMEM
+ with ``--disable-oshmem``.

Running ULFM Open MPI
---------------------
@@ -151,30 +148,33 @@ the normal Open MPI ``mpirun`` launcher, with the

shell$ mpirun --with-ft ulfm ...

- .. important:: By default, fault tolerance is not active at run time.
- It must be enabled via `--with-ft ulfm`.
+ .. important:: By default, fault tolerance is not active at run time.
+ It must be enabled via ``--with-ft ulfm``.

Running under a batch scheduler
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ULFM can operate under a job/batch scheduler, and is tested routinely
with ALPS, PBS, and Slurm. One difficulty comes from the fact that
- many job schedulers will "cleanup" the application as soon as any
- process fails. In order to avoid this problem, it is preferred that
- you use ``mpirun`` within an allocation (e.g., ``salloc``,
- ``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).
+ many job schedulers handle failures by triggering an immediate "cleanup"
+ of the application as soon as any process fails. In addition, failure
+ detection subsystems integrated into PRTE are not active in direct launch
+ scenarios and may not have a launcher-specific alternative. This may cause
+ the application to not detect failures and lock. In order to avoid these
+ problems, it is preferred that you use ``mpirun`` within an allocation
+ (e.g., ``salloc``, ``sbatch``, ``qsub``) rather than a direct launch.

* SLURM is tested and supported with fault tolerance.

- .. important:: Do not use ``srun``, or your application gets killed
- by the scheduler upon the first failure. Instead,
- use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.
-
- * LSF is untested with fault tolerance.
+ .. important:: Use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.
+ Direct launch with ``srun`` is not supported.

* PBS/Torque is tested and supported with fault tolerance.

- .. important:: Be sure to use ``mpirun`` in a ``qsub`` allocation.
+ .. important:: Use ``mpirun`` in a ``qsub`` allocation. Direct launch
+ with ``aprun`` is not supported.
+
+ * LSF is untested with fault tolerance.

Run-time tuning knobs
^^^^^^^^^^^^^^^^^^^^^
@@ -185,13 +185,12 @@ most cases. You can change the default settings with ``--mca
mpi_ft_foo <value>`` for Open MPI options, and with ``--prtemca
errmgr_detector_bar <value>`` for PRTE options.

- .. important:: The main control for enabling/disabling fault tolerance
- at runtime is the ``--with-ft ulfm`` (or its synonym
- ``--with-ft mpi``) ``mpirun`` CLI option. This option
- sets up multiple subsystems in Open MPI to enable fault
- tolerance. The options described below are best used to
- override the default behavior after the ``--with-ft ulfm``
- option is used.
+ .. important:: The main control for enabling/disabling fault tolerance
+ at runtime is the ``--with-ft ulfm`` (or its synonym ``--with-ft mpi``)
+ ``mpirun`` CLI option. This option sets up multiple subsystems in
+ Open MPI to enable fault tolerance. The options described below are
+ best used to override the default behavior after the ``--with-ft ulfm``
+ option is used.

PRTE level options
~~~~~~~~~~~~~~~~~~
@@ -225,12 +224,11 @@ runtime behavior of ULFM by either setting or unsetting variables in
this file, or by overriding the variable on the command line (e.g.,
``--mca btl ofi,self``).

- .. important:: Note that if fault tolerance is disabled at runtime,
- (that is, when not using ``--with-ft ulfm``), the
- ``ft-mpi`` AMCA param file is not loaded, thus
- components that are unsafe for fault tolerance will
- load normally (this may change observed performance
- when comparing with and without fault tolerance).
+ .. important:: Note that if fault tolerance is disabled at runtime,
+ (that is, when not using ``--with-ft ulfm``), the ``ft-mpi`` AMCA
+ param file is not loaded, thus components that are unsafe for fault
+ tolerance will load normally (this may change observed performance
+ when comparing with and without fault tolerance).

* ``mpi_ft_enable <true|false> (default: false)``
permits turning on/off fault tolerance at runtime. This option is
@@ -254,23 +252,33 @@ this file, or by overriding the variable on the command line (e.g.,
``MPI_COMM_WORLD`` exclusively. Processes connected from
``MPI_COMM_CONNECT``/``ACCEPT`` and ``MPI_COMM_SPAWN`` may
occasionally not be detected when they fail.
+
+ .. caution:: This component is deprecated. Failure detection is now
+ performed at the PRTE level. See the section above on controlling
+ PRTE behavior for information about how to tune the failure detector.
+
* ``mpi_ft_detector_thread <true|false> (default: false)`` controls
the use of a thread to emit and receive failure detector's
heartbeats. *Setting this value to "true" will also set
MPI_THREAD_MULTIPLE support, which has a noticeable effect on
latency (typically 1us increase).* You may want to **enable this
option if you experience false positive** processes incorrectly
reported as failed with the Open MPI failure detector.
- This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
* ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
period. Recommended value is 1/3 of the timeout. _Values lower than
100us may impart a noticeable effect on latency (typically a 3us
increase)._
- This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
* ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
timeout (i.e. failure detection speed). Recommended value is 3 times
the heartbeat period.
- This option is only relevant when ``mpi_ft_detector`` is ``true``.
+
+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.

Known Limitations in ULFM
-------------------------
@@ -287,27 +295,27 @@ Frameworks and components are listed below and categorized into one of
three classifications:

1. **Modified:** This framework/component has been specifically modified
- such that it will continue to work after a failure.
+ such that it will continue to work after a failure.
2. **Untested:** This framework/component has not been modified and/or
- tested with fault tolerance scenarios, and _may_ malfunction
- after a failure.
+ tested with fault tolerance scenarios, and _may_ malfunction
+ after a failure.
3. **Disabled:** This framework/component will cause unspecified behavior when
- fault tolerance is enabled.
+ fault tolerance is enabled.

Any framework or component not listed below is categorized as **Unmodified**,
meaning that it is unmodified for fault tolerance, but will continue to work
correctly after a failure.

* ``pml``: MPI point-to-point management layer

- * ``monitoring``, ``v``: **untested** (they have not been modified
- to handle faults)
+ * ``monitoring``, ``v``: **untested** (they have not been modified to handle
+ faults)
* ``cm``, ``crcpw``, ``ucx``: **disabled**

* ``btl``: Point-to-point Byte Transfer Layer

- * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``:
- **untested** (they may work properly, please report)
+ * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``: **untested**
+ (they may work properly, please report)

* ``mtl``: Matching transport layer, used for MPI point-to-point messages on
some types of networks
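The heartbeat guidance in the ``mpi_ft_detector_period`` and ``mpi_ft_detector_timeout`` entries above (a period of roughly one third of the timeout) can be illustrated with a small helper that derives the period and assembles the matching ``mpirun`` arguments. This is only a sketch and not part of Open MPI: the function name ``ft_detector_args`` and the application path ``./my_app`` are hypothetical, and the diff itself marks the MPI-level detector as deprecated in favor of PRTE-level detection, so treat it as a worked example of the arithmetic rather than a recommended configuration.

```python
def ft_detector_args(timeout_s):
    """Build mpirun arguments for the MPI-level failure detector.

    Hypothetical helper: applies the documented rule of thumb that the
    heartbeat period should be one third of the heartbeat timeout.
    """
    if timeout_s <= 0:
        raise ValueError("timeout must be positive")
    period_s = timeout_s / 3.0  # recommended: period = timeout / 3
    return [
        "--with-ft", "ulfm",                      # enable fault tolerance at run time
        "--mca", "mpi_ft_detector", "true",       # detector options only apply when this is true
        "--mca", "mpi_ft_detector_timeout", f"{timeout_s:g}",
        "--mca", "mpi_ft_detector_period", f"{period_s:g}",
    ]


if __name__ == "__main__":
    # Print the command line instead of running it; ./my_app is a placeholder.
    print("mpirun", *ft_detector_args(9.0), "./my_app")
```

Printing the command rather than executing it keeps the sketch independent of any particular scheduler or installation; the resulting line would still need to be run with ``mpirun`` inside an allocation, as the batch scheduler notes above describe.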