Retire the FAQ section of the docs, moving its content to other locations #11531

Merged — 3 commits merged into open-mpi:main on Apr 20, 2023

Conversation

@qkoziol (Contributor) commented Mar 24, 2023:

Log of changes for moving FAQ questions to other locations

Non-FAQ section changes:

  • Corrected MPI sub-version rendering bug in section 3.5.1 (MPI Standard Compliance) by creating and using a new 'mpi_standard_full_version' substitution in conf.py
  • Added strikethru_start and strikethru_end substitutions in conf.py (a sketch of both substitutions follows this list)
  • Broke out updating/upgrading an Open MPI installation from within section 4.11.2 (Installing over a prior Open MPI installation) into a new section 4.12 (Updating or Upgrading an Open MPI installation)
  • s/ackwards/ackward/g in section 7 (Version numbers and compatibility)
  • Added link to section 8.3 (Setting MCA parameter values) to section 3.4 (General Run-Time Support Notes)
  • Added new section 10.4 (Scheduling processes across hosts)
  • Added new section 10.11 (Unusual jobs) to section 10 (Launching MPI applications)
  • Added new section 10.12 (Troubleshooting) to section 10 (Launching MPI applications)
  • Changed title of section 11 from (Run-time tuning MPI applications) to (Run-time operation and tuning MPI applications)
  • Added new subsection 11.4 (Fault tolerance) to section 11 (Run-time operation and tuning MPI applications)
  • Added new subsection 11.5 (Large clusters) to section 11 (Run-time operation and tuning MPI applications)
  • Added new subsection 11.6 (Processor and memory affinity) to section 11 (Run-time operation and tuning MPI applications)
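
The conf.py substitutions mentioned above are ordinary Sphinx/docutils substitutions. As a minimal sketch only — assuming they are injected through rst_prolog, and noting that only the substitution names come from this PR while the helper variable names and version numbers below are illustrative assumptions, not the actual conf.py contents — they could be defined roughly like this:

```python
# conf.py -- hypothetical sketch, not the actual Open MPI conf.py.
# Only the substitution names (mpi_standard_full_version, strikethru_start,
# strikethru_end) are taken from this PR; everything else is illustrative.

mpi_standard_major_version = 3   # assumed example values
mpi_standard_minor_version = 1

# rst_prolog is prepended to every .rst file that Sphinx processes, so these
# substitutions become available throughout the docs.
rst_prolog = f"""
.. |mpi_standard_full_version| replace:: {mpi_standard_major_version}.{mpi_standard_minor_version}

.. |strikethru_start| raw:: html

   <strike>

.. |strikethru_end| raw:: html

   </strike>
"""
```

With definitions along these lines, |mpi_standard_full_version| renders as a single version string in the .rst sources, and |strikethru_start| ... |strikethru_end| wrap text in HTML strikethrough markup.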

FAQ section changes:

Supported Systems: (13.1)

  • Moved 13.1.1 (What operating systems does Open MPI support?), 13.1.2 (What hardware platforms does Open MPI support?), and 13.1.3 (What network interconnects does Open MPI support?) into Section 4 (Building and Installing Open MPI) as new section 4.2 (Supported Systems), between previous sections 4.1 and 4.2.
  • Moved 13.1.4 (How does Open MPI interface to back-end run-time systems?) to the top of section 10.3 (The role of PMIx and PRRTE).
  • Moved 13.1.5 (What run-time environments does Open MPI support?) to the top of section 3.2 (Platform Notes)
  • Deleted 13.1.6 (How much MPI does Open MPI support?), as it duplicates information in section 3.5.1 (MPI Standard Compliance)
  • Moved 13.1.7 (Is Open MPI thread safe?) to section 9 (Building MPI Applications) as new section 9.7.
  • Moved 13.1.8 (Does Open MPI support 64 bit environments?) to section 9 (Building MPI Applications) as new section 9.8.
  • Moved 13.1.9 (Does Open MPI support execution in heterogeneous environments?) to section 9 (Building MPI Applications) as new section 9.9.
  • Moved 13.1.10 (Does Open MPI support parallel debuggers?) to section 12 (Debugging Open MPI Parallel Applications) as new section 12.4, between previous sections 12.3 and 12.4.

System administrator-level technical information: (13.2)

  • Moved 13.2.1 (I’m a sysadmin; what do I care about Open MPI?) to section 4 (Building and installing Open MPI) as new section 4.14 (Advice for System Administrators)
  • Moved 13.2.2 (Do I need multiple Open MPI installations?) to the end of section 4.11 (Installation Location), as new subsection 4.11.4 (Installing Multiple Copies of Open MPI).
  • Moved 13.2.3 (What are MCA Parameters? Why would I set them?) to section 4 (Building and installing Open MPI) as 4.14.1 (Setting Global MCA Parameters), within new section 4.14 (Advice for System Administrators)
  • Moved 13.2.4 (Do my users need to have their own installation of Open MPI?) to section 4 (Building and installing Open MPI) as 4.14.5 (User customization of a global Open MPI installation), within section 4.14 (Advice for System Administrators)
  • Deleted 13.2.5 (I have power users who will want to override my global MCA parameters; is this possible?), as the information is already incorporated into new section 4.14.5 (User customization of a global Open MPI installation).
  • Moved 13.2.6 (What MCA parameters should I, the system administrator, set?) to section 4 (Building and installing Open MPI) as 4.14.2 (Setting MCA Parameters for a Global Open MPI installation), within section 4.14 (Advice for System Administrators)
  • Moved 13.2.7 (I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?) to section 4 (Building and installing Open MPI) as 4.14.3 (Adding a new plugin to a global Open MPI installation), within section 4.14 (Advice for System Administrators)
  • Moved 13.2.8 (I just upgraded my InfiniBand network; do I need to recompile all my MPI apps?) to section 4 (Building and installing Open MPI) as 4.14.4 (Upgrading network hardware with a global Open MPI installation), within section 4.14 (Advice for System Administrators)
  • Moved 13.2.9 (We just upgraded our version of Open MPI; do I need to recompile all my MPI apps?) into new section 4.12 (Updating or Upgrading an Open MPI installation)
  • Moved 13.2.10 (I have an MPI application compiled for another MPI; will it work with Open MPI?) to be a warning at the top of section 9 (Building MPI applications)

Building Open MPI: (13.3)

  • Moved 13.3.1 (How do I statically link to the libraries of Intel compiler suite?) to section 4 (Building and installing Open MPI) as new section 4.6.1 (Statically linking to the libraries of Intel compiler suite), within section 4.6 (Specifying Compilers and flags)
  • Moved 13.3.2 (Why do I get errors about hwloc or libevent not found?) to section 4 (Building and installing Open MPI) as new section 4.7.5 (Difficulties with C and Fortran), within section 4.7 (Required support libraries)

Running MPI Applications: (13.4)

  • Moved / integrated content from 13.4.1 (What prerequisites are necessary for running an Open MPI job?) into section 10.2 (Prerequisites)
  • Moved 13.4.2 (What ABI guarantees does Open MPI provide?) into section 7 (Version numbers and backward compatibility) as new section 7.2 (Application Binary Interface (ABI) Compatibility)
  • Moved / integrated content from 13.4.3 (Do I need a common filesystem on all my nodes?) into the first few paragraphs of section 4.11 (Installation location)
  • Moved 13.4.4 (How do I add Open MPI to my PATH and LD_LIBRARY_PATH?) into section 10.2 (Prerequisites) as new section 10.2.1 (Adding Open MPI to PATH and LD_LIBRARY_PATH)
  • Moved 13.4.5 (What if I can’t modify my PATH and/or LD_LIBRARY_PATH?) into section 10.2 (Prerequisites) as new section 10.2.2 (Using the --prefix option with mpirun)
  • Integrated 13.4.6 (How do I launch Open MPI parallel jobs?) into the first few paragraphs of section 10 (Launching MPI applications)
  • Integrated 13.4.7 (How do I run a simple SPMD MPI job?) into sections 10.1.2 (Launching on a single host) and 10.1.3 (Launching in a non-scheduled environments (via ssh))
  • Moved 13.4.8 (How do I run an MPMD MPI job?) to new subsection 10.11.4 (Launching an MPMD MPI job) in section 10.11 (Unusual jobs)
  • Moved 13.4.9 (How do I specify the hosts on which my MPI job runs?) to new subsection 10.6.1 (Specifying the hosts for an MPI job), in section 10.6 (Launching with SSH)
  • Moved 13.4.10 (How can I diagnose problems when running across multiple hosts?) to new subsection 10.12.3 (Problems when running across multiple hosts) in section 10.12 (Troubleshooting)
  • Moved 13.4.11 (I get errors about missing libraries. What should I do?) to new subsection 10.12.2 (Errors about missing libraries) in section 10.12 (Troubleshooting)
  • Moved 13.4.12 (Can I run non-MPI programs with mpirun / mpiexec?) to new subsection 10.11.1 (Running non-MPI programs) in section 10.11 (Unusual jobs)
  • Moved 13.4.13 (Can I run GUI applications with Open MPI?) to new subsection 10.11.2 (Running GUI applications) in section 10.11 (Unusual jobs)
  • Moved 13.4.14 (Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?) to new subsection 10.11.3 (Running curses-based applications) in section 10.11 (Unusual jobs)
  • Moved 13.4.15 (What other options are available to mpirun?) to new subsection 10.1.1.1 (Other mpirun options) in section 10.1 (Quick start: Launching MPI applications)
  • Moved 13.4.16 (How do I use the --hostfile option to mpirun?) to new subsection 10.4.2 (Scheduling with the --hostfile option) in section 10.4 (Scheduling processes across hosts)
  • Moved 13.4.17 (How do I use the --host option to mpirun?) to new subsection 10.4.3 (Scheduling with the --host option) in section 10.4 (Scheduling processes across hosts)
  • Moved 13.4.18 (What are “slots”?) to new subsection 10.4.4 (Process slots) in section 10.4 (Scheduling processes across hosts)
  • Moved 13.4.19 (How are the number of slots calculated?) to new subsection 10.4.4.1 (Calculating the number of slots) in section 10.4 (Scheduling processes across hosts)
  • Moved 13.4.20 (How do I control how my processes are scheduled across hosts?) to new subsection 10.4.1 (Scheduling overview) in section 10.4 (Scheduling processes across hosts)
  • Moved 13.4.21 (Can I oversubscribe nodes (run more processes than processors)?) to new subsection 10.4.5 (Oversubscribing nodes) in section 10.4 (Scheduling processes across hosts)
  • Moved 13.4.22 (Can I force Aggressive or Degraded performance modes?) to new subsection 10.4.5.1 (Forcing aggressive or degraded performance mode) in section 10.4 (Scheduling processes across hosts)
  • Moved 13.4.23 (How do I run with the TotalView parallel debugger?) to new section 12.4 (Using Parallel Debuggers to Debug Open MPI Applications)
  • Moved 13.4.24 (How do I run with the DDT parallel debugger?) to new section 12.4 (Using Parallel Debuggers to Debug Open MPI Applications)
  • Moved 13.4.25 (How do I dynamically load libmpi at runtime?) to new subsection 9.10 (Dynamically loading libmpi at runtime) in section 9 (Building MPI applications)
  • Moved 13.4.26 (What MPI environment variables exist?) to new subsection 11.1 (Environment variables set for MPI applications) in section 11 (Run-time operation and tuning MPI applications)

Fault Tolerance: (13.5)

  • Moved 13.5.1 (What is “fault tolerance”?) to the opening of new subsection 11.4 (Fault tolerance) in section 11 (Run-time operation and tuning MPI applications)
  • Moved 13.5.2 (What fault tolerance techniques has / does / will Open MPI support?) to new subsection 11.4.1 (Supported fault tolerance techniques) in section 11.4 (Fault Tolerance)
  • Moved 13.5.3 (Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?) to new subsection 11.4.2 (Checkpoint and restart of parallel jobs) in section 11.4 (Fault Tolerance)
  • Moved 13.5.4 (Where can I find the fault tolerance development work?) to new subsection 11.4.1.1 (Current fault tolerance development) in section 11.4 (Fault tolerance)
  • Moved 13.5.5 (Does Open MPI support end-to-end data reliability in MPI message passing?) to new subsection 11.4.3 (End-to-end data reliability for MPI messages) in section 11.4 (Fault Tolerance)

Troubleshooting: (13.6)

  • Moved 13.6.1 (Messages about missing symbols) to new subsection 10.12.1 (Messages about missing symbols when running my application) in section 10.12 (Troubleshooting)
  • Deleted 13.6.2 (How do I attach a parallel debugger to my MPI job?), as it's covered in section 12 (Debugging Open MPI Parallel Applications)
  • Moved 13.6.3 (How do I find out what MCA parameters are being seen/used by my job?) into Section 8 (The Modular Component Architecture), as new section 8.3, between previous sections 8.2 and 8.3.

Large Clusters: (13.7)

  • Moved 13.7.1 (How do I reduce startup time for jobs on large clusters?) to new subsection 11.5.1 (Reducing startup time for jobs on large clusters) in section 11.5 (Large clusters)
  • Moved 13.7.2 (Where should I put my libraries: Network vs. local filesystems?) to new subsection 11.5.2 (Library location: network vs. local filesystems) in section 11.5 (Large clusters)
  • Moved 13.7.3 (Static vs. shared libraries?) to new subsection 11.5.2.1 (Static vs. shared libraries) in section 11.5 (Large clusters)
  • Moved 13.7.4 (How do I reduce the time to wireup OMPI’s out-of-band communication system?) to new subsection 11.5.3 (Reducing wireup time) in section 11.5 (Large clusters)
  • Moved 13.7.5 (I know my cluster’s configuration - how can I take advantage of that knowledge?) to new subsection 11.5.4 (Static cluster configurations) in section 11.5 (Large clusters)

General Tuning: (13.8)

  • Moved 13.8.1 (How do I install my own components into an Open MPI installation?) to new subsection 11.2 (Installing custom components) in section 11 (Run-time operation and tuning MPI applications)
  • Moved 13.8.2 (What is processor affinity? Does Open MPI support it?) to new subsection 11.6.1 (Processor affinity) in section 11.6 (Processor and memory affinity)
  • Moved 13.8.3 (What is memory affinity? Does Open MPI support it?) to new subsection 11.6.2 (Memory affinity) in section 11.6 (Processor and memory affinity)
  • Moved 13.8.4 (How do I tell Open MPI to use processor and/or memory affinity?) to new subsection 11.6.3 (Enabling processor and memory affinity) in section 11.6 (Processor and memory affinity)
  • Moved 13.8.5 (Does Open MPI support calling fork(), system(), or popen() in MPI processes?) to new subsection 9.11 (Calling fork(), system(), or popen() in MPI processes?) in section 9 (Building MPI applications)
  • Moved 13.8.6 (I want to run some performance benchmarks with Open MPI. How do I do that?) to new subsection 11.8 (Benchmarking Open MPI applications) in section 11 (Run-time operation and tuning MPI applications)
  • Deleted 13.8.7 (I am getting a MPI_WIN_FREE error from IMB-EXT — what do I do?), as it's about a buggy version of the Intel MPI benchmarks from 2009.

@edgargabriel (Member) left a comment:

High level structure looks good in my opinion, just a few minor comments.

@abouteiller (Member) left a comment:

Link to the main ULFM FT documentation from the new fault-tolerance subsection, remove some outdated references to ulfm bitbucket.

@qkoziol requested a review from abouteiller, April 1, 2023 19:59
qkoziol and others added 2 commits April 7, 2023 12:57
Move all the content from the legacy FAQ into other parts of the docs, updating much (but not all) of it at the same time.

Signed-off-by: Quincey Koziol <[email protected]>
Co-authored-by: Aurelien Bouteiller <[email protected]>
@jsquyres (Member) left a comment:

Talked with @qkoziol in Slack; with his blessing, I squashed all his commits into 1 and rebased to the HEAD of main. Rather than make a million text suggestions here in the PR, I just made all my suggestions in a new, 2nd commit. If these suggestions are amenable, we can squash this 2nd commit into the original commit and then merge this PR.

@@ -1,3 +1,5 @@
.. _using-mpir-based-tools-label:

Member:

Do we know that the example given in this file of how to build the MPIR shim is still correct?

Should we just point people to the MPIR shim instructions (vs. giving our own example of how to build it)?

Contributor (PR author):

I will check if this is correct on Monday (4/10).

Comment on lines 15 to 22
* |strikethru_start| Message logging techniques. Similar to those
implemented in MPICH-V |strikethru_end|
* |strikethru_start| Data Reliability and network fault tolerance. Similar
to those implemented in LA-MPI |strikethru_end|
Member:

What does it mean that these are crossed out?

Contributor (PR author):

I don't know; they were there in your original conversion of the file to RST: e4db776#diff-58f725531a70e71f53629d75173f2fe204dd7fb393584762fce5b64526545fae

Contributor (PR author):

I've tried to determine where that info came from before your commit, but I can't find matching content anywhere in the repo as it existed before your commit. Was that file newly added for the switchover to RST?

Member:

Looks like you pulled it from faq/fault-tolerance.rst (where it used <strike> / </strike>, not the new macros).

@abouteiller What does it mean that these 2 items are struck out in the FT doc?

Contributor (PR author):

Yes, that's exactly true. However, I don't know why they were originally crossed out. If that's needed, I could use help determining what original file was used as input for creating faq/fault-tolerance.rst... Hopefully @abouteiller can tell us... :-)

Member:

The features are not available for general use:

  1. At all, in the case of data reliability (this was removed when the PML DR component was removed in d692aba).
  2. For production use, in the case of the pml/v component, which is suitable only for research.

Member:

Ok.

  • For 1, can we remove the bullet altogether?
  • For 2, can we not strike it out, but rather say that it's only for research usage, ... etc.?

I think the strikeout is ambiguous as to what it actually means.

Once the strikeout is removed, those 2 new macros should be removed, too.

@jsquyres (Member) commented Apr 13, 2023:

@abouteiller In an effort to keep this PR moving, I pushed a 3rd commit as a suggestion for this FT content. Can you review? You can see the result here: https://ompi--11531.org.readthedocs.build/en/11531/tuning-apps/fault-tolerance/supported.html

If acceptable, all the commits in this PR should be squashed so that we can merge this PR.

@wckzhang merged commit da6d715 into open-mpi:main on Apr 20, 2023
@jsquyres (Member) commented:

Just so that it's in the record: I'm annoyed that this PR was merged without squashing the 2 commits that a) I explicitly asked to be squashed, and b) are explicitly labeled "SQUASHME".
