Retire the FAQ section of the docs, moving its content to other locations #11531
Conversation
High level structure looks good in my opinion, just a few minor comments.
Link to the main ULFM FT documentation from the new fault-tolerance subsection; remove some outdated references to the ULFM Bitbucket.
Move all the content from the legacy FAQ into other parts of the docs, updating much (but not all) of it at the same time.

Log of changes for moving FAQ questions to other locations
----------------------------------------------------------

Non-FAQ section changes:

- Corrected MPI sub-version rendering bug in section 3.5.1 (MPI Standard Compliance) by creating and using a new 'mpi_standard_full_version' substitution in conf.py
- Added strikethru_start and strikethru_end substitutions in conf.py
- Broke out updating/upgrading an Open MPI installation from within section 4.11.2 (Installing over a prior Open MPI installation) into a new section 4.12 (Updating or Upgrading an Open MPI installation)
- s/ackwards/ackward/g in section 7 (Version numbers and compatibility)
- Added link to section 8.3 (Setting MCA parameter values) to section 3.4 (General Run-Time Support Notes)
- Added new section 10.4 (Scheduling processes across hosts)
- Added new section 10.11 (Unusual jobs) to section 10 (Launching MPI applications)
- Added new section 10.12 (Troubleshooting) to section 10 (Launching MPI applications)
- Changed title of section 11 from (Run-time tuning MPI applications) to (Run-time operation and tuning MPI applications)
- Added new subsection 11.4 (Fault tolerance) to section 11 (Run-time operation and tuning MPI applications)
- Added new subsection 11.5 (Large clusters) to section 11 (Run-time operation and tuning MPI applications)
- Added new subsection 11.6 (Processor and memory affinity) to section 11 (Run-time operation and tuning MPI applications)

FAQ section changes:

Supported Systems: (13.1)

- Moved 13.1.1 (What operating systems does Open MPI support?), 13.1.2 (What hardware platforms does Open MPI support?), and 13.1.3 (What network interconnects does Open MPI support?) into section 4 (Building and Installing Open MPI) as new section 4.2 (Supported Systems), between previous sections 4.1 and 4.2.
- Moved 13.1.4 (How does Open MPI interface to back-end run-time systems?) to the top of section 10.3 (The role of PMIx and PRRTE).
- Moved 13.1.5 (What run-time environments does Open MPI support?) to the top of section 3.2 (Platform Notes)
- Deleted 13.1.6 (How much MPI does Open MPI support?), as it duplicates information in section 3.5.1 (MPI Standard Compliance)
- Moved 13.1.7 (Is Open MPI thread safe?) to section 9 (Building MPI Applications) as new section 9.7.
- Moved 13.1.8 (Does Open MPI support 64 bit environments?) to section 9 (Building MPI Applications) as new section 9.8.
- Moved 13.1.9 (Does Open MPI support execution in heterogeneous environments?) to section 9 (Building MPI Applications) as new section 9.9.
- Moved 13.1.10 (Does Open MPI support parallel debuggers?) to section 12 (Debugging Open MPI Parallel Applications) as new section 12.4, between previous sections 12.3 and 12.4.

System administrator-level technical information: (13.2)

- Moved 13.2.1 (I'm a sysadmin; what do I care about Open MPI?) to section 4 (Building and installing Open MPI) as new section 4.14 (Advice for System Administrators)
- Moved 13.2.2 (Do I need multiple Open MPI installations?) to the end of section 4.11 (Installation Location), as new subsection 4.11.4 (Installing Multiple Copies of Open MPI).
- Moved 13.2.3 (What are MCA Parameters? Why would I set them?) to section 4 (Building and installing Open MPI) as 4.14.1 (Setting Global MCA Parameters), within new section 4.14 (Advice for System Administrators)
- Moved 13.2.4 (Do my users need to have their own installation of Open MPI?) to section 4 (Building and installing Open MPI) as 4.14.5 (User customization of a global Open MPI installation), within section 4.14 (Advice for System Administrators)
- Deleted 13.2.5 (I have power users who will want to override my global MCA parameters; is this possible?), as the information is already incorporated into new section 4.14.5 (User customization of a global Open MPI installation).
- Moved 13.2.6 (What MCA parameters should I, the system administrator, set?) to section 4 (Building and installing Open MPI) as 4.14.2 (Setting MCA Parameters for a Global Open MPI installation), within section 4.14 (Advice for System Administrators)
- Moved 13.2.7 (I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?) to section 4 (Building and installing Open MPI) as 4.14.3 (Adding a new plugin to a global Open MPI installation), within section 4.14 (Advice for System Administrators)
- Moved 13.2.8 (I just upgraded my InfiniBand network; do I need to recompile all my MPI apps?) to section 4 (Building and installing Open MPI) as 4.14.4 (Upgrading network hardware with a global Open MPI installation), within section 4.14 (Advice for System Administrators)
- Moved 13.2.9 (We just upgraded our version of Open MPI; do I need to recompile all my MPI apps?) into new section 4.12 (Updating or Upgrading an Open MPI installation)
- Moved 13.2.10 (I have an MPI application compiled for another MPI; will it work with Open MPI?) to be a warning at the top of the section 9 (Building MPI applications) page

Building Open MPI: (13.3)

- Moved 13.3.1 (How do I statically link to the libraries of Intel compiler suite?) to section 4 (Building and installing Open MPI) as new section 4.6.1 (Statically linking to the libraries of Intel compiler suite), within section 4.6 (Specifying Compilers and flags)
- Moved 13.3.2 (Why do I get errors about hwloc or libevent not found?) to section 4 (Building and installing Open MPI) as new section 4.7.5 (Difficulties with C and Fortran), within section 4.7 (Required support libraries)

Running MPI Applications: (13.4)

- Moved / integrated content from 13.4.1 (What prerequisites are necessary for running an Open MPI job?) into section 10.2 (Prerequisites)
- Moved 13.4.2 (What ABI guarantees does Open MPI provide?) into section 7 (Version numbers and backward compatibility) as new section 7.2 (Application Binary Interface (ABI) Compatibility)
- Moved / integrated content from 13.4.3 (Do I need a common filesystem on all my nodes?) into the first few paragraphs of section 4.11 (Installation location)
- Moved 13.4.4 (How do I add Open MPI to my PATH and LD_LIBRARY_PATH?) into section 10.2 (Prerequisites) as new section 10.2.1 (Adding Open MPI to PATH and LD_LIBRARY_PATH)
- Moved 13.4.5 (What if I can't modify my PATH and/or LD_LIBRARY_PATH?) into section 10.2 (Prerequisites) as new section 10.2.2 (Using the --prefix option with mpirun)
- Integrated 13.4.6 (How do I launch Open MPI parallel jobs?) into the first few paragraphs of section 10 (Launching MPI applications)
- Integrated 13.4.7 (How do I run a simple SPMD MPI job?) into sections 10.1.2 (Launching on a single host) and 10.1.3 (Launching in a non-scheduled environment (via ssh))
- Moved 13.4.8 (How do I run an MPMD MPI job?) to new subsection 10.11.4 (Launching an MPMD MPI job) in section 10.11 (Unusual jobs)
- Moved 13.4.9 (How do I specify the hosts on which my MPI job runs?) to new subsection 10.6.1 (Specifying the hosts for an MPI job), in section 10.6 (Launching with SSH)
- Moved 13.4.10 (How can I diagnose problems when running across multiple hosts?) to new subsection 10.12.3 (Problems when running across multiple hosts) in section 10.12 (Troubleshooting)
- Moved 13.4.11 (I get errors about missing libraries. What should I do?) to new subsection 10.12.2 (Errors about missing libraries) in section 10.12 (Troubleshooting)
- Moved 13.4.12 (Can I run non-MPI programs with mpirun / mpiexec?) to new subsection 10.11.1 (Running non-MPI programs) in section 10.11 (Unusual jobs)
- Moved 13.4.13 (Can I run GUI applications with Open MPI?) to new subsection 10.11.2 (Running GUI applications) in section 10.11 (Unusual jobs)
- Moved 13.4.14 (Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?) to new subsection 10.11.3 (Running curses-based applications) in section 10.11 (Unusual jobs)
- Moved 13.4.15 (What other options are available to mpirun?) to new subsection 10.1.1.1 (Other mpirun options) in section 10.1 (Quick start: Launching MPI applications)
- Moved 13.4.16 (How do I use the --hostfile option to mpirun?) to new subsection 10.4.2 (Scheduling with the --hostfile option) in section 10.4 (Scheduling processes across hosts)
- Moved 13.4.17 (How do I use the --host option to mpirun?) to new subsection 10.4.3 (Scheduling with the --host option) in section 10.4 (Scheduling processes across hosts)
- Moved 13.4.18 (What are "slots"?) to new subsection 10.4.4 (Process slots) in section 10.4 (Scheduling processes across hosts)
- Moved 13.4.19 (How are the number of slots calculated?) to new subsection 10.4.4.1 (Calculating the number of slots) in section 10.4 (Scheduling processes across hosts)
- Moved 13.4.20 (How do I control how my processes are scheduled across hosts?) to new subsection 10.4.1 (Scheduling overview) in section 10.4 (Scheduling processes across hosts)
- Moved 13.4.21 (Can I oversubscribe nodes (run more processes than processors)?) to new subsection 10.4.5 (Oversubscribing nodes) in section 10.4 (Scheduling processes across hosts)
- Moved 13.4.22 (Can I force Aggressive or Degraded performance modes?) to new subsection 10.4.5.1 (Forcing aggressive or degraded performance mode) in section 10.4 (Scheduling processes across hosts)
- Moved 13.4.23 (How do I run with the TotalView parallel debugger?) to new section 12.4 (Using Parallel Debuggers to Debug Open MPI Applications)
- Moved 13.4.24 (How do I run with the DDT parallel debugger?) to new section 12.4 (Using Parallel Debuggers to Debug Open MPI Applications)
- Moved 13.4.25 (How do I dynamically load libmpi at runtime?) to new subsection 9.10 (Dynamically loading libmpi at runtime) in section 9 (Building MPI applications)
- Moved 13.4.26 (What MPI environment variables exist?) to new subsection 11.1 (Environment variables set for MPI applications) in section 11 (Run-time operation and tuning MPI applications)

Fault Tolerance: (13.5)

- Moved 13.5.1 (What is "fault tolerance"?) to the opening of new subsection 11.4 (Fault tolerance) in section 11 (Run-time operation and tuning MPI applications)
- Moved 13.5.2 (What fault tolerance techniques has / does / will Open MPI support?) to new subsection 11.4.1 (Supported fault tolerance techniques) in section 11.4 (Fault Tolerance)
- Moved 13.5.3 (Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?) to new subsection 11.4.2 (Checkpoint and restart of parallel jobs) in section 11.4 (Fault Tolerance)
- Moved 13.5.4 (Where can I find the fault tolerance development work?) to new subsection 11.4.1.1 (Current fault tolerance development) in section 11.4 (Fault tolerance)
- Moved 13.5.5 (Does Open MPI support end-to-end data reliability in MPI message passing?) to new subsection 11.4.3 (End-to-end data reliability for MPI messages) in section 11.4 (Fault Tolerance)

Troubleshooting: (13.6)

- Moved 13.6.1 (Messages about missing symbols) to new subsection 10.12.1 (Messages about missing symbols when running my application) in section 10.12 (Troubleshooting)
- Deleted 13.6.2 (How do I attach a parallel debugger to my MPI job?), as it's covered in section 12 (Debugging Open MPI Parallel Applications)
- Moved 13.6.3 (How do I find out what MCA parameters are being seen/used by my job?) into section 8 (The Modular Component Architecture), as new section 8.3, between previous sections 8.2 and 8.3.

Large Clusters: (13.7)

- Moved 13.7.1 (How do I reduce startup time for jobs on large clusters?) to new subsection 11.5.1 (Reducing startup time for jobs on large clusters) in section 11.5 (Large clusters)
- Moved 13.7.2 (Where should I put my libraries: Network vs. local filesystems?) to new subsection 11.5.2 (Library location: network vs. local filesystems) in section 11.5 (Large clusters)
- Moved 13.7.3 (Static vs. shared libraries?) to new subsection 11.5.2.1 (Static vs. shared libraries) in section 11.5 (Large clusters)
- Moved 13.7.4 (How do I reduce the time to wireup OMPI's out-of-band communication system?) to new subsection 11.5.3 (Reducing wireup time) in section 11.5 (Large clusters)
- Moved 13.7.5 (I know my cluster's configuration - how can I take advantage of that knowledge?) to new subsection 11.5.4 (Static cluster configurations) in section 11.5 (Large clusters)

General Tuning: (13.8)

- Moved 13.8.1 (How do I install my own components into an Open MPI installation?) to new subsection 11.2 (Installing custom components) in section 11 (Run-time operation and tuning MPI applications)
- Moved 13.8.2 (What is processor affinity? Does Open MPI support it?) to new subsection 11.6.1 (Processor affinity) in section 11.6 (Processor and memory affinity)
- Moved 13.8.3 (What is memory affinity? Does Open MPI support it?) to new subsection 11.6.2 (Memory affinity) in section 11.6 (Processor and memory affinity)
- Moved 13.8.4 (How do I tell Open MPI to use processor and/or memory affinity?) to new subsection 11.6.3 (Enabling processor and memory affinity) in section 11.6 (Processor and memory affinity)
- Moved 13.8.5 (Does Open MPI support calling fork(), system(), or popen() in MPI processes?) to new subsection 9.11 (Calling fork(), system(), or popen() in MPI processes?) in section 9 (Building MPI applications)
- Moved 13.8.6 (I want to run some performance benchmarks with Open MPI. How do I do that?) to new subsection 11.8 (Benchmarking Open MPI applications) in section 11 (Run-time operation and tuning MPI applications)
- Deleted 13.8.7 (I am getting a MPI_WIN_FREE error from IMB-EXT — what do I do?), as it's about a buggy version of the Intel MPI benchmarks from 2009.

Signed-off-by: Quincey Koziol <[email protected]>
Co-authored-by: Aurelien Bouteiller <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
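For readers unfamiliar with the docs build: substitutions like the 'mpi_standard_full_version' one mentioned in the change log above are defined in the Sphinx conf.py and expanded in every .rst page. Below is only a minimal sketch of the idea, with assumed example values; the actual conf.py in this PR may obtain and name these differently.

```python
# Sketch only: how an "mpi_standard_full_version" substitution could be
# defined in a Sphinx docs/conf.py.  The example values here are assumptions;
# the real conf.py may derive them from the source tree (e.g., the VERSION file).
mpi_standard_major = 3
mpi_standard_minor = 1

# rst_prolog is prepended to every .rst file, so these substitutions can be
# used anywhere in the docs, e.g. "MPI-|mpi_standard_full_version|",
# which lets the sub-version render as "MPI-3.1" rather than just "MPI-3".
rst_prolog = f"""
.. |mpi_standard_version| replace:: {mpi_standard_major}
.. |mpi_standard_full_version| replace:: {mpi_standard_major}.{mpi_standard_minor}
"""
```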
Talked with @qkoziol in Slack; with his blessing, I squashed all his commits into 1 and rebased to the HEAD of main. Rather than make a million text suggestions here in the PR, I just made all my suggestions in a new, 2nd commit. If these suggestions are amenable, we can squash this 2nd commit into the original commit and then merge this PR.
@@ -1,3 +1,5 @@
.. _using-mpir-based-tools-label:
Do we know that the example given in this file of how to build the MPIR shim is still correct?
Should we just point people to the MPIR shim instructions (vs. giving our own example of how to build it)?
I will check if this is correct on Monday (4/10).
* |strikethru_start| Message logging techniques. Similar to those
  implemented in MPICH-V |strikethru_end|
* |strikethru_start| Data Reliability and network fault tolerance. Similar
  to those implemented in LA-MPI |strikethru_end|
What does it mean that these are crossed out?
I don't know, they were there in your original conversion of the file to RST: e4db776#diff-58f725531a70e71f53629d75173f2fe204dd7fb393584762fce5b64526545fae
I've tried to determine where that info came from before your commit, but I can't find that pattern in any RST file that was in the repo before your commit. Was that file added new for the switchover to RST?
Looks like you pulled it from faq/fault-tolerance.rst (where it used <strike> / </strike>, not the new macros).
@abouteiller What does it mean that these 2 items are struck out in the FT doc?
Yes, that's exactly true. However, I don't know why they were originally crossed out. If that's needed, I could use help determining what original file was used as input for creating faq/fault-tolerance.rst... Hopefully @abouteiller can tell us... :-)
The features are not available for general use:
- at all, in the case of data reliability (this was removed when the PML DR component was removed in d692aba)
- not suitable for production use, only for research (the pml/v component)
Ok.
- For 1, can we remove the bullet altogether?
- For 2, can we not strike it out, but rather say that it's only for research usage, ... etc.?
I think the strikeout is ambiguous as to what it actually means.
Once the strikeout is removed, those 2 new macros should be removed, too.
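For reference, the two macros in question are plain Sphinx substitutions. A minimal sketch of how they might be defined in conf.py is below (assuming they are added via rst_prolog; the actual definitions in this PR may differ), which also hints at why raw-HTML strikethrough is awkward outside the HTML builder.

```python
# Sketch only: possible conf.py definitions for the two strikethrough macros.
# Raw HTML substitutions like these only take effect in HTML output; other
# builders (man pages, PDF) would silently drop the <s>/</s> tags.
rst_prolog = """
.. |strikethru_start| raw:: html

   <s>

.. |strikethru_end| raw:: html

   </s>
"""
```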
@abouteiller In an effort to keep this PR moving, I pushed a 3rd commit as a suggestion for this FT content. Can you review? You can see the result here: https://ompi--11531.org.readthedocs.build/en/11531/tuning-apps/fault-tolerance/supported.html
If acceptable, all the commits in this PR should be squashed so that we can merge this PR.
Signed-off-by: Jeff Squyres <[email protected]>
Just so that it's in the record: I'm annoyed that this PR was merged without squashing the 2 commits that a) I explicitly asked to be squashed, and b) are explicitly labeled "SQUASHME".