In addition to ``torchrunx.launch``, we provide the ``torchrunx.Launcher`` dataclass. This allows **torchrunx** arguments to be more easily populated by CLI packages like ``tyro``:
.. code:: python

    import torchrunx as trx
    import tyro

    def distributed_function():
        print("Hello world!")

    if __name__ == "__main__":
        launcher = tyro.cli(trx.Launcher)
        launcher.run(distributed_function, {})
.. autoclass:: torchrunx.Launcher
.. autofunction:: torchrunx.Launcher.run
Logging
-------
All logs are generated in the folder provided as the ``log_dir`` argument to :mod:`torchrunx.launch`. Each agent generates a log file, named using the current date and time followed by the agent's hostname. Each worker also has a log file, named identically to its agent's log file except that the worker's local rank is appended to the name. Each agent additionally includes the output of its local worker 0 in its own log. The launcher renders agent 0's log to ``stdout`` in real time.
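To make the naming scheme concrete, here is a small sketch. The exact timestamp layout and separators that **torchrunx** uses are assumptions here; only the overall structure (date and time, then hostname, with the worker's local rank appended for worker logs) comes from the description above.

```python
from datetime import datetime

def agent_log_name(hostname: str, now: datetime) -> str:
    # Agent log: current date and time, followed by the agent's hostname
    # (the "%Y-%m-%d_%H-%M-%S" layout is an assumption for illustration)
    return f"{now:%Y-%m-%d_%H-%M-%S}_{hostname}.log"

def worker_log_name(hostname: str, now: datetime, local_rank: int) -> str:
    # Worker log: identical to the agent's, plus the worker's local rank
    return f"{now:%Y-%m-%d_%H-%M-%S}_{hostname}_{local_rank}.log"

now = datetime(2024, 1, 1, 12, 0, 0)
print(agent_log_name("node1", now))      # 2024-01-01_12-00-00_node1.log
print(worker_log_name("node1", now, 0))  # 2024-01-01_12-00-00_node1_0.log
```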
..
   TODO: example log structure
Worker environment
------------------
The :mod:`torchrunx.launch` ``env_vars`` argument allows the user to specify which environment variables should be copied from the launcher environment to the agents. By default, it attempts to copy variables related to Python and to important packages/technologies that **torchrunx** uses, such as PyTorch, NCCL, and CUDA. The strings provided are matched against environment variable names using ``fnmatch``, i.e. standard UNIX filename pattern matching. Matching variables are inserted into the agent environments, and then copied to the workers' environments when they are spawned.
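The matching behavior can be illustrated in isolation (a standalone sketch, not **torchrunx** internals; the environment contents below are made up):

```python
import fnmatch

# Patterns in the style of the defaults: exact names or UNIX-style wildcards
patterns = ["PATH", "PYTHON*", "CUDA*", "NCCL*"]

# A made-up launcher environment
launcher_env = {
    "PATH": "/usr/bin:/bin",
    "PYTHONPATH": "/opt/project",
    "CUDA_VISIBLE_DEVICES": "0,1",
    "HOME": "/home/user",  # matches no pattern, so it is not copied
}

# Copy only the variables whose names match at least one pattern
copied = {
    name: value
    for name, value in launcher_env.items()
    if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
}

print(sorted(copied))  # ['CUDA_VISIBLE_DEVICES', 'PATH', 'PYTHONPATH']
```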
:mod:`torchrunx.launch` also accepts the ``env_file`` argument, which exposes more advanced environment configuration to the user. When a file is provided as this argument, the launcher sources the file on each node before executing the agent. This makes it possible to run custom bash scripts during environment setup and to set node-specific environment variables.
Ensure you have the latest development environment installed. After cloning our repository, `install pixi <https://pixi.sh/latest/#installation>`_ and run ``pixi shell`` in the repo's root directory. Additionally, we use `ruff <https://github.com/astral-sh/ruff>`_ for linting and formatting, `pyright <https://github.com/microsoft/pyright>`_ for type checking, and ``pytest`` for testing.
Testing
-------
``tests/`` contains ``pytest``-style tests for validating that code changes do not break the core functionality of **torchrunx**. At the moment, we have a few simple CI tests powered by GitHub Actions, which are limited to single-agent, CPU-only tests due to GitHub's infrastructure.
Contributing
------------
Make a pull request with your changes and we'll try to look at it soon! If addressing a specific issue, mention it in the PR and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.
To organize processes across different nodes, **torchrunx** maintains the following hierarchy:
#. The launcher, the process in which ``torchrunx.Launcher.run`` is executed: Connects to remote hosts and initializes and configures "agents", passes errors and return values from agents to the caller, and is responsible for cleaning up.
#. The agents, initialized on machines where computation is to be performed: Responsible for starting and monitoring "workers".
#. The workers, spawned by agents: Responsible for initializing a ``torch.distributed`` process group, and running the distributed function provided by the user.
An example of how this hierarchy might look in practice is the following:

Suppose we wish to distribute a training function across four GPUs, and we have access to a cluster whose nodes each have two available GPUs. A single instance of our training function can itself leverage multiple GPUs. We choose two available nodes and use the launcher to launch our function on them, specifying one worker per node, since a single instance of our training function can use both GPUs on its node. The launcher starts an agent on each node and passes our configuration to the agents, after which each agent initializes one worker to begin executing the training function. We could instead run two workers per node with one GPU each, giving four workers in total, although this would be slower.
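The ``workers_per_host`` bookkeeping for the scenario above can be sketched as follows (the hostnames are hypothetical, and the normalization helper is ours for illustration, not **torchrunx** code):

```python
# Two chosen nodes, each with two available GPUs (hypothetical hostnames)
hostnames = ["node1", "node2"]

# One worker per node: an int applies to every host,
# while a list would give node i exactly workers_per_host[i] workers
workers_per_host = 1

counts = (
    [workers_per_host] * len(hostnames)
    if isinstance(workers_per_host, int)
    else list(workers_per_host)
)
world_size = sum(counts)

print(counts, world_size)  # [1, 1] 2 -> two workers, each using both GPUs of its node
```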
The launcher initializes the agents by SSHing into the provided hosts and executing the agent code there. It also provides key environment variables from the launch environment to the sessions where the agents are started, and tries to activate the same Python environment that was used to execute the launcher. This is one reason why all machines running a launcher or agent process should share a filesystem.
The launcher and agents perform exception handling such that any exception raised in a worker process is appropriately re-raised by the launcher process. The launcher and agents communicate using a ``torch.distributed`` process group that is separate from the group used by the workers.
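A toy sketch of this error-propagation pattern (purely illustrative; none of these names are **torchrunx** internals):

```python
class WorkerFailure(Exception):
    """Wraps an exception raised inside a worker (illustrative only)."""

def worker(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    return x * 2

def agent_run(x: int):
    # The agent catches worker errors and returns them as values
    # instead of crashing, so they can be shipped back to the launcher
    try:
        return worker(x)
    except Exception as exc:
        return WorkerFailure(f"{type(exc).__name__}: {exc}")

def launcher_collect(results: list):
    # The launcher re-raises the first worker failure it sees,
    # so the caller observes worker errors as ordinary exceptions
    for result in results:
        if isinstance(result, WorkerFailure):
            raise result
    return results

print(launcher_collect([agent_run(2), agent_run(3)]))  # [4, 6]
```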
Changes to ``src/torchrunx/launcher.py`` (28 additions, 17 deletions):

.. code:: diff

    @@ -103,30 +103,16 @@ def run(
             func_kwargs: dict[str, Any],
         ) -> dict[int, Any]:
             """
    -        Launch a distributed pytorch function on the specified nodes.
    +        Launch a distributed PyTorch function on the specified nodes. See :mod:`torchrunx.launch`

             :param func: The distributed function to call on all workers
             :type func: Callable
             :param func_kwargs: Any keyword arguments to be provided when calling ``func``
             :type func_kwargs: dict[str, Any]
    -        :param hostnames: A list of node hostnames to start workers on, defaults to ["localhost"]
    -        :type hostnames: list[str], optional
    -        :param workers_per_host: The number of workers per node. Providing an ``int`` implies all nodes should have ``workers_per_host`` workers, meanwhile providing a list causes node ``i`` to have ``worker_per_host[i]`` workers, defaults to 1
    -        :type workers_per_host: int | list[int], optional
    -        :param ssh_config_file: An SSH configuration file to use when connecting to nodes, defaults to None
    -        :param backend: A ``torch.distributed`` `backend string <https://pytorch.org/docs/stable/distributed.html#torch.distributed.Backend>`_, defaults to None
    -        :param log_dir: A directory in which logs should be written, defaults to "./logs"
    -        :type log_dir: os.PathLike | str, optional
    -        :param env_vars: A list of environmental variables to be copied from the launcher environment to workers. Allows for bash pattern matching syntax, defaults to ["PATH", "LD_LIBRARY", "LIBRARY_PATH", "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*"]
    -        :type env_vars: list[str], optional
    -        :param env_file: An additional environment file that will be sourced prior to executing ``func``, defaults to None
    -        :raises RuntimeError: May fail due to misconfiguration, or errors thrown by ``func``
             :return: A dictionary mapping worker ranks to their output
             :rtype: dict[int, Any]
    -        """  # noqa: E501
    +        """
             if not dist.is_available():
                 raise RuntimeError("The torch.distributed package is not available.")

    @@ -284,7 +270,32 @@ def launch(
                 "NCCL*",
             ],
             env_file: str | os.PathLike | None = None,
    -    ):
    +    ) -> dict[int, Any]:
    +        """
    +        Launch a distributed PyTorch function on the specified nodes.
    +
    +        :param func: The distributed function to call on all workers
    +        :type func: Callable
    +        :param func_kwargs: Any keyword arguments to be provided when calling ``func``
    +        :type func_kwargs: dict[str, Any]
    +        :param hostnames: A list of node hostnames to start workers on, defaults to ["localhost"]
    +        :type hostnames: list[str], optional
    +        :param workers_per_host: The number of workers per node. Providing an ``int`` implies all nodes should have ``workers_per_host`` workers, meanwhile providing a list causes node ``i`` to have ``worker_per_host[i]`` workers, defaults to 1
    +        :type workers_per_host: int | list[int], optional
    +        :param ssh_config_file: An SSH configuration file to use when connecting to nodes, defaults to None
    +        :param backend: A ``torch.distributed`` `backend string <https://pytorch.org/docs/stable/distributed.html#torch.distributed.Backend>`_, defaults to None
    +        :param log_dir: A directory in which logs should be written, defaults to "./logs"
    +        :type log_dir: os.PathLike | str, optional
    +        :param env_vars: A list of environmental variables to be copied from the launcher environment to workers. Allows for bash pattern matching syntax, defaults to ["PATH", "LD_LIBRARY", "LIBRARY_PATH", "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*"]
    +        :type env_vars: list[str], optional
    +        :param env_file: An additional environment file that will be sourced prior to executing ``func``, defaults to None