Commit cd1a895

docs structure, moving docstring

1 parent 4cbc660 commit cd1a895

9 files changed (+123 -24 lines)

docs/requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
 sphinx==6.2.1
 furo
-myst-parser
+myst-parser
+sphinx-toolbox

docs/source/advanced.rst

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+Advanced Usage
+==============
+
+In addition to ``torchrunx.launch``, we provide the ``torchrunx.Launcher`` dataclass. This allows **torchrunx** arguments to be more easily populated by CLI packages like ``tyro``:
+
+.. code:: python
+
+    import torchrunx as trx
+    import tyro
+
+    def distributed_function():
+        print("Hello world!")
+
+    if __name__ == "__main__":
+        launcher = tyro.cli(trx.Launcher)
+        launcher.run(distributed_function, {})
+
+.. autoclass:: torchrunx.Launcher
+
+.. autofunction:: torchrunx.Launcher.run
+
+Logging
+-------
+
+All logs are written to the directory passed as the ``log_dir`` argument of :func:`torchrunx.launch`. Each agent generates a log file, named after the current date and time followed by the agent's hostname. Each worker also has a log file, named identically to its agent's log file except that the worker's local rank is appended. Each agent additionally includes the output of its local worker 0 in its own log. The launcher renders agent 0's log to ``stdout`` in real time.
+
+..
+    TODO: example log structure
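
For illustration, with ``log_dir="./logs"``, two agents on hypothetical hosts ``node1`` and ``node2``, and one worker per host, the naming scheme above might yield a layout like the following (the exact timestamp format is an assumption):

    logs/
    ├── 2024-01-01_12-00-00_node1.log      # agent on node1 (includes its worker 0's output)
    ├── 2024-01-01_12-00-00_node1_0.log    # node1 worker, local rank 0
    ├── 2024-01-01_12-00-00_node2.log      # agent on node2
    └── 2024-01-01_12-00-00_node2_0.log    # node2 worker, local rank 0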
+
+Worker environment
+------------------
+
+The ``env_vars`` argument of :func:`torchrunx.launch` allows the user to specify which environment variables should be copied to the agents from the launcher environment. By default, it attempts to copy variables related to Python and to important packages/technologies that **torchrunx** uses, such as PyTorch, NCCL, and CUDA. The strings provided are matched against the names of environment variables using ``fnmatch``, standard UNIX filename pattern matching. The matched variables are inserted into the agent environments and then copied to the workers' environments when the workers are spawned.
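
For illustration, the matching behavior described above can be sketched with the standard library's ``fnmatch``; the pattern list below is the documented default for ``env_vars`` in this diff:

    import fnmatch
    import os

    # Default ``env_vars`` patterns from the ``launch`` signature.
    patterns = [
        "PATH", "LD_LIBRARY", "LIBRARY_PATH",
        "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*",
    ]

    # Launcher environment variables whose names match any pattern;
    # these would be forwarded to agents and then to spawned workers.
    forwarded = {
        name: value
        for name, value in os.environ.items()
        if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
    }
    print(sorted(forwarded))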
+
+:func:`torchrunx.launch` also accepts an ``env_file`` argument, which is designed to expose more advanced environment configuration to the user. When a file is provided as this argument, the launcher sources the file on each node before executing the agent. This allows custom bash scripts to provide environment variables, and allows node-specific environment variables to be set.
+
+..
+    TODO: example env_file

docs/source/api.rst

Lines changed: 4 additions & 1 deletion
@@ -1,5 +1,8 @@
 API
 =============
 
+..
+    TODO: examples, environment variables available to workers (e.g. RANK, LOCAL_RANK)
+
 .. automodule:: torchrunx
-   :members:
+   :members: launch, slurm_hosts, slurm_workers
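
For illustration, a distributed function run by a worker can read the variables named in the TODO above; ``WORLD_SIZE`` is an assumption based on standard ``torch.distributed`` conventions, since only ``RANK`` and ``LOCAL_RANK`` are named there:

    import os

    def distributed_function():
        rank = int(os.environ["RANK"])              # global rank of this worker
        local_rank = int(os.environ["LOCAL_RANK"])  # rank of this worker on its node
        world_size = int(os.environ["WORLD_SIZE"])  # total worker count (assumed)
        print(f"worker {rank}/{world_size}, local rank {local_rank}")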

docs/source/conf.py

Lines changed: 6 additions & 1 deletion
@@ -17,9 +17,14 @@
     'sphinx.ext.autodoc',
     'sphinx.ext.autosummary',
     'sphinx.ext.intersphinx',
-    "myst_parser",
+    'myst_parser',
+    'sphinx_toolbox.sidebar_links',
+    'sphinx_toolbox.github',
 ]
 
+github_username = 'apoorvkh'
+github_repository = 'torchrunx'
+
 autodoc_mock_imports = ['torch', 'fabric', 'cloudpickle', 'typing_extensions']
 
 intersphinx_mapping = {

docs/source/contributing.rst

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+Contributing
+============
+
+Development environment
+-----------------------
+
+Ensure you have the latest development environment installed. After cloning our repository, `install pixi <https://pixi.sh/latest/#installation>`_ and run ``pixi shell`` in the repo's root directory. We use `ruff <https://github.com/astral-sh/ruff>`_ for linting and formatting, `pyright <https://github.com/microsoft/pyright>`_ for type checking, and ``pytest`` for testing.
+
+Testing
+-------
+
+``tests/`` contains ``pytest``-style tests that validate that code changes do not break the core functionality of **torchrunx**. At the moment, we have a few simple CI tests powered by GitHub Actions; due to GitHub's infrastructure, these are limited to single-agent, CPU-only tests.
+
+Pull requests
+-------------
+
+Make a pull request with your changes, and we'll try to look at it soon! If addressing a specific issue, mention it in the PR and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.

docs/source/how_it_works.rst

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+How it works
+============
+
+In order to organize processes on different nodes, **torchrunx** maintains the following hierarchy:
+
+#. The launcher, the process in which ``torchrunx.Launcher.run`` is executed: connects to remote hosts, initializes and configures "agents", passes errors and return values from the agents back to the caller, and is responsible for cleanup.
+#. The agents, initialized on the machines where computation is to be performed: responsible for starting and monitoring "workers".
+#. The workers, spawned by the agents: responsible for initializing a ``torch.distributed`` process group and for running the distributed function provided by the user.
+
+An example of how this hierarchy might look in practice:
+Suppose we wish to distribute a training function over four GPUs, and we have access to a cluster whose nodes each have two available GPUs. Assuming a single instance of our training function can leverage multiple GPUs, we can choose two available nodes and use the launcher to launch our function on those two nodes, specifying one worker per node, since a single instance of our training function can use both GPUs on each node. The launcher will start an agent on each node and pass our configuration to the agents, after which each agent will initialize one worker to begin executing the training function. We could instead run two workers per node, each with one GPU, giving us four workers, although this would be slower.
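
For illustration, the two-node scenario above might be written against the ``launch`` API documented in this diff; a minimal sketch, with hypothetical hostnames:

    import torchrunx as trx

    def train():
        # Assumed capable of using both GPUs on its node.
        ...

    if __name__ == "__main__":
        # Two nodes, one worker per node -> two workers in total.
        results = trx.launch(
            func=train,
            func_kwargs={},
            hostnames=["node1", "node2"],  # hypothetical hostnames
            workers_per_host=1,
        )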
+
+The launcher initializes the agents by simply SSHing into the provided hosts and executing the agent code there. The launcher also provides key environment variables from the launch environment to the sessions in which the agents are started, and it tries to activate the same Python environment that was used to execute the launcher. This is one reason why all machines running a launcher or agent process should share a filesystem.
+
+The launcher and agents perform exception handling such that any exceptions raised in the worker processes are appropriately re-raised by the launcher process. The launcher and agents communicate using a ``torch.distributed`` process group that is separate from the group the workers use.

docs/source/index.rst

Lines changed: 10 additions & 3 deletions
@@ -1,5 +1,5 @@
-Welcome to torchrunx's documentation!
-=====================================
+Getting Started
+===============
 
 .. include:: ../../README.md
    :parser: myst_parser.sphinx_
@@ -10,4 +10,11 @@ Contents
 .. toctree::
    :maxdepth: 2
 
-   api
+   api
+   advanced
+   how_it_works
+   contributing
+
+.. sidebar-links::
+   :github:
+   :pypi: torchrunx

src/torchrunx/launcher.py

Lines changed: 28 additions & 17 deletions
@@ -103,30 +103,16 @@ def run(
         func_kwargs: dict[str, Any],
     ) -> dict[int, Any]:
         """
-        Launch a distributed pytorch function on the specified nodes.
+        Launch a distributed PyTorch function on the specified nodes. See :func:`torchrunx.launch`.
 
         :param func: The distributed function to call on all workers
         :type func: Callable
         :param func_kwargs: Any keyword arguments to be provided when calling ``func``
         :type func_kwargs: dict[str, Any]
-        :param hostnames: A list of node hostnames to start workers on, defaults to ["localhost"]
-        :type hostnames: list[str], optional
-        :param workers_per_host: The number of workers per node. Providing an ``int`` implies all nodes should have ``workers_per_host`` workers, while providing a list causes node ``i`` to have ``workers_per_host[i]`` workers, defaults to 1
-        :type workers_per_host: int | list[int], optional
-        :param ssh_config_file: An SSH configuration file to use when connecting to nodes, defaults to None
-        :type ssh_config_file: str | os.PathLike | None, optional
-        :param backend: A ``torch.distributed`` `backend string <https://pytorch.org/docs/stable/distributed.html#torch.distributed.Backend>`_, defaults to None
-        :type backend: Literal['mpi', 'gloo', 'nccl', 'ucc', None], optional
-        :param log_dir: A directory in which logs should be written, defaults to "./logs"
-        :type log_dir: os.PathLike | str, optional
-        :param env_vars: A list of environment variables to be copied from the launcher environment to workers. Allows for bash pattern matching syntax, defaults to ["PATH", "LD_LIBRARY", "LIBRARY_PATH", "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*"]
-        :type env_vars: list[str], optional
-        :param env_file: An additional environment file that will be sourced prior to executing ``func``, defaults to None
-        :type env_file: str | os.PathLike | None, optional
         :raises RuntimeError: May fail due to misconfiguration, or errors thrown by ``func``
         :return: A dictionary mapping worker ranks to their output
         :rtype: dict[int, Any]
-        """  # noqa: E501
+        """
         if not dist.is_available():
             raise RuntimeError("The torch.distributed package is not available.")
 
@@ -284,7 +270,32 @@ def launch(
         "NCCL*",
     ],
     env_file: str | os.PathLike | None = None,
-):
+) -> dict[int, Any]:
+    """
+    Launch a distributed PyTorch function on the specified nodes.
+
+    :param func: The distributed function to call on all workers
+    :type func: Callable
+    :param func_kwargs: Any keyword arguments to be provided when calling ``func``
+    :type func_kwargs: dict[str, Any]
+    :param hostnames: A list of node hostnames to start workers on, defaults to ["localhost"]
+    :type hostnames: list[str], optional
+    :param workers_per_host: The number of workers per node. Providing an ``int`` implies all nodes should have ``workers_per_host`` workers, while providing a list causes node ``i`` to have ``workers_per_host[i]`` workers, defaults to 1
+    :type workers_per_host: int | list[int], optional
+    :param ssh_config_file: An SSH configuration file to use when connecting to nodes, defaults to None
+    :type ssh_config_file: str | os.PathLike | None, optional
+    :param backend: A ``torch.distributed`` `backend string <https://pytorch.org/docs/stable/distributed.html#torch.distributed.Backend>`_, defaults to None
+    :type backend: Literal['mpi', 'gloo', 'nccl', 'ucc', None], optional
+    :param log_dir: A directory in which logs should be written, defaults to "./logs"
+    :type log_dir: os.PathLike | str, optional
+    :param env_vars: A list of environment variables to be copied from the launcher environment to workers. Allows for bash pattern matching syntax, defaults to ["PATH", "LD_LIBRARY", "LIBRARY_PATH", "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*"]
+    :type env_vars: list[str], optional
+    :param env_file: An additional environment file that will be sourced prior to executing ``func``, defaults to None
+    :type env_file: str | os.PathLike | None, optional
+    :raises RuntimeError: May fail due to misconfiguration, or errors thrown by ``func``
+    :return: A dictionary mapping worker ranks to their output
+    :rtype: dict[int, Any]
+    """  # noqa: E501
     return Launcher(
         hostnames=hostnames,
         workers_per_host=workers_per_host,
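
As the hunk above shows, ``launch`` simply constructs a ``Launcher`` from its arguments, so the functional and dataclass entry points should behave identically; a minimal sketch, with hypothetical hostnames:

    import torchrunx as trx

    def train():
        ...

    # Functional entry point (where the parameter docs now live) ...
    results = trx.launch(train, {}, hostnames=["node1", "node2"], workers_per_host=2)

    # ... equivalent to constructing the dataclass and calling its run() method.
    launcher = trx.Launcher(hostnames=["node1", "node2"], workers_per_host=2)
    results = launcher.run(train, {})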

src/torchrunx/slurm.py

Lines changed: 3 additions & 1 deletion
@@ -1,8 +1,10 @@
+from __future__ import annotations
+
 import os
 import subprocess
 
 
-def slurm_hosts() -> "list[str]":
+def slurm_hosts() -> list[str]:
     """Retrieves hostnames of Slurm-allocated nodes.
 
     :return: Hostnames of nodes in current Slurm allocation
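
For illustration, ``slurm_hosts`` pairs naturally with ``slurm_workers`` (both exported in the API section above) when launching from inside a Slurm allocation; a minimal sketch, assuming it runs within a Slurm job:

    import torchrunx as trx

    def train():
        ...

    if __name__ == "__main__":
        # Derive hosts and per-host worker counts from the Slurm allocation
        # instead of hard-coding them.
        results = trx.launch(
            func=train,
            func_kwargs={},
            hostnames=trx.slurm_hosts(),           # hostnames of allocated nodes
            workers_per_host=trx.slurm_workers(),  # assumption: workers per node, int or list
        )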
