Commit 677edcb

Merge pull request #73 from apoorvkh/docs-2
More changes to docs
2 parents e31f967 + 122febc

22 files changed: +702 −532 lines

CONTRIBUTING.md

Lines changed: 15 additions & 1 deletion

@@ -1,3 +1,17 @@
  # Contributing

- We use the [`pixi`](https://pixi.sh) package manager. Simply [install `pixi`](https://pixi.sh/latest/#installation) and run `pixi shell` in this repository. We use `ruff` for linting and formatting, `pyright` for static type checking, and `pytest` for testing. We build for `PyPI`. Our release pipeline is powered by Github Actions.
+ We use the [`pixi`](https://pixi.sh) package manager. Simply [install `pixi`](https://pixi.sh/latest/#installation) and run `pixi shell` in this repository to activate the environment.
+
+ We use `ruff check` for linting, `ruff format` for formatting, `pyright` for static type checking, and `pytest` for testing.
+
+ We build wheels with `python -m build` and upload to [PyPI](https://pypi.org/project/torchrunx) with [twine](https://twine.readthedocs.io). Our release pipeline is powered by GitHub Actions.
+
+ ## Pull Requests
+
+ Make a pull request with your changes on GitHub and we'll try to look at it soon! If addressing a specific issue, mention it in the PR and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in __torchrunx__.
+
+ ## Testing
+
+ `tests/` contains `pytest`-style tests for validating that code changes do not break the core functionality of our library.
+
+ At the moment, we run `pytest tests/test_ci.py` (i.e. simple single-node, CPU-only tests) in our GitHub Actions CI pipeline (`.github/workflows/release.yml`). One can manually run our more involved tests (on GPUs, on multiple machines via SLURM) on their own hardware.

README.md

Lines changed: 8 additions & 7 deletions

@@ -56,12 +56,13 @@ Here's a simple example where we "train" a model on two nodes (with 2 GPUs each)
  import torchrunx as trx

  if __name__ == "__main__":
-     trained_model = trx.launch(
+     result = trx.launch(
          func=train,
          hostnames=["localhost", "other_node"],
-         workers_per_host=2  # num. GPUs
-     ).value(rank=0)  # get returned object
+         workers_per_host=2  # number of GPUs
+     )

+     trained_model = result.rank(0)
      torch.save(trained_model.state_dict(), "model.pth")
  ```

@@ -70,9 +71,9 @@ if __name__ == "__main__":

  ## Why should I use this?

- Whether you have 1 GPU, 8 GPUs, or 8 machines.
+ Whether you have 1 GPU, 8 GPUs, or 8 machines:

- __Features:__
+ __Features__

  - Our [`launch()`](https://torchrunx.readthedocs.io/stable/api.html#torchrunx.launch) utility is super _Pythonic_
  - Return objects from your workers

@@ -81,13 +82,13 @@ __Features:__
  - Fine-grained control over logging, environment variables, exception handling, etc.
  - Automatic integration with SLURM

- __Robustness:__
+ __Robustness__

  - If you want to run a complex, _modular_ workflow in __one__ script
    - don't parallelize your entire script: just the functions you want!
    - no worries about memory leaks or OS failures

- __Convenience:__
+ __Convenience__

  - If you don't want to:
    - set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself

docs/source/advanced.rst

Lines changed: 59 additions & 59 deletions
@@ -14,101 +14,101 @@ We could also launch multiple functions (e.g. train on many GPUs, test on one GPU)
      func=train,
      hostnames=["node1", "node2"],
      workers_per_host=8
- ).value(rank=0)
+ ).rank(0)

  accuracy = trx.launch(
      func=test,
-     func_kwargs={'model': model},
+     func_args=(trained_model,),
      hostnames=["localhost"],
      workers_per_host=1
- ).value(rank=0)
+ ).rank(0)

  print(f'Accuracy: {accuracy}')

- ``trx.launch()`` is self-cleaning: all processes are terminated (and the used memory is completely released) after each invocation.
+ :mod:`torchrunx.launch` is self-cleaning: all processes are terminated (and the used memory is completely released) before the subsequent invocation.

- Environment Detection
- ---------------------
-
- By default, the `hostnames` or `workers_per_host` :mod:`torchrunx.launch` parameters are set to "auto". These parameters are populated via `SLURM`_ if a SLURM environment is automatically detected. Otherwise, `hostnames = ["localhost"]` and `workers_per_host` is set to the number of GPUs or CPUs (in order of precedence) available locally.
-
- SLURM
- +++++
-
- If the `hostnames` or `workers_per_host` parameters are set to `"slurm"`, their values will be filled from the SLURM job. Passing `"slurm"` raises a `RuntimeError` if no SLURM allocation is detected from the environment.
-
- ``Launcher`` class
- ------------------
+ Launcher class
+ --------------

- We provide the ``torchrunx.Launcher`` class as an alternative to ``torchrunx.launch``.
+ We provide the :mod:`torchrunx.Launcher` class as an alias to :mod:`torchrunx.launch`.

  .. autoclass:: torchrunx.Launcher
-     :members:
- .. .. autofunction:: torchrunx.Launcher.run
+     :members:

- CLI Support
- +++++++++++
+ CLI integration
+ ^^^^^^^^^^^^^^^

- This allows **torchrunx** arguments to be more easily populated by CLI packages like `tyro <https://brentyi.github.io/tyro/>`_:
+ We can use :mod:`torchrunx.Launcher` to populate arguments from the CLI (e.g. with `tyro <https://brentyi.github.io/tyro/>`_):

  .. code:: python

      import torchrunx as trx
      import tyro

      def distributed_function():
-         print("Hello world!")
+         pass

      if __name__ == "__main__":
          launcher = tyro.cli(trx.Launcher)
-         launcher.run(distributed_function, {})
+         launcher.run(distributed_function)

- For example, the `python ... --help` command will then result in:
+ ``python ... --help`` then results in:

  .. code:: bash

-     ╭─ options ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
-     │ -h, --help show this help message and exit
-     │ --hostnames {[STR [STR ...]]}|{auto,slurm} │
-     │ (default: auto) │
-     │ --workers-per-host INT|{[INT [INT ...]]}|{auto,slurm} │
-     │ (default: auto) │
-     │ --ssh-config-file {None}|STR|PATH │
-     │ (default: None) │
-     │ --backend {None,nccl,gloo,mpi,ucc,auto} │
-     │ (default: auto) │
-     │ --log-handlers {fixed} (fixed to: a u t o) │
-     │ --env-vars STR (default: PATH LD_LIBRARY LIBRARY_PATH 'PYTHON*' 'CUDA*' 'TORCH*' 'PYTORCH*' 'NCCL*') │
-     │ --env-file {None}|STR|PATH │
-     │ (default: None) │
-     │ --timeout INT (default: 600) │
-     ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+     ╭─ options ─────────────────────────────────────────────╮
+     │ -h, --help show this help message and exit
+     │ --hostnames {[STR [STR ...]]}|{auto,slurm} │
+     │ (default: auto) │
+     │ --workers-per-host INT|{[INT [INT ...]]}|{auto,slurm} │
+     │ (default: auto) │
+     │ --ssh-config-file {None}|STR|PATH │
+     │ (default: None) │
+     │ --backend {None,nccl,gloo,mpi,ucc,auto} │
+     │ (default: auto) │
+     │ --timeout INT (default: 600) │
+     │ --default-env-vars [STR [STR ...]] │
+     │ (default: PATH LD_LIBRARY ...) │
+     │ --extra-env-vars [STR [STR ...]] │
+     │ (default: ) │
+     │ --env-file {None}|STR|PATH │
+     │ (default: None) │
+     ╰───────────────────────────────────────────────────────╯

+ SLURM integration
+ -----------------
+
+ By default, the ``hostnames`` and ``workers_per_host`` arguments are populated from the current SLURM allocation. If no allocation is detected, we assume 1 machine (localhost) with N workers (num. GPUs or CPUs).
+ Raises a ``RuntimeError`` if ``hostnames="slurm"`` or ``workers_per_host="slurm"`` but no allocation is detected.
+ Propagating exceptions
+ ----------------------

- Logs are generated at the worker and agent level, and are specified to :mod:`torchrunx.launch` via the ``log_spec`` argument. By default, a is instantiated, causing logs at the worker and agent levels to be logged to files under ``'./logs'``, and the rank 0 worker's output streams are streamed to the launcher ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and workers have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.
+ Exceptions that are raised in workers will be raised by the launcher process.

- Custom logging classes can be subclassed from the class. Any subclass must have a ``get_map`` method returning a dictionary mapping logger names to lists of :mod:`logging.Handler` objects, in order to be passed to :mod:`torchrunx.launch`. The logger names are of the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The maps all the loggers to :mod:`logging.Filehandler` object pointing to the files mentioned in the previous paragraph. It additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes logs the launcher's ``stdout`` stream.
+ A :mod:`torchrunx.AgentFailedError` or :mod:`torchrunx.WorkerFailedError` will be raised if any agent or worker dies unexpectedly (e.g. if sent a signal from the OS, due to segmentation faults or OOM).
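
This mirrors the behavior of Python's standard executors; a torchrunx-free sketch of catching a worker's exception at the launcher:

```python
from concurrent.futures import ThreadPoolExecutor


def worker() -> None:
    # Stand-in for a function launched on a remote worker.
    raise ValueError("worker failed")


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(worker)

try:
    # The exception raised in the worker is re-raised here when the
    # result is requested -- wrap the call in try/except to handle it.
    future.result()
except ValueError as error:
    print(f"caught: {error}")  # → caught: worker failed
```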

- Propagating Exceptions
- ----------------------
+ Environment variables
+ ---------------------

- Exceptions that are raised in Workers will be raised in the Launcher process and can be caught by wrapping :mod:`torchrunx.launch` in a try-except clause.
+ Environment variables in the launcher process that match the ``default_env_vars`` argument are automatically copied to agents and workers. We set useful defaults for Python and PyTorch. Environment variables are pattern-matched with this list using ``fnmatch``.

- If a worker is killed by the operating system (e.g. due to Segmentation Fault or SIGKILL by running out of memory), the Launcher process raises a RuntimeError.
+ ``default_env_vars`` can be overridden if desired. This list can be augmented using ``extra_env_vars``. Additional environment variables (and more custom bash logic) can be included via the ``env_file`` argument. Our agents ``source`` this file.
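
The pattern matching can be illustrated with the stdlib ``fnmatch`` module (the default pattern list is quoted from the old CLI help above; the ``copied_env`` helper is illustrative, not torchrunx's API):

```python
import fnmatch

DEFAULT_PATTERNS = [
    "PATH", "LD_LIBRARY", "LIBRARY_PATH",
    "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*",
]


def copied_env(env: dict, patterns: list) -> dict:
    # Keep only the variables whose names match any fnmatch pattern.
    return {
        name: value
        for name, value in env.items()
        if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
    }


env = {"PATH": "/usr/bin", "PYTHONPATH": "/src", "CUDA_HOME": "/opt/cuda", "SECRET": "x"}
print(copied_env(env, DEFAULT_PATTERNS))
# → {'PATH': '/usr/bin', 'PYTHONPATH': '/src', 'CUDA_HOME': '/opt/cuda'}
```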

- Environment Variables
- ---------------------
+ We also set the following environment variables in each worker: ``LOCAL_RANK``, ``RANK``, ``LOCAL_WORLD_SIZE``, ``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``.
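
The rank arithmetic behind these variables can be sketched as follows (``worker_env`` is an illustrative helper, not part of the torchrunx API):

```python
def worker_env(host_index: int, local_rank: int, workers_per_host: int, num_hosts: int) -> dict:
    # Rank bookkeeping as set by torchrun-style launchers: the global RANK
    # counts workers across hosts in host order.
    return {
        "LOCAL_RANK": str(local_rank),
        "RANK": str(host_index * workers_per_host + local_rank),
        "LOCAL_WORLD_SIZE": str(workers_per_host),
        "WORLD_SIZE": str(num_hosts * workers_per_host),
    }


# Second worker on the second of two hosts (2 workers per host):
print(worker_env(host_index=1, local_rank=1, workers_per_host=2, num_hosts=2))
# → {'LOCAL_RANK': '1', 'RANK': '3', 'LOCAL_WORLD_SIZE': '2', 'WORLD_SIZE': '4'}
```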
+
+ Custom logging
+ --------------
+
+ We forward all logs (i.e. from :mod:`logging` and :mod:`sys.stdout`/:mod:`sys.stderr`) from workers and agents to the launcher. By default, the logs from the first agent and its first worker are printed into the launcher's ``stdout`` stream. Logs from all agents and workers are written to files in ``$TORCHRUNX_LOG_DIR`` (default: ``./torchrunx_logs``) and are named by timestamp, hostname, and local rank.
+
+ :mod:`logging.Handler` objects can be provided via the ``log_handlers`` argument for further customization (e.g. mapping specific agents/workers to custom output streams).

- The :mod:`torchrunx.launch` ``env_vars`` argument allows the user to specify which environmental variables should be copied to the agents from the launcher environment. By default, it attempts to copy variables related to Python and important packages/technologies that **torchrunx** uses such as PyTorch, NCCL, CUDA, and more. Strings provided are matched with the names of environmental variables using ``fnmatch`` - standard UNIX filename pattern matching. The variables are inserted into the agent environments, and then copied to workers' environments when they are spawned.
- The :mod:`torchrunx.launch` ``env_file`` argument is designed to expose more advanced environmental configuration to the user. When a file is provided as this argument, the launcher will source the file on each node before executing the agent. This allows for custom bash scripts to be provided in the environmental variables, and allows for node-specific environmental variables to be set.
-
- .. TODO: example env_file
+ We provide some utilities to help:
+
+ .. autofunction:: torchrunx.file_handler
+
+ .. autofunction:: torchrunx.stream_handler
+
+ .. autofunction:: torchrunx.add_filter_to_handler

- Support for Numpy >= 2.0
- ------------------------
- only supported if `torch>=2.3`
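
As a stdlib illustration of routing one worker's records to a dedicated file by filtering on logger name (a sketch only; the real ``file_handler``/``add_filter_to_handler`` utilities may differ):

```python
import logging


def filtered_file_handler(path: str, logger_name: str) -> logging.Handler:
    # Write only records emitted by the named logger (e.g. one worker).
    handler = logging.FileHandler(path)
    handler.addFilter(lambda record: record.name == logger_name)
    return handler


# Route "node1[0]" records to a file while ignoring "node1[1]":
handler = filtered_file_handler("node1_worker0.log", "node1[0]")
for name in ("node1[0]", "node1[1]"):
    logging.getLogger(name).addHandler(handler)

logging.getLogger("node1[0]").warning("kept")
logging.getLogger("node1[1]").warning("dropped")
handler.close()
```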

docs/source/api.rst

Lines changed: 3 additions & 1 deletion

@@ -6,4 +6,6 @@ API
  .. autoclass:: torchrunx.LaunchResult
      :members:

- .. autoclass:: torchrunx.AgentKilledError
+ .. autoclass:: torchrunx.AgentFailedError
+
+ .. autoclass:: torchrunx.WorkerFailedError

docs/source/conf.py

Lines changed: 76 additions & 31 deletions
@@ -1,51 +1,96 @@
  import os
  import sys

- sys.path.insert(0, os.path.abspath('../../src'))
+ sys.path.insert(0, os.path.abspath("../../src"))

  # Configuration file for the Sphinx documentation builder.

- # -- Project information
-
- project = 'torchrunx'
-
- # -- General configuration
+ project = "torchrunx"
+ github_username = "apoorvkh"
+ github_repository = "torchrunx"
+ html_theme = "furo"

  extensions = [
-     'sphinx.ext.duration',
-     'sphinx.ext.doctest',
-     'sphinx.ext.autodoc',
-     'sphinx.ext.autosummary',
-     'sphinx.ext.intersphinx',
-     'myst_parser',
-     'sphinx_toolbox.sidebar_links',
-     'sphinx_toolbox.github',
-     'sphinx.ext.autodoc.typehints',
-     #"sphinx_autodoc_typehints",
+     "sphinx.ext.duration",
+     "sphinx.ext.autodoc",
+     "sphinx.ext.intersphinx",
+     "myst_parser",
+     "sphinx_toolbox.sidebar_links",
+     "sphinx_toolbox.github",
+     "sphinx.ext.napoleon",
+     "sphinx.ext.autodoc.typehints",
+     "sphinx.ext.linkcode",
  ]

+ autodoc_mock_imports = ["torch", "fabric", "cloudpickle", "sys", "logging", "typing_extensions"]
  autodoc_typehints = "both"
- #typehints_defaults = 'comma'
-
- github_username = 'apoorvkh'
- github_repository = 'torchrunx'
+ autodoc_typehints_description_target = "documented_params"

- autodoc_mock_imports = ['torch', 'fabric', 'cloudpickle', 'typing_extensions']
+ maximum_signature_line_length = 100

  intersphinx_mapping = {
-     'python': ('https://docs.python.org/3/', None),
-     'sphinx': ('https://www.sphinx-doc.org/en/master/', None),
+     "python": ("https://docs.python.org/3/", None),
  }
- intersphinx_disabled_domains = ['std']
+ intersphinx_disabled_domains = ["std"]
+
+
+ ## Link code to Github source
+ # From: https://github.com/scikit-learn/scikit-learn/blob/main/doc/sphinxext/github_link.py
+
+ import inspect
+ import os
+ import subprocess
+ import sys
+ from operator import attrgetter
+
+ package = project
+
+ try:
+     revision = (
+         subprocess.check_output("git rev-parse --short HEAD".split()).strip().decode("utf-8")
+     )
+ except (subprocess.CalledProcessError, OSError):
+     print("Failed to execute git to get revision")
+     revision = None
+
+ url_fmt = (
+     f"https://github.com/{github_username}/{github_repository}/"
+     "blob/{revision}/src/{package}/{path}#L{lineno}"
+ )
+
+ def linkcode_resolve(domain, info):
+     if revision is None:
+         return
+     if domain not in ("py", "pyx"):
+         return
+     if not info.get("module") or not info.get("fullname"):
+         return

+     class_name = info["fullname"].split(".")[0]
+     module = __import__(info["module"], fromlist=[class_name])
+     obj = attrgetter(info["fullname"])(module)

- templates_path = ['_templates']

+     # Unwrap the object to get the correct source
+     # file in case that is wrapped by a decorator
+     obj = inspect.unwrap(obj)

- # -- Options for HTML output

+     try:
+         fn = inspect.getsourcefile(obj)
+     except Exception:
+         fn = None
+     if not fn:
+         try:
+             fn = inspect.getsourcefile(sys.modules[obj.__module__])
+         except Exception:
+             fn = None
+     if not fn:
+         return

- html_theme = 'furo'

+     fn = os.path.relpath(fn, start=os.path.dirname(__import__(package).__file__))
+     try:
+         lineno = inspect.getsourcelines(obj)[1]
+     except Exception:
+         lineno = ""
+     return url_fmt.format(revision=revision, package=package, path=fn, lineno=lineno)

- # -- Options for EPUB output
- epub_show_urls = 'footnote'

- # code block syntax highlighting
- #pygments_style = 'sphinx'
+ ## End of "link code to Github source"

docs/source/contributing.rst

Lines changed: 0 additions & 18 deletions

@@ -1,20 +1,2 @@
- Contributing
- ============
-
  .. include:: ../../CONTRIBUTING.md
      :parser: myst_parser.sphinx_
-
- .. Development environment
- .. -----------------------
-
- .. Ensure you have the latest development environment installed. After cloning our repository, `install pixi <https://pixi.sh/latest/#installation>`_ and run ``pixi shell`` in the repo's root directory. Additionally, we use `ruff <https://github.com/astral-sh/ruff>`_ for linting and formatting, `pyright <https://github.com/microsoft/pyright>`_ for type checking, and ``pytest`` for testing.
-
- .. Testing
- .. -------
-
- .. ``tests/`` contains ``pytest``-style tests for validating that code changes do not break the core functionality of **torchrunx**. At the moment, we have a few simple CI tests powered by Github action, which are limited to single-agent CPU-only tests due to Github's infrastructure.
-
- .. Contributing
- .. ------------
-
- .. Make a pull request with your changes and we'll try to look at soon! If addressing a specific issue, mention it in the PR, and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.
