
More changes to docs #73


Merged
merged 35 commits on Oct 30, 2024
3bd7f1b
Update launcher.py
pmcurtin Oct 19, 2024
2ce3576
Merge branch 'main' into docs-2
apoorvkh Oct 19, 2024
2731d31
Merge branch 'main' into docs-2
apoorvkh Oct 20, 2024
0b9c1df
moved log_handlers into .run()
apoorvkh Oct 20, 2024
af8c829
update contributing
apoorvkh Oct 20, 2024
4ac384e
add tyro, remove setuptools from extras
apoorvkh Oct 20, 2024
cbf40b9
enabled linting for docs; clarified public/private functions
apoorvkh Oct 20, 2024
76aa20f
docs for utils.py
apoorvkh Oct 20, 2024
de93aaf
docs for logging_utils
apoorvkh Oct 20, 2024
e4977fd
Merge branch 'docs-2' of github.com:apoorvkh/torchrunx into worker-ex…
apoorvkh Oct 20, 2024
e697257
advanced docs
apoorvkh Oct 20, 2024
748c2b7
adding napoleon for google docs
apoorvkh Oct 21, 2024
24f4a98
linkcode
apoorvkh Oct 21, 2024
cb6620c
update linkcode
apoorvkh Oct 21, 2024
3eb297c
try again
apoorvkh Oct 21, 2024
e609f54
fix?
apoorvkh Oct 21, 2024
e88e320
now linkcode works
apoorvkh Oct 21, 2024
bef8b28
updates
apoorvkh Oct 21, 2024
86bb67b
automethod run for launcher
apoorvkh Oct 21, 2024
d80d822
maximum_signature_line_length
apoorvkh Oct 21, 2024
9950e96
switch to members?
apoorvkh Oct 21, 2024
8276abc
Merge branch 'main' of github.com:apoorvkh/torchrunx into docs-2
apoorvkh Oct 29, 2024
f335140
created utils/
apoorvkh Oct 29, 2024
0b5e316
moved functions to worker.py
apoorvkh Oct 29, 2024
084061f
renamed to worker_entrypoint
apoorvkh Oct 29, 2024
6cc9311
completed docs for utils
apoorvkh Oct 29, 2024
490f2a8
more launcher docs
apoorvkh Oct 29, 2024
e54a533
more updates to docs
apoorvkh Oct 29, 2024
455c3f3
switched LaunchResult to get
apoorvkh Oct 29, 2024
f967218
bump hash in pixi lock
apoorvkh Oct 29, 2024
3a68eb6
removed overloading from LaunchResult
apoorvkh Oct 29, 2024
9e2d5f4
update all docs
apoorvkh Oct 30, 2024
a29212e
fix
apoorvkh Oct 30, 2024
7bf9222
small edits
apoorvkh Oct 30, 2024
122febc
how it works
apoorvkh Oct 30, 2024
16 changes: 15 additions & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,17 @@
# Contributing

We use the [`pixi`](https://pixi.sh) package manager. Simply [install `pixi`](https://pixi.sh/latest/#installation) and run `pixi shell` in this repository. We use `ruff` for linting and formatting, `pyright` for static type checking, and `pytest` for testing. We build for `PyPI`. Our release pipeline is powered by Github Actions.
We use the [`pixi`](https://pixi.sh) package manager. Simply [install `pixi`](https://pixi.sh/latest/#installation) and run `pixi shell` in this repository to activate the environment.

We use `ruff check` for linting, `ruff format` for formatting, `pyright` for static type checking, and `pytest` for testing.

We build wheels with `python -m build` and upload them to [PyPI](https://pypi.org/project/torchrunx) with [twine](https://twine.readthedocs.io). Our release pipeline is powered by GitHub Actions.

## Pull Requests

Make a pull request with your changes on GitHub and we'll try to look at it soon! If addressing a specific issue, mention it in the PR and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in __torchrunx__.

## Testing

`tests/` contains `pytest`-style tests for validating that code changes do not break the core functionality of our library.

At the moment, we run `pytest tests/test_ci.py` (i.e. simple, single-node, CPU-only tests) in our GitHub Actions CI pipeline (`.github/workflows/release.yml`). You can manually run our more involved tests (on GPUs, and across multiple machines via SLURM) on your own hardware.
15 changes: 8 additions & 7 deletions README.md
@@ -56,12 +56,13 @@ Here's a simple example where we "train" a model on two nodes (with 2 GPUs each)
import torchrunx as trx

if __name__ == "__main__":
trained_model = trx.launch(
result = trx.launch(
func=train,
hostnames=["localhost", "other_node"],
workers_per_host=2 # num. GPUs
).value(rank=0) # get returned object
workers_per_host=2 # number of GPUs
)

trained_model = result.rank(0)
torch.save(trained_model.state_dict(), "model.pth")
```

@@ -70,9 +71,9 @@ if __name__ == "__main__":

## Why should I use this?

Whether you have 1 GPU, 8 GPUs, or 8 machines.
Whether you have 1 GPU, 8 GPUs, or 8 machines:

__Features:__
__Features__

- Our [`launch()`](https://torchrunx.readthedocs.io/stable/api.html#torchrunx.launch) utility is super _Pythonic_
- Return objects from your workers
@@ -81,13 +82,13 @@ __Features:__
- Fine-grained control over logging, environment variables, exception handling, etc.
- Automatic integration with SLURM

__Robustness:__
__Robustness__

- If you want to run a complex, _modular_ workflow in __one__ script
- don't parallelize your entire script: just the functions you want!
- no worries about memory leaks or OS failures

__Convenience:__
__Convenience__

- If you don't want to:
- set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself
118 changes: 59 additions & 59 deletions docs/source/advanced.rst
@@ -14,101 +14,101 @@ We could also launch multiple functions (e.g. train on many GPUs, test on one GP
func=train,
hostnames=["node1", "node2"],
workers_per_host=8
).value(rank=0)
).rank(0)

accuracy = trx.launch(
func=test,
func_kwargs={'model': model},
func_args=(trained_model,),
hostnames=["localhost"],
workers_per_host=1
).value(rank=0)
).rank(0)

print(f'Accuracy: {accuracy}')

``trx.launch()`` is self-cleaning: all processes are terminated (and the used memory is completely released) after each invocation.

:mod:`torchrunx.launch` is self-cleaning: all processes are terminated (and the used memory is completely released) before the subsequent invocation.

Environment Detection
---------------------

By default, the `hostnames` or `workers_per_host` :mod:`torchrunx.launch` parameters are set to "auto". These parameters are populated via `SLURM`_ if a SLURM environment is automatically detected. Otherwise, `hostnames = ["localhost"]` and `workers_per_host` is set to the number of GPUs or CPUs (in order of precedence) available locally.

SLURM
+++++

If the `hostnames` or `workers_per_host` parameters are set to `"slurm"`, their values will be filled from the SLURM job. Passing `"slurm"` raises a `RuntimeError` if no SLURM allocation is detected from the environment.

``Launcher`` class
------------------
Launcher class
--------------

We provide the ``torchrunx.Launcher`` class as an alternative to ``torchrunx.launch``.
We provide the :mod:`torchrunx.Launcher` class as an alternative, object-oriented interface to :mod:`torchrunx.launch`.

.. autoclass:: torchrunx.Launcher
:members:
.. .. autofunction:: torchrunx.Launcher.run
:members:

CLI Support
+++++++++++
CLI integration
^^^^^^^^^^^^^^^

This allows **torchrunx** arguments to be more easily populated by CLI packages like `tyro <https://brentyi.github.io/tyro/>`_:
We can use :mod:`torchrunx.Launcher` to populate arguments from the CLI (e.g. with `tyro <https://brentyi.github.io/tyro/>`_):

.. code:: python

import torchrunx as trx
import tyro

def distributed_function():
print("Hello world!")
pass

if __name__ == "__main__":
launcher = tyro.cli(trx.Launcher)
launcher.run(distributed_function, {})
launcher.run(distributed_function)

For example, the `python ... --help` command will then result in:
``python ... --help`` then results in:

.. code:: bash

╭─ options ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ -h, --help show this help message and exit │
│ --hostnames {[STR [STR ...]]}|{auto,slurm} │
│ (default: auto) │
│ --workers-per-host INT|{[INT [INT ...]]}|{auto,slurm} │
│ (default: auto) │
│ --ssh-config-file {None}|STR|PATH │
│ (default: None) │
│ --backend {None,nccl,gloo,mpi,ucc,auto} │
│ (default: auto) │
│ --log-handlers {fixed} (fixed to: a u t o) │
│ --env-vars STR (default: PATH LD_LIBRARY LIBRARY_PATH 'PYTHON*' 'CUDA*' 'TORCH*' 'PYTORCH*' 'NCCL*') │
│ --env-file {None}|STR|PATH │
│ (default: None) │
│ --timeout INT (default: 600) │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Custom Logging
--------------
╭─ options ─────────────────────────────────────────────╮
│ -h, --help show this help message and exit │
│ --hostnames {[STR [STR ...]]}|{auto,slurm} │
│ (default: auto) │
│ --workers-per-host INT|{[INT [INT ...]]}|{auto,slurm} │
│ (default: auto) │
│ --ssh-config-file {None}|STR|PATH │
│ (default: None) │
│ --backend {None,nccl,gloo,mpi,ucc,auto} │
│ (default: auto) │
│ --timeout INT (default: 600) │
│ --default-env-vars [STR [STR ...]] │
│ (default: PATH LD_LIBRARY ...) │
│ --extra-env-vars [STR [STR ...]] │
│ (default: ) │
│ --env-file {None}|STR|PATH │
│ (default: None) │
╰───────────────────────────────────────────────────────╯

SLURM integration
-----------------

By default, the ``hostnames`` and ``workers_per_host`` arguments are populated from the current SLURM allocation. If no allocation is detected, we assume 1 machine (``localhost``) with N workers (the number of GPUs or CPUs, in that order of precedence).
A ``RuntimeError`` is raised if ``hostnames="slurm"`` or ``workers_per_host="slurm"`` but no allocation is detected.
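The fallback logic described above can be sketched with the standard library alone. This is a simplified illustration, not torchrunx's actual implementation: real SLURM nodelists are often compressed (e.g. ``node[1-4]``) and are typically expanded via ``scontrol show hostnames``, which this sketch does not handle.

```python
import os

def detect_hostnames() -> list[str]:
    """Simplified sketch of SLURM-based hostname detection.

    Only handles plain comma-separated nodelists; compressed forms
    like "node[1-4]" would need `scontrol show hostnames`.
    """
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if nodelist is None:
        # No allocation detected: fall back to 1 machine (localhost).
        return ["localhost"]
    return nodelist.split(",")
```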

Propagating exceptions
----------------------

Logs are generated at the worker and agent level, and are specified to :mod:`torchrunx.launch` via the ``log_spec`` argument. By default, a is instantiated, causing logs at the worker and agent levels to be logged to files under ``'./logs'``, and the rank 0 worker's output streams are streamed to the launcher ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and workers have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.
Exceptions that are raised in workers will be raised by the launcher process.

Custom logging classes can be subclassed from the class. Any subclass must have a ``get_map`` method returning a dictionary mapping logger names to lists of :mod:`logging.Handler` objects, in order to be passed to :mod:`torchrunx.launch`. The logger names are of the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The maps all the loggers to :mod:`logging.Filehandler` object pointing to the files mentioned in the previous paragraph. It additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes logs the launcher's ``stdout`` stream.
A :mod:`torchrunx.AgentFailedError` or :mod:`torchrunx.WorkerFailedError` will be raised if any agent or worker dies unexpectedly (e.g. if sent a signal from the OS, due to segmentation faults or OOM).
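The re-raising behavior can be illustrated with a stdlib analogue — threads standing in for torchrunx's separate worker processes. Calling ``Future.result()`` re-raises the worker's exception in the caller, which is the same pattern the launcher follows.

```python
from concurrent.futures import ThreadPoolExecutor

def worker():
    raise ValueError("worker failed")

# The "launcher" collects results; .result() re-raises any
# exception from the worker in the caller's process, where it
# can be caught with an ordinary try/except.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(worker)
    try:
        future.result()
    except ValueError as e:
        caught = str(e)

print(caught)  # → worker failed
```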

Propagating Exceptions
----------------------
Environment variables
---------------------

Exceptions that are raised in Workers will be raised in the Launcher process and can be caught by wrapping :mod:`torchrunx.launch` in a try-except clause.
Environment variables in the launcher process that match the ``default_env_vars`` argument are automatically copied to agents and workers. We set useful defaults for Python and PyTorch. Environment variable names are pattern-matched against this list using ``fnmatch``.

If a worker is killed by the operating system (e.g. due to Segmentation Fault or SIGKILL by running out of memory), the Launcher process raises a RuntimeError.
``default_env_vars`` can be overridden if desired. This list can be augmented using ``extra_env_vars``. Additional environment variables (and more custom bash logic) can be included via the ``env_file`` argument. Our agents ``source`` this file.

Environment Variables
---------------------
We also set the following environment variables in each worker: ``LOCAL_RANK``, ``RANK``, ``LOCAL_WORLD_SIZE``, ``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``.
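The ``fnmatch`` matching described above amounts to the following sketch. The pattern list here is illustrative, not the exact torchrunx defaults:

```python
from fnmatch import fnmatch

# Illustrative patterns (a subset of the defaults mentioned above).
default_env_vars = ["PATH", "PYTHON*", "CUDA*", "NCCL*"]

def should_copy(name: str, patterns: list[str]) -> bool:
    # An env var is copied if its name matches any pattern.
    return any(fnmatch(name, p) for p in patterns)

launcher_env = {"PATH": "/usr/bin", "CUDA_HOME": "/usr/local/cuda", "HOME": "/root"}
copied = {k: v for k, v in launcher_env.items() if should_copy(k, default_env_vars)}
print(sorted(copied))  # → ['CUDA_HOME', 'PATH']
```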

Custom logging
--------------

We forward all logs (i.e. from :mod:`logging` and :mod:`sys.stdout`/:mod:`sys.stderr`) from workers and agents to the launcher. By default, the logs from the first agent and its first worker are printed into the launcher's ``stdout`` stream. Logs from all agents and workers are written to files in ``$TORCHRUNX_LOG_DIR`` (default: ``./torchrunx_logs``) and are named by timestamp, hostname, and local_rank.

:mod:`logging.Handler` objects can be provided via the ``log_handlers`` argument to provide further customization (mapping specific agents/workers to custom output streams).
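For instance, a custom handler could be built with the stdlib alone — a sketch; passing it via ``log_handlers`` (as in the comment) is assumed from the description above:

```python
import logging
import sys

# Build a custom handler: write forwarded logs to stderr with a
# "<logger name>: <message>" format. Handlers from the stdlib
# (StreamHandler, FileHandler, etc.) all work here.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))

# This list could then be passed as e.g.
#   trx.launch(..., log_handlers=log_handlers)   # illustrative usage
log_handlers = [handler]
```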

The :mod:`torchrunx.launch` ``env_vars`` argument allows the user to specify which environmental variables should be copied to the agents from the launcher environment. By default, it attempts to copy variables related to Python and important packages/technologies that **torchrunx** uses such as PyTorch, NCCL, CUDA, and more. Strings provided are matched with the names of environmental variables using ``fnmatch`` - standard UNIX filename pattern matching. The variables are inserted into the agent environments, and then copied to workers' environments when they are spawned.
We provide some utilities to help:

:mod:`torchrunx.launch` also accepts the ``env_file`` argument, which is designed to expose more advanced environmental configuration to the user. When a file is provided as this argument, the launcher will source the file on each node before executing the agent. This allows for custom bash scripts to be provided in the environmental variables, and allows for node-specific environmental variables to be set.
.. autofunction:: torchrunx.file_handler

..
TODO: example env_file
.. autofunction:: torchrunx.stream_handler

Support for Numpy >= 2.0
------------------------
only supported if `torch>=2.3`
.. autofunction:: torchrunx.add_filter_to_handler
4 changes: 3 additions & 1 deletion docs/source/api.rst
@@ -6,4 +6,6 @@ API
.. autoclass:: torchrunx.LaunchResult
:members:

.. autoclass:: torchrunx.AgentKilledError
.. autoclass:: torchrunx.AgentFailedError

.. autoclass:: torchrunx.WorkerFailedError
107 changes: 76 additions & 31 deletions docs/source/conf.py
@@ -1,51 +1,96 @@
import os
import sys

sys.path.insert(0, os.path.abspath('../../src'))
sys.path.insert(0, os.path.abspath("../../src"))

# Configuration file for the Sphinx documentation builder.

# -- Project information

project = 'torchrunx'

# -- General configuration
project = "torchrunx"
github_username = "apoorvkh"
github_repository = "torchrunx"
html_theme = "furo"

extensions = [
'sphinx.ext.duration',
'sphinx.ext.doctest',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'myst_parser',
'sphinx_toolbox.sidebar_links',
'sphinx_toolbox.github',
'sphinx.ext.autodoc.typehints',
#"sphinx_autodoc_typehints",
"sphinx.ext.duration",
"sphinx.ext.autodoc",
"sphinx.ext.intersphinx",
"myst_parser",
"sphinx_toolbox.sidebar_links",
"sphinx_toolbox.github",
"sphinx.ext.napoleon",
"sphinx.ext.autodoc.typehints",
"sphinx.ext.linkcode",
]

autodoc_mock_imports = ["torch", "fabric", "cloudpickle", "sys", "logging", "typing_extensions"]
autodoc_typehints = "both"
#typehints_defaults = 'comma'

github_username = 'apoorvkh'
github_repository = 'torchrunx'
autodoc_typehints_description_target = "documented_params"

autodoc_mock_imports = ['torch', 'fabric', 'cloudpickle', 'typing_extensions']
maximum_signature_line_length = 100

intersphinx_mapping = {
'python': ('https://docs.python.org/3/', None),
'sphinx': ('https://www.sphinx-doc.org/en/master/', None),
"python": ("https://docs.python.org/3/", None),
}
intersphinx_disabled_domains = ['std']
intersphinx_disabled_domains = ["std"]


## Link code to Github source
# From: https://github.com/scikit-learn/scikit-learn/blob/main/doc/sphinxext/github_link.py

import inspect
import os
import subprocess
import sys
from operator import attrgetter

package = project

try:
revision = (
subprocess.check_output("git rev-parse --short HEAD".split()).strip().decode("utf-8")
)
except (subprocess.CalledProcessError, OSError):
print("Failed to execute git to get revision")
revision = None

url_fmt = (
f"https://github.com/{github_username}/{github_repository}/"
"blob/{revision}/src/{package}/{path}#L{lineno}"
)

def linkcode_resolve(domain, info):
if revision is None:
return
if domain not in ("py", "pyx"):
return
if not info.get("module") or not info.get("fullname"):
return

templates_path = ['_templates']
class_name = info["fullname"].split(".")[0]
module = __import__(info["module"], fromlist=[class_name])
obj = attrgetter(info["fullname"])(module)

# -- Options for HTML output
# Unwrap the object to get the correct source
# file in case that is wrapped by a decorator
obj = inspect.unwrap(obj)

html_theme = 'furo'
try:
fn = inspect.getsourcefile(obj)
except Exception:
fn = None
if not fn:
try:
fn = inspect.getsourcefile(sys.modules[obj.__module__])
except Exception:
fn = None
if not fn:
return

# -- Options for EPUB output
epub_show_urls = 'footnote'
fn = os.path.relpath(fn, start=os.path.dirname(__import__(package).__file__))
try:
lineno = inspect.getsourcelines(obj)[1]
except Exception:
lineno = ""
return url_fmt.format(revision=revision, package=package, path=fn, lineno=lineno)

# code block syntax highlighting
#pygments_style = 'sphinx'
## End of "link code to Github source"
18 changes: 0 additions & 18 deletions docs/source/contributing.rst
@@ -1,20 +1,2 @@
Contributing
============

.. include:: ../../CONTRIBUTING.md
:parser: myst_parser.sphinx_

.. Development environment
.. -----------------------

.. Ensure you have the latest development environment installed. After cloning our repository, `install pixi <https://pixi.sh/latest/#installation>`_ and run ``pixi shell`` in the repo's root directory. Additionally, we use `ruff <https://github.com/astral-sh/ruff>`_ for linting and formatting, `pyright <https://github.com/microsoft/pyright>`_ for type checking, and ``pytest`` for testing.

.. Testing
.. -------

.. ``tests/`` contains ``pytest``-style tests for validating that code changes do not break the core functionality of **torchrunx**. At the moment, we have a few simple CI tests powered by Github action, which are limited to single-agent CPU-only tests due to Github's infrastructure.

.. Contributing
.. ------------

.. Make a pull request with your changes and we'll try to look at soon! If addressing a specific issue, mention it in the PR, and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.