docs (draft 1) #70


Merged: 56 commits, Oct 19, 2024
Commits (56)
5d1ca04
Update README.md
pmcurtin Sep 30, 2024
c2a46a0
Update launcher.py
pmcurtin Sep 30, 2024
8121683
Update api.rst
pmcurtin Sep 30, 2024
44bd676
Update api.rst
pmcurtin Sep 30, 2024
8552359
Update api.rst
pmcurtin Sep 30, 2024
2a1061b
Update api.rst
pmcurtin Sep 30, 2024
15411b5
Update api.rst
pmcurtin Sep 30, 2024
de0b140
Update api.rst
pmcurtin Sep 30, 2024
9419554
Update __init__.py
pmcurtin Sep 30, 2024
dd10c49
Update api.rst
pmcurtin Sep 30, 2024
e974755
Update advanced.rst
pmcurtin Sep 30, 2024
e6faa44
Merge branch 'main' into docs
apoorvkh Sep 30, 2024
e3c0665
Update README.md
apoorvkh Oct 3, 2024
c48edaf
Update index.rst
apoorvkh Oct 3, 2024
66933e7
Update index.rst
apoorvkh Oct 3, 2024
e2536fc
Update index.rst
apoorvkh Oct 3, 2024
f47b492
Update requirements.txt
pmcurtin Oct 4, 2024
f24f936
Update conf.py
pmcurtin Oct 4, 2024
fec0f40
Update conf.py
pmcurtin Oct 4, 2024
fb35066
Update launcher.py
pmcurtin Oct 4, 2024
9ecd997
Update conf.py
pmcurtin Oct 4, 2024
1085c63
Update conf.py
pmcurtin Oct 4, 2024
74be93c
remove requirement
pmcurtin Oct 4, 2024
afd9808
Update launcher.py
pmcurtin Oct 4, 2024
4b82752
Update launcher.py
pmcurtin Oct 4, 2024
7783bc9
Update api.rst
pmcurtin Oct 18, 2024
5139db2
Update api.rst
pmcurtin Oct 18, 2024
fedbd30
Update api.rst
pmcurtin Oct 18, 2024
e507b78
Update launcher.py
pmcurtin Oct 18, 2024
3d71997
try removing launch types
pmcurtin Oct 18, 2024
662e899
touch up launch formatting
pmcurtin Oct 18, 2024
1147a71
Update launcher.py
pmcurtin Oct 18, 2024
bf7965a
touch up example in readme
pmcurtin Oct 18, 2024
2a1329e
fix first example
pmcurtin Oct 18, 2024
6768ddb
Update launch docs
apoorvkh Oct 18, 2024
1fd6ddb
Update README.md
pmcurtin Oct 18, 2024
c8635d0
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 18, 2024
0452f11
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 18, 2024
6357e85
Update README.md
pmcurtin Oct 18, 2024
fc6e3cb
fixing quotes
apoorvkh Oct 18, 2024
d31ba90
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 18, 2024
fe57ba2
remove return type
apoorvkh Oct 18, 2024
6782718
moved complicated example to advanced
apoorvkh Oct 18, 2024
f1d2ec2
test
apoorvkh Oct 18, 2024
dee60b3
update readme example
apoorvkh Oct 19, 2024
3a55e79
update readme
apoorvkh Oct 19, 2024
c0ea355
update readme
apoorvkh Oct 19, 2024
9332adf
readme updates
apoorvkh Oct 19, 2024
bc30b7d
Merge branch 'main' into docs
apoorvkh Oct 19, 2024
82f576f
test
apoorvkh Oct 19, 2024
ceb04fc
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 19, 2024
cb182ed
misc readthedocs
apoorvkh Oct 19, 2024
c6b7e99
docs fix
apoorvkh Oct 19, 2024
3f4e025
fix incorrect imports
apoorvkh Oct 19, 2024
15d0c0e
fix readthedocs
apoorvkh Oct 19, 2024
ebf1a14
more docs
apoorvkh Oct 19, 2024
131 changes: 54 additions & 77 deletions README.md
@@ -1,6 +1,7 @@
# torchrunx 🔥

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/torchrunx)](https://github.com/apoorvkh/torchrunx/blob/main/pyproject.toml)
[![PyTorch Version](https://img.shields.io/badge/torch-%3E%3D2.0-orange)](https://github.com/pytorch/pytorch)
[![PyPI - Version](https://img.shields.io/pypi/v/torchrunx)](https://pypi.org/project/torchrunx/)
![Tests](https://img.shields.io/github/actions/workflow/status/apoorvkh/torchrunx/.github%2Fworkflows%2Fmain.yml)
[![Docs](https://readthedocs.org/projects/torchrunx/badge/?version=stable)](https://torchrunx.readthedocs.io)
@@ -16,102 +17,78 @@ By [Apoorv Khandelwal](http://apoorvkh.com) and [Peter Curtin](https://github.co
pip install torchrunx
```

Requires: Linux, Python >= 3.8.1, PyTorch >= 2.0
**Requires:** Linux (with shared filesystem & SSH access if using multiple machines)

Shared filesystem & SSH access if using multiple machines
## Demo

## Minimal example
Here's a simple example where we "train" a model on two nodes (with 2 GPUs each).

Here's a simple example where we distribute `distributed_function` to two hosts (with 2 GPUs each):
<details>
<summary>Training code</summary>

```python
def train_model(model, dataset):
trained_model = train(model, dataset)

if int(os.environ["RANK"]) == 0:
torch.save(learned_model, 'model.pt')
return 'model.pt'

return None
```

```python
import torchrunx as trx

model_path = trx.launch(
func=train_model,
func_kwargs={'model': my_model, 'training_dataset': mnist_train},
hostnames=["localhost", "other_node"],
workers_per_host=2
)["localhost"][0] # return from rank 0 (first worker on "localhost")
```

## Why should I use this?

[`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) is a hammer. `torchrunx` is a chisel.

Whether you have 1 GPU, 8 GPUs, or 8 machines:

Convenience:
```python
import os
import torch

- If you don't want to set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself
- If you want to run `python myscript.py` instead of `torchrun myscript.py`
- If you don't want to manually SSH and run `torchrun --master-ip --master-port ...` on every machine (and if you don't want to babysit these machines for hanging failures)
def train():
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])

Robustness:
model = torch.nn.Linear(10, 10).to(local_rank)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters())

- If you want to run a complex, _modular_ workflow in one script
- no worries about memory leaks or OS failures
- don't parallelize your entire script: just the functions you want

Features:
optimizer.zero_grad()
outputs = ddp_model(torch.randn(5, 10))
labels = torch.randn(5, 10).to(local_rank)
torch.nn.functional.mse_loss(outputs, labels).backward()
optimizer.step()

- Our launch utility is super _Pythonic_
- If you want to run distributed PyTorch functions from Python Notebooks.
- Automatic integration with SLURM
if rank == 0:
return model
```

Why not?
You could also use `transformers.Trainer` (or similar) to automatically handle all the multi-GPU / DDP code above.
</details>

- We don't support fault tolerance via torch elastic. Probably only useful if you are using 1000 GPUs. Maybe someone can make a PR.

## More complicated example
```python
import torchrunx as trx

We could also launch multiple functions, with different GPUs:
if __name__ == "__main__":
trained_model = trx.launch(
func=train,
hostnames=["localhost", "other_node"],
workers_per_host=2 # num. GPUs
).value(rank=0) # get returned object

```python
def train_model(model, dataset):
trained_model = train(model, dataset)
torch.save(trained_model.state_dict(), "model.pth")
```

if int(os.environ["RANK"]) == 0:
torch.save(learned_model, 'model.pt')
return 'model.pt'
### [Full API](https://torchrunx.readthedocs.io/stable/api.html)
### [Advanced Usage](https://torchrunx.readthedocs.io/stable/advanced.html)

return None
## Why should I use this?

def test_model(model_path, test_dataset):
model = torch.load(model_path)
accuracy = inference(model, test_dataset)
return accuracy
```
Whether you have 1 GPU, 8 GPUs, or 8 machines.

```python
import torchrunx as trx
__Features:__

model_path = trx.launch(
func=train_model,
func_kwargs={'model': my_model, 'training_dataset': mnist_train},
hostnames=["localhost", "other_node"],
workers_per_host=2
)["localhost"][0] # return from rank 0 (first worker on "localhost")
- Our [`launch()`](https://torchrunx.readthedocs.io/stable/api.html#torchrunx.launch) utility is super _Pythonic_
- Return objects from your workers
- Run `python script.py` instead of `torchrun script.py`
- Launch multi-node functions, even from Python Notebooks
- Fine-grained control over logging, environment variables, exception handling, etc.
- Automatic integration with SLURM (see the sketch below, after this section)

__Robustness:__

- If you want to run a complex, _modular_ workflow in __one__ script
- don't parallelize your entire script: just the functions you want!
- no worries about memory leaks or OS failures

accuracy = trx.launch(
func=test_model,
func_kwargs={'model': learned_model, 'test_dataset': mnist_test},
hostnames=["localhost"],
workers_per_host=1
)["localhost"][0]
__Convenience:__

print(f'Accuracy: {accuracy}')
```
- If you don't want to:
- set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself
- manually SSH into every machine and `torchrun --master-ip --master-port ...`, babysit failed processes, etc.
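
The "Automatic integration with SLURM" bullet above deserves a quick illustration. This is a hedged sketch, not code from this PR: `slurm_hosts()` and `slurm_workers()` appear in the project's earlier API reference and may since have been superseded by the [Environment Detection](https://torchrunx.readthedocs.io/stable/advanced.html) mechanism in the Advanced docs, so treat the helper names as assumptions.

```python
import os

import torchrunx as trx

def train():  # stand-in for the training function from the demo above
    return f"ran on global rank {os.environ['RANK']}"

if __name__ == "__main__":
    result = trx.launch(
        func=train,
        hostnames=trx.slurm_hosts(),           # hostnames in the current SLURM allocation (assumed helper)
        workers_per_host=trx.slurm_workers(),  # workers (e.g. GPUs) per node (assumed helper)
    )
    print(result.value(rank=0))
```
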
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,4 +1,4 @@
sphinx==6.2.1
furo
myst-parser
sphinx-toolbox
sphinx-toolbox
40 changes: 29 additions & 11 deletions docs/source/advanced.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,33 @@
Advanced Usage
==============

Multiple functions in one script
--------------------------------

We could also launch multiple functions (e.g. train on many GPUs, test on one GPU):

.. code-block:: python

    import torchrunx as trx

    trained_model = trx.launch(
        func=train,
        hostnames=["node1", "node2"],
        workers_per_host=8
    ).value(rank=0)

    accuracy = trx.launch(
        func=test,
        func_kwargs={'model': model},
        hostnames=["localhost"],
        workers_per_host=1
    ).value(rank=0)

    print(f'Accuracy: {accuracy}')

``trx.launch()`` is self-cleaning: all processes are terminated (and the used memory is completely released) after each invocation.


Environment Detection
---------------------

@@ -61,18 +88,9 @@ For example, the `python ... --help` command will then result in:
Custom Logging
--------------

Logs are generated at the worker and agent level, and are specified to :mod:`torchrunx.launch` via the ``log_spec`` argument. By default, a :mod:`torchrunx.DefaultLogSpec` is instantiated, causing logs at the worker and agent levels to be logged to files under ``'./logs'``, and the rank 0 worker's output streams are streamed to the launcher ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and workers have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.

Custom logging classes can be subclassed from the :mod:`torchrunx.LogSpec` class. Any subclass must have a ``get_map`` method returning a dictionary mapping logger names to lists of :mod:`logging.Handler` objects, in order to be passed to :mod:`torchrunx.launch`. The logger names are of the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The :mod:`torchrunx.DefaultLogSpec` maps all the loggers to :mod:`logging.Filehandler` object pointing to the files mentioned in the previous paragraph. It additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes logs the launcher's ``stdout`` stream.

.. autoclass:: torchrunx.LogSpec
:members:

.. autoclass:: torchrunx.DefaultLogSpec
:members:
Logs are generated at the worker and agent level and are configured via the ``log_spec`` argument of :mod:`torchrunx.launch`. If no spec is provided, a default one is instantiated, causing worker- and agent-level logs to be written to files under ``'./logs'`` and the rank 0 worker's output streams to be streamed to the launcher's ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and worker logs have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.

..
TODO: example log structure
Custom logging behavior can be implemented by subclassing the ``LogSpec`` base class. Any subclass must have a ``get_map`` method returning a dictionary mapping logger names to lists of :mod:`logging.Handler` objects, in order to be passed to :mod:`torchrunx.launch`. The logger names are of the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The default spec maps all of these loggers to :mod:`logging.FileHandler` objects pointing to the files mentioned in the previous paragraph; it additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes its logs to the launcher's ``stdout`` stream.
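
As a point of reference, here is a minimal sketch of such a subclass. Only ``get_map``, the logger-name format, and the ``log_spec`` argument come from this section; the import path and the constructor signature are assumptions.

.. code-block:: python

    import logging

    import torchrunx as trx

    class StreamToLauncherLogSpec(trx.LogSpec):  # base class as named in this section; import path assumed
        """Route every agent and worker logger to the launcher's console."""

        def __init__(self, hostnames: list[str], workers_per_host: int):
            self.hostnames = hostnames
            self.workers_per_host = workers_per_host

        def get_map(self) -> dict[str, list[logging.Handler]]:
            handler = logging.StreamHandler()  # note: logs to stderr by default
            agent_loggers = {host: [handler] for host in self.hostnames}
            worker_loggers = {
                f"{host}[{rank}]": [handler]
                for host in self.hostnames
                for rank in range(self.workers_per_host)
            }
            return {**agent_loggers, **worker_loggers}

It would then be passed as, e.g., ``trx.launch(..., log_spec=StreamToLauncherLogSpec(["localhost"], 2))``.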

Propagating Exceptions
----------------------
7 changes: 3 additions & 4 deletions docs/source/api.rst
@@ -1,8 +1,7 @@
API
=============

..
TODO: examples, environmental variables available to workers (e.g. RANK, LOCAL_RANK)
.. autofunction:: torchrunx.launch(func: Callable, ...)

.. automodule:: torchrunx
:members: launch, slurm_hosts, slurm_workers
.. autoclass:: torchrunx.LaunchResult
:members:
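
A hedged usage sketch tying the two entries together: it mirrors the README demo, ``value(rank=0)`` is the only ``LaunchResult`` accessor used in these docs, and the worker function is purely illustrative.

.. code-block:: python

    import os

    import torchrunx as trx

    def hello() -> str:  # illustrative worker function
        return f"hello from global rank {os.environ['RANK']}"

    if __name__ == "__main__":
        result = trx.launch(  # returns a LaunchResult
            func=hello,
            hostnames=["localhost"],
            workers_per_host=2,
        )
        print(result.value(rank=0))  # return value of the global rank 0 worker
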
7 changes: 6 additions & 1 deletion docs/source/conf.py
@@ -20,8 +20,13 @@
'myst_parser',
'sphinx_toolbox.sidebar_links',
'sphinx_toolbox.github',
'sphinx.ext.autodoc.typehints',
#"sphinx_autodoc_typehints",
]

autodoc_typehints = "both"
#typehints_defaults = 'comma'

github_username = 'apoorvkh'
github_repository = 'torchrunx'

@@ -43,4 +48,4 @@
epub_show_urls = 'footnote'

# code block syntax highlighting
#pygments_style = 'sphinx'
#pygments_style = 'sphinx'
11 changes: 3 additions & 8 deletions docs/source/index.rst
@@ -1,14 +1,9 @@
Getting Started
===============

.. include:: ../../README.md
:parser: myst_parser.sphinx_

Contents
--------

.. toctree::
:maxdepth: 2
:hidden:
:maxdepth: 1

api
advanced
@@ -17,4 +12,4 @@ Contents

.. sidebar-links::
:github:
:pypi: torchrunx
:pypi: torchrunx
4 changes: 2 additions & 2 deletions pixi.lock

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion src/torchrunx/__init__.py
@@ -1,9 +1,10 @@
from .launcher import Launcher, launch
from .launcher import Launcher, LaunchResult, launch
from .logging_utils import add_filter_to_handler, file_handler, stream_handler

__all__ = [
"Launcher",
"launch",
"LaunchResult",
"add_filter_to_handler",
"file_handler",
"stream_handler",
2 changes: 1 addition & 1 deletion src/torchrunx/agent.py
@@ -73,7 +73,7 @@ def entrypoint(serialized_worker_args: SerializedWorkerArgs) -> Any | WorkerExce
os.environ["WORLD_SIZE"] = str(worker_args.world_size)
os.environ["MASTER_ADDR"] = worker_args.main_agent_hostname
os.environ["MASTER_PORT"] = str(worker_args.main_agent_port)

if worker_args.backend is not None:
backend = worker_args.backend
if backend == "auto":
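
For context: because the agent exports these variables, a worker function can read the standard `torch.distributed` settings directly. A minimal, hedged illustration (the function is ours; `RANK` and `LOCAL_RANK` appear in the README demo, while `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` are set in the agent code above):

```python
import os

def show_distributed_env() -> dict[str, str]:
    """Illustrative worker: return the torch.distributed variables torchrunx sets."""
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    return {k: os.environ[k] for k in keys}
```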