docs (draft 1) #70


Merged: 56 commits, Oct 19, 2024
Commits (56)
5d1ca04
Update README.md
pmcurtin Sep 30, 2024
c2a46a0
Update launcher.py
pmcurtin Sep 30, 2024
8121683
Update api.rst
pmcurtin Sep 30, 2024
44bd676
Update api.rst
pmcurtin Sep 30, 2024
8552359
Update api.rst
pmcurtin Sep 30, 2024
2a1061b
Update api.rst
pmcurtin Sep 30, 2024
15411b5
Update api.rst
pmcurtin Sep 30, 2024
de0b140
Update api.rst
pmcurtin Sep 30, 2024
9419554
Update __init__.py
pmcurtin Sep 30, 2024
dd10c49
Update api.rst
pmcurtin Sep 30, 2024
e974755
Update advanced.rst
pmcurtin Sep 30, 2024
e6faa44
Merge branch 'main' into docs
apoorvkh Sep 30, 2024
e3c0665
Update README.md
apoorvkh Oct 3, 2024
c48edaf
Update index.rst
apoorvkh Oct 3, 2024
66933e7
Update index.rst
apoorvkh Oct 3, 2024
e2536fc
Update index.rst
apoorvkh Oct 3, 2024
f47b492
Update requirements.txt
pmcurtin Oct 4, 2024
f24f936
Update conf.py
pmcurtin Oct 4, 2024
fec0f40
Update conf.py
pmcurtin Oct 4, 2024
fb35066
Update launcher.py
pmcurtin Oct 4, 2024
9ecd997
Update conf.py
pmcurtin Oct 4, 2024
1085c63
Update conf.py
pmcurtin Oct 4, 2024
74be93c
remove requirement
pmcurtin Oct 4, 2024
afd9808
Update launcher.py
pmcurtin Oct 4, 2024
4b82752
Update launcher.py
pmcurtin Oct 4, 2024
7783bc9
Update api.rst
pmcurtin Oct 18, 2024
5139db2
Update api.rst
pmcurtin Oct 18, 2024
fedbd30
Update api.rst
pmcurtin Oct 18, 2024
e507b78
Update launcher.py
pmcurtin Oct 18, 2024
3d71997
try removing launch types
pmcurtin Oct 18, 2024
662e899
touch up launch formatting
pmcurtin Oct 18, 2024
1147a71
Update launcher.py
pmcurtin Oct 18, 2024
bf7965a
touch up example in readme
pmcurtin Oct 18, 2024
2a1329e
fix first example
pmcurtin Oct 18, 2024
6768ddb
Update launch docs
apoorvkh Oct 18, 2024
1fd6ddb
Update README.md
pmcurtin Oct 18, 2024
c8635d0
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 18, 2024
0452f11
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 18, 2024
6357e85
Update README.md
pmcurtin Oct 18, 2024
fc6e3cb
fixing quotes
apoorvkh Oct 18, 2024
d31ba90
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 18, 2024
fe57ba2
remove return type
apoorvkh Oct 18, 2024
6782718
moved complicated example to advanced
apoorvkh Oct 18, 2024
f1d2ec2
test
apoorvkh Oct 18, 2024
dee60b3
update readme example
apoorvkh Oct 19, 2024
3a55e79
update readme
apoorvkh Oct 19, 2024
c0ea355
update readme
apoorvkh Oct 19, 2024
9332adf
readme updates
apoorvkh Oct 19, 2024
bc30b7d
Merge branch 'main' into docs
apoorvkh Oct 19, 2024
82f576f
test
apoorvkh Oct 19, 2024
ceb04fc
Merge branch 'docs' of github.com:apoorvkh/torchrunx into docs
apoorvkh Oct 19, 2024
cb182ed
misc readthedocs
apoorvkh Oct 19, 2024
c6b7e99
docs fix
apoorvkh Oct 19, 2024
3f4e025
fix incorrect imports
apoorvkh Oct 19, 2024
15d0c0e
fix readthedocs
apoorvkh Oct 19, 2024
ebf1a14
more docs
apoorvkh Oct 19, 2024
131 changes: 54 additions & 77 deletions README.md
@@ -1,6 +1,7 @@
# torchrunx 🔥

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/torchrunx)](https://github.com/apoorvkh/torchrunx/blob/main/pyproject.toml)
[![PyTorch Version](https://img.shields.io/badge/torch-%3E%3D2.0-orange)](https://github.com/pytorch/pytorch)
[![PyPI - Version](https://img.shields.io/pypi/v/torchrunx)](https://pypi.org/project/torchrunx/)
![Tests](https://img.shields.io/github/actions/workflow/status/apoorvkh/torchrunx/.github%2Fworkflows%2Fmain.yml)
[![Docs](https://readthedocs.org/projects/torchrunx/badge/?version=stable)](https://torchrunx.readthedocs.io)
@@ -16,102 +17,78 @@ By [Apoorv Khandelwal](http://apoorvkh.com) and [Peter Curtin](https://github.co
pip install torchrunx
```

Requires: Linux, Python >= 3.8.1, PyTorch >= 2.0
**Requires:** Linux (with shared filesystem & SSH access if using multiple machines)

Shared filesystem & SSH access if using multiple machines
## Demo

## Minimal example
Here's a simple example where we "train" a model on two nodes (with 2 GPUs each).

Here's a simple example where we distribute `distributed_function` to two hosts (with 2 GPUs each):
<details>
<summary>Training code</summary>

```python
def train_model(model, dataset):
trained_model = train(model, dataset)

if int(os.environ["RANK"]) == 0:
torch.save(learned_model, 'model.pt')
return 'model.pt'

return None
```

```python
import torchrunx as trx

model_path = trx.launch(
func=train_model,
func_kwargs={'model': my_model, 'training_dataset': mnist_train},
hostnames=["localhost", "other_node"],
workers_per_host=2
)["localhost"][0] # return from rank 0 (first worker on "localhost")
```

## Why should I use this?

[`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) is a hammer. `torchrunx` is a chisel.

Whether you have 1 GPU, 8 GPUs, or 8 machines:

Convenience:
```python
import os
import torch

- If you don't want to set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself
- If you want to run `python myscript.py` instead of `torchrun myscript.py`
- If you don't want to manually SSH and run `torchrun --master-ip --master-port ...` on every machine (and if you don't want to babysit these machines for hanging failures)
def train():
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])

Robustness:
model = torch.nn.Linear(10, 10).to(local_rank)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters())

- If you want to run a complex, _modular_ workflow in one script
- no worries about memory leaks or OS failures
- don't parallelize your entire script: just the functions you want

Features:
optimizer.zero_grad()
outputs = ddp_model(torch.randn(5, 10))
labels = torch.randn(5, 10).to(local_rank)
torch.nn.functional.mse_loss(outputs, labels).backward()
optimizer.step()

- Our launch utility is super _Pythonic_
- If you want to run distributed PyTorch functions from Python Notebooks.
- Automatic integration with SLURM
if rank == 0:
return model
```

Why not?
You could also use `transformers.Trainer` (or similar) to automatically handle all the multi-GPU / DDP code above.
</details>

- We don't support fault tolerance via torch elastic. Probably only useful if you are using 1000 GPUs. Maybe someone can make a PR.

## More complicated example
```python
import torchrunx as trx

We could also launch multiple functions, with different GPUs:
if __name__ == "__main__":
trained_model = trx.launch(
func=train,
hostnames=["localhost", "other_node"],
workers_per_host=2 # num. GPUs
).value(rank=0) # get returned object

```python
def train_model(model, dataset):
trained_model = train(model, dataset)
torch.save(trained_model.state_dict(), "model.pth")
```

if int(os.environ["RANK"]) == 0:
torch.save(learned_model, 'model.pt')
return 'model.pt'
### [Full API](https://torchrunx.readthedocs.io/stable/api.html)
### [Advanced Usage](https://torchrunx.readthedocs.io/stable/advanced.html)

return None
## Why should I use this?

def test_model(model_path, test_dataset):
model = torch.load(model_path)
accuracy = inference(model, test_dataset)
return accuracy
```
Whether you have 1 GPU, 8 GPUs, or 8 machines.

```python
import torchrunx as trx
__Features:__

model_path = trx.launch(
func=train_model,
func_kwargs={'model': my_model, 'training_dataset': mnist_train},
hostnames=["localhost", "other_node"],
workers_per_host=2
)["localhost"][0] # return from rank 0 (first worker on "localhost")
- Our [`launch()`](https://torchrunx.readthedocs.io/stable/api.html#torchrunx.launch) utility is super _Pythonic_
- Return objects from your workers
- Run `python script.py` instead of `torchrun script.py`
- Launch multi-node functions, even from Python Notebooks
- Fine-grained control over logging, environment variables, exception handling, etc.
- Automatic integration with SLURM (see the sketch below, after this section)

__Robustness:__

- If you want to run a complex, _modular_ workflow in __one__ script
- don't parallelize your entire script: just the functions you want!
- no worries about memory leaks or OS failures

accuracy = trx.launch(
func=test_model,
func_kwargs={'model': learned_model, 'test_dataset': mnist_test},
hostnames=["localhost"],
workers_per_host=1
)["localhost"][0]
__Convenience:__

print(f'Accuracy: {accuracy}')
```
- If you don't want to:
- set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself
- manually SSH into every machine and `torchrun --master-ip --master-port ...`, babysit failed processes, etc.
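
The "Automatic integration with SLURM" bullet above deserves a quick illustration. This is a hedged sketch, not code from this PR: `slurm_hosts()` and `slurm_workers()` appear in the project's earlier API reference and may since have been superseded by the [Environment Detection](https://torchrunx.readthedocs.io/stable/advanced.html) mechanism in the Advanced docs, so treat the helper names as assumptions.

```python
import os

import torchrunx as trx

def train():  # stand-in for the training function from the demo above
    return f"ran on global rank {os.environ['RANK']}"

if __name__ == "__main__":
    result = trx.launch(
        func=train,
        hostnames=trx.slurm_hosts(),           # hostnames in the current SLURM allocation (assumed helper)
        workers_per_host=trx.slurm_workers(),  # workers (e.g. GPUs) per node (assumed helper)
    )
    print(result.value(rank=0))
```
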
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,4 +1,4 @@
sphinx==6.2.1
furo
myst-parser
sphinx-toolbox
sphinx-toolbox
40 changes: 29 additions & 11 deletions docs/source/advanced.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,33 @@
Advanced Usage
==============

Multiple functions in one script
--------------------------------

We could also launch multiple functions (e.g. train on many GPUs, test on one GPU):

.. code-block:: python

    import torchrunx as trx

    trained_model = trx.launch(
        func=train,
        hostnames=["node1", "node2"],
        workers_per_host=8
    ).value(rank=0)

    accuracy = trx.launch(
        func=test,
        func_kwargs={'model': model},
        hostnames=["localhost"],
        workers_per_host=1
    ).value(rank=0)

    print(f'Accuracy: {accuracy}')

``trx.launch()`` is self-cleaning: all processes are terminated (and the used memory is completely released) after each invocation.


Environment Detection
---------------------

@@ -61,18 +88,9 @@ For example, the `python ... --help` command will then result in:
Custom Logging
--------------

Logs are generated at the worker and agent level, and are specified to :mod:`torchrunx.launch` via the ``log_spec`` argument. By default, a :mod:`torchrunx.DefaultLogSpec` is instantiated, causing logs at the worker and agent levels to be logged to files under ``'./logs'``, and the rank 0 worker's output streams are streamed to the launcher ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and workers have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.

Custom logging classes can be subclassed from the :mod:`torchrunx.LogSpec` class. Any subclass must have a ``get_map`` method returning a dictionary mapping logger names to lists of :mod:`logging.Handler` objects, in order to be passed to :mod:`torchrunx.launch`. The logger names are of the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The :mod:`torchrunx.DefaultLogSpec` maps all the loggers to :mod:`logging.Filehandler` object pointing to the files mentioned in the previous paragraph. It additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes logs the launcher's ``stdout`` stream.

.. autoclass:: torchrunx.LogSpec
:members:

.. autoclass:: torchrunx.DefaultLogSpec
:members:
Logs are generated at the worker and agent level and are configured via the ``log_spec`` argument of :mod:`torchrunx.launch`. If no spec is provided, a default one is instantiated, causing worker- and agent-level logs to be written to files under ``'./logs'`` and the rank 0 worker's output streams to be streamed to the launcher's ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and worker logs have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.

..
TODO: example log structure
Custom logging behavior can be implemented by subclassing the ``LogSpec`` base class. Any subclass must have a ``get_map`` method returning a dictionary mapping logger names to lists of :mod:`logging.Handler` objects, in order to be passed to :mod:`torchrunx.launch`. The logger names are of the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The default spec maps all of these loggers to :mod:`logging.FileHandler` objects pointing to the files mentioned in the previous paragraph; it additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes its logs to the launcher's ``stdout`` stream.
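
As a point of reference, here is a minimal sketch of such a subclass. Only ``get_map``, the logger-name format, and the ``log_spec`` argument come from this section; the import path and the constructor signature are assumptions.

.. code-block:: python

    import logging

    import torchrunx as trx

    class StreamToLauncherLogSpec(trx.LogSpec):  # base class as named in this section; import path assumed
        """Route every agent and worker logger to the launcher's console."""

        def __init__(self, hostnames: list[str], workers_per_host: int):
            self.hostnames = hostnames
            self.workers_per_host = workers_per_host

        def get_map(self) -> dict[str, list[logging.Handler]]:
            handler = logging.StreamHandler()  # note: logs to stderr by default
            agent_loggers = {host: [handler] for host in self.hostnames}
            worker_loggers = {
                f"{host}[{rank}]": [handler]
                for host in self.hostnames
                for rank in range(self.workers_per_host)
            }
            return {**agent_loggers, **worker_loggers}

It would then be passed as, e.g., ``trx.launch(..., log_spec=StreamToLauncherLogSpec(["localhost"], 2))``.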

Propagating Exceptions
----------------------
7 changes: 3 additions & 4 deletions docs/source/api.rst
@@ -1,8 +1,7 @@
API
=============

..
TODO: examples, environmental variables available to workers (e.g. RANK, LOCAL_RANK)
.. autofunction:: torchrunx.launch(func: Callable, ...)

.. automodule:: torchrunx
:members: launch, slurm_hosts, slurm_workers
.. autoclass:: torchrunx.LaunchResult
:members:
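
A hedged usage sketch tying the two entries together: it mirrors the README demo, ``value(rank=0)`` is the only ``LaunchResult`` accessor used in these docs, and the worker function is purely illustrative.

.. code-block:: python

    import os

    import torchrunx as trx

    def hello() -> str:  # illustrative worker function
        return f"hello from global rank {os.environ['RANK']}"

    if __name__ == "__main__":
        result = trx.launch(  # returns a LaunchResult
            func=hello,
            hostnames=["localhost"],
            workers_per_host=2,
        )
        print(result.value(rank=0))  # return value of the global rank 0 worker
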
7 changes: 6 additions & 1 deletion docs/source/conf.py
@@ -20,8 +20,13 @@
'myst_parser',
'sphinx_toolbox.sidebar_links',
'sphinx_toolbox.github',
'sphinx.ext.autodoc.typehints',
#"sphinx_autodoc_typehints",
]

autodoc_typehints = "both"
#typehints_defaults = 'comma'

github_username = 'apoorvkh'
github_repository = 'torchrunx'

@@ -43,4 +48,4 @@
epub_show_urls = 'footnote'

# code block syntax highlighting
#pygments_style = 'sphinx'
#pygments_style = 'sphinx'
11 changes: 3 additions & 8 deletions docs/source/index.rst
@@ -1,14 +1,9 @@
Getting Started
===============

.. include:: ../../README.md
:parser: myst_parser.sphinx_

Contents
--------

.. toctree::
:maxdepth: 2
:hidden:
:maxdepth: 1

api
advanced
@@ -17,4 +12,4 @@ Contents

.. sidebar-links::
:github:
:pypi: torchrunx
:pypi: torchrunx
4 changes: 2 additions & 2 deletions pixi.lock

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion src/torchrunx/__init__.py
@@ -1,9 +1,10 @@
from .launcher import Launcher, launch
from .launcher import Launcher, LaunchResult, launch
from .logging_utils import add_filter_to_handler, file_handler, stream_handler

__all__ = [
"Launcher",
"launch",
"LaunchResult",
"add_filter_to_handler",
"file_handler",
"stream_handler",
2 changes: 1 addition & 1 deletion src/torchrunx/agent.py
@@ -73,7 +73,7 @@ def entrypoint(serialized_worker_args: SerializedWorkerArgs) -> Any | WorkerExce
os.environ["WORLD_SIZE"] = str(worker_args.world_size)
os.environ["MASTER_ADDR"] = worker_args.main_agent_hostname
os.environ["MASTER_PORT"] = str(worker_args.main_agent_port)

if worker_args.backend is not None:
backend = worker_args.backend
if backend == "auto":
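
For context: because the agent exports these variables, a worker function can read the standard `torch.distributed` settings directly. A minimal, hedged illustration (the function is ours; `RANK` and `LOCAL_RANK` appear in the README demo, while `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` are set in the agent code above):

```python
import os

def show_distributed_env() -> dict[str, str]:
    """Illustrative worker: return the torch.distributed variables torchrunx sets."""
    keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    return {k: os.environ[k] for k in keys}
```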