
More changes to docs #73


Merged
merged 35 commits on Oct 30, 2024
3bd7f1b
Update launcher.py
pmcurtin Oct 19, 2024
2ce3576
Merge branch 'main' into docs-2
apoorvkh Oct 19, 2024
2731d31
Merge branch 'main' into docs-2
apoorvkh Oct 20, 2024
0b9c1df
moved log_handlers into .run()
apoorvkh Oct 20, 2024
af8c829
update contributing
apoorvkh Oct 20, 2024
4ac384e
add tyro, remove setuptools from extras
apoorvkh Oct 20, 2024
cbf40b9
enabled linting for docs; clarified public/private functions
apoorvkh Oct 20, 2024
76aa20f
docs for utils.py
apoorvkh Oct 20, 2024
de93aaf
docs for logging_utils
apoorvkh Oct 20, 2024
e4977fd
Merge branch 'docs-2' of github.com:apoorvkh/torchrunx into worker-ex…
apoorvkh Oct 20, 2024
e697257
advanced docs
apoorvkh Oct 20, 2024
748c2b7
adding napoleon for google docs
apoorvkh Oct 21, 2024
24f4a98
linkcode
apoorvkh Oct 21, 2024
cb6620c
update linkcode
apoorvkh Oct 21, 2024
3eb297c
try again
apoorvkh Oct 21, 2024
e609f54
fix?
apoorvkh Oct 21, 2024
e88e320
now linkcode works
apoorvkh Oct 21, 2024
bef8b28
updates
apoorvkh Oct 21, 2024
86bb67b
automethod run for launcher
apoorvkh Oct 21, 2024
d80d822
maximum_signature_line_length
apoorvkh Oct 21, 2024
9950e96
switch to members?
apoorvkh Oct 21, 2024
8276abc
Merge branch 'main' of github.com:apoorvkh/torchrunx into docs-2
apoorvkh Oct 29, 2024
f335140
created utils/
apoorvkh Oct 29, 2024
0b5e316
moved functions to worker.py
apoorvkh Oct 29, 2024
084061f
renamed to worker_entrypoint
apoorvkh Oct 29, 2024
6cc9311
completed docs for utils
apoorvkh Oct 29, 2024
490f2a8
more launcher docs
apoorvkh Oct 29, 2024
e54a533
more updates to docs
apoorvkh Oct 29, 2024
455c3f3
switched LaunchResult to get
apoorvkh Oct 29, 2024
f967218
bump hash in pixi lock
apoorvkh Oct 29, 2024
3a68eb6
removed overloading from LaunchResult
apoorvkh Oct 29, 2024
9e2d5f4
update all docs
apoorvkh Oct 30, 2024
a29212e
fix
apoorvkh Oct 30, 2024
7bf9222
small edits
apoorvkh Oct 30, 2024
122febc
how it works
apoorvkh Oct 30, 2024
16 changes: 15 additions & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,17 @@
# Contributing

We use the [`pixi`](https://pixi.sh) package manager. Simply [install `pixi`](https://pixi.sh/latest/#installation) and run `pixi shell` in this repository. We use `ruff` for linting and formatting, `pyright` for static type checking, and `pytest` for testing. We build for `PyPI`. Our release pipeline is powered by Github Actions.
We use the [`pixi`](https://pixi.sh) package manager. Simply [install `pixi`](https://pixi.sh/latest/#installation) and run `pixi shell` in this repository to activate the environment.

We use `ruff check` for linting, `ruff format` for formatting, `pyright` for static type checking, and `pytest` for testing.

We build wheels with `python -m build` and upload them to [PyPI](https://pypi.org/project/torchrunx) with [twine](https://twine.readthedocs.io). Our release pipeline is powered by GitHub Actions.

## Pull Requests

Make a pull request with your changes on GitHub and we'll try to look at it soon! If addressing a specific issue, mention it in the PR and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in __torchrunx__.

## Testing

`tests/` contains `pytest`-style tests for validating that code changes do not break the core functionality of our library.

At the moment, we run `pytest tests/test_ci.py` (i.e. simple, single-node, CPU-only tests) in our GitHub Actions CI pipeline (`.github/workflows/release.yml`). You can manually run our more involved tests (on GPUs, and across multiple machines via SLURM) on your own hardware.
15 changes: 8 additions & 7 deletions README.md
@@ -56,12 +56,13 @@ Here's a simple example where we "train" a model on two nodes (with 2 GPUs each)
import torchrunx as trx

if __name__ == "__main__":
trained_model = trx.launch(
result = trx.launch(
func=train,
hostnames=["localhost", "other_node"],
workers_per_host=2 # num. GPUs
).value(rank=0) # get returned object
workers_per_host=2 # number of GPUs
)

trained_model = result.rank(0)
torch.save(trained_model.state_dict(), "model.pth")
```

@@ -70,9 +71,9 @@ if __name__ == "__main__":

## Why should I use this?

Whether you have 1 GPU, 8 GPUs, or 8 machines.
Whether you have 1 GPU, 8 GPUs, or 8 machines:

__Features:__
__Features__

- Our [`launch()`](https://torchrunx.readthedocs.io/stable/api.html#torchrunx.launch) utility is super _Pythonic_
- Return objects from your workers
@@ -81,13 +82,13 @@ __Features:__
- Fine-grained control over logging, environment variables, exception handling, etc.
- Automatic integration with SLURM

__Robustness:__
__Robustness__

- If you want to run a complex, _modular_ workflow in __one__ script
- don't parallelize your entire script: just the functions you want!
- no worries about memory leaks or OS failures

__Convenience:__
__Convenience__

- If you don't want to:
- set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself
118 changes: 59 additions & 59 deletions docs/source/advanced.rst
@@ -14,101 +14,101 @@ We could also launch multiple functions (e.g. train on many GPUs, test on one GP
func=train,
hostnames=["node1", "node2"],
workers_per_host=8
).value(rank=0)
).rank(0)

accuracy = trx.launch(
func=test,
func_kwargs={'model': model},
func_args=(trained_model,),
hostnames=["localhost"],
workers_per_host=1
).value(rank=0)
).rank(0)

print(f'Accuracy: {accuracy}')

``trx.launch()`` is self-cleaning: all processes are terminated (and the used memory is completely released) after each invocation.

:mod:`torchrunx.launch` is self-cleaning: all processes are terminated (and the used memory is completely released) before the subsequent invocation.

Environment Detection
---------------------

By default, the `hostnames` or `workers_per_host` :mod:`torchrunx.launch` parameters are set to "auto". These parameters are populated via `SLURM`_ if a SLURM environment is automatically detected. Otherwise, `hostnames = ["localhost"]` and `workers_per_host` is set to the number of GPUs or CPUs (in order of precedence) available locally.

SLURM
+++++

If the `hostnames` or `workers_per_host` parameters are set to `"slurm"`, their values will be filled from the SLURM job. Passing `"slurm"` raises a `RuntimeError` if no SLURM allocation is detected from the environment.

``Launcher`` class
------------------
Launcher class
--------------

We provide the ``torchrunx.Launcher`` class as an alternative to ``torchrunx.launch``.
We provide the :mod:`torchrunx.Launcher` class as an alternative, object-oriented interface to :mod:`torchrunx.launch`.

.. autoclass:: torchrunx.Launcher
:members:
.. .. autofunction:: torchrunx.Launcher.run
:members:

CLI Support
+++++++++++
CLI integration
^^^^^^^^^^^^^^^

This allows **torchrunx** arguments to be more easily populated by CLI packages like `tyro <https://brentyi.github.io/tyro/>`_:
We can use :mod:`torchrunx.Launcher` to populate arguments from the CLI (e.g. with `tyro <https://brentyi.github.io/tyro/>`_):

.. code:: python

import torchrunx as trx
import tyro

def distributed_function():
print("Hello world!")
pass

if __name__ == "__main__":
launcher = tyro.cli(trx.Launcher)
launcher.run(distributed_function, {})
launcher.run(distributed_function)

For example, the `python ... --help` command will then result in:
``python ... --help`` then results in:

.. code:: bash

╭─ options ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ -h, --help show this help message and exit │
│ --hostnames {[STR [STR ...]]}|{auto,slurm} │
│ (default: auto) │
│ --workers-per-host INT|{[INT [INT ...]]}|{auto,slurm} │
│ (default: auto) │
│ --ssh-config-file {None}|STR|PATH │
│ (default: None) │
│ --backend {None,nccl,gloo,mpi,ucc,auto} │
│ (default: auto) │
│ --log-handlers {fixed} (fixed to: a u t o) │
│ --env-vars STR (default: PATH LD_LIBRARY LIBRARY_PATH 'PYTHON*' 'CUDA*' 'TORCH*' 'PYTORCH*' 'NCCL*') │
│ --env-file {None}|STR|PATH │
│ (default: None) │
│ --timeout INT (default: 600) │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Custom Logging
--------------
╭─ options ─────────────────────────────────────────────╮
│ -h, --help show this help message and exit │
│ --hostnames {[STR [STR ...]]}|{auto,slurm} │
│ (default: auto) │
│ --workers-per-host INT|{[INT [INT ...]]}|{auto,slurm} │
│ (default: auto) │
│ --ssh-config-file {None}|STR|PATH │
│ (default: None) │
│ --backend {None,nccl,gloo,mpi,ucc,auto} │
│ (default: auto) │
│ --timeout INT (default: 600) │
│ --default-env-vars [STR [STR ...]] │
│ (default: PATH LD_LIBRARY ...) │
│ --extra-env-vars [STR [STR ...]] │
│ (default: ) │
│ --env-file {None}|STR|PATH │
│ (default: None) │
╰───────────────────────────────────────────────────────╯

SLURM integration
-----------------

By default, the ``hostnames`` and ``workers_per_host`` arguments are populated from the current SLURM allocation. If no allocation is detected, we assume 1 machine (``localhost``) with N workers (the number of GPUs or CPUs, in that order of precedence).
A ``RuntimeError`` is raised if ``hostnames="slurm"`` or ``workers_per_host="slurm"`` but no allocation is detected.
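The fallback logic described above can be sketched with the standard library alone. This is a simplified illustration, not torchrunx's actual implementation: real SLURM nodelists are often compressed (e.g. ``node[1-4]``) and are typically expanded via ``scontrol show hostnames``, which this sketch does not handle.

```python
import os

def detect_hostnames() -> list[str]:
    """Simplified sketch of SLURM-based hostname detection.

    Only handles plain comma-separated nodelists; compressed forms
    like "node[1-4]" would need `scontrol show hostnames`.
    """
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if nodelist is None:
        # No allocation detected: fall back to 1 machine (localhost).
        return ["localhost"]
    return nodelist.split(",")
```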

Propagating exceptions
----------------------

Logs are generated at the worker and agent level, and are specified to :mod:`torchrunx.launch` via the ``log_spec`` argument. By default, a is instantiated, causing logs at the worker and agent levels to be logged to files under ``'./logs'``, and the rank 0 worker's output streams are streamed to the launcher ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and workers have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.
Exceptions that are raised in workers will be raised by the launcher process.

Custom logging classes can be subclassed from the class. Any subclass must have a ``get_map`` method returning a dictionary mapping logger names to lists of :mod:`logging.Handler` objects, in order to be passed to :mod:`torchrunx.launch`. The logger names are of the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The maps all the loggers to :mod:`logging.Filehandler` object pointing to the files mentioned in the previous paragraph. It additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes logs the launcher's ``stdout`` stream.
A :mod:`torchrunx.AgentFailedError` or :mod:`torchrunx.WorkerFailedError` will be raised if any agent or worker dies unexpectedly (e.g. if sent a signal from the OS, due to segmentation faults or OOM).
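The re-raising behavior can be illustrated with a stdlib analogue — threads standing in for torchrunx's separate worker processes. Calling ``Future.result()`` re-raises the worker's exception in the caller, which is the same pattern the launcher follows.

```python
from concurrent.futures import ThreadPoolExecutor

def worker():
    raise ValueError("worker failed")

# The "launcher" collects results; .result() re-raises any
# exception from the worker in the caller's process, where it
# can be caught with an ordinary try/except.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(worker)
    try:
        future.result()
    except ValueError as e:
        caught = str(e)

print(caught)  # → worker failed
```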

Propagating Exceptions
----------------------
Environment variables
---------------------

Exceptions that are raised in Workers will be raised in the Launcher process and can be caught by wrapping :mod:`torchrunx.launch` in a try-except clause.
Environment variables in the launcher process that match the ``default_env_vars`` argument are automatically copied to agents and workers. We set useful defaults for Python and PyTorch. Environment variable names are pattern-matched against this list using ``fnmatch``.

If a worker is killed by the operating system (e.g. due to Segmentation Fault or SIGKILL by running out of memory), the Launcher process raises a RuntimeError.
``default_env_vars`` can be overridden if desired. This list can be augmented using ``extra_env_vars``. Additional environment variables (and more custom bash logic) can be included via the ``env_file`` argument. Our agents ``source`` this file.

Environment Variables
---------------------
We also set the following environment variables in each worker: ``LOCAL_RANK``, ``RANK``, ``LOCAL_WORLD_SIZE``, ``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``.
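The ``fnmatch`` matching described above amounts to the following sketch. The pattern list here is illustrative, not the exact torchrunx defaults:

```python
from fnmatch import fnmatch

# Illustrative patterns (a subset of the defaults mentioned above).
default_env_vars = ["PATH", "PYTHON*", "CUDA*", "NCCL*"]

def should_copy(name: str, patterns: list[str]) -> bool:
    # An env var is copied if its name matches any pattern.
    return any(fnmatch(name, p) for p in patterns)

launcher_env = {"PATH": "/usr/bin", "CUDA_HOME": "/usr/local/cuda", "HOME": "/root"}
copied = {k: v for k, v in launcher_env.items() if should_copy(k, default_env_vars)}
print(sorted(copied))  # → ['CUDA_HOME', 'PATH']
```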

Custom logging
--------------

We forward all logs (i.e. from :mod:`logging` and :mod:`sys.stdout`/:mod:`sys.stderr`) from workers and agents to the launcher. By default, the logs from the first agent and its first worker are printed into the launcher's ``stdout`` stream. Logs from all agents and workers are written to files in ``$TORCHRUNX_LOG_DIR`` (default: ``./torchrunx_logs``) and are named by timestamp, hostname, and local_rank.

:mod:`logging.Handler` objects can be provided via the ``log_handlers`` argument to provide further customization (mapping specific agents/workers to custom output streams).
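For instance, a custom handler could be built with the stdlib alone — a sketch; passing it via ``log_handlers`` (as in the comment) is assumed from the description above:

```python
import logging
import sys

# Build a custom handler: write forwarded logs to stderr with a
# "<logger name>: <message>" format. Handlers from the stdlib
# (StreamHandler, FileHandler, etc.) all work here.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))

# This list could then be passed as e.g.
#   trx.launch(..., log_handlers=log_handlers)   # illustrative usage
log_handlers = [handler]
```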

The :mod:`torchrunx.launch` ``env_vars`` argument allows the user to specify which environmental variables should be copied to the agents from the launcher environment. By default, it attempts to copy variables related to Python and important packages/technologies that **torchrunx** uses such as PyTorch, NCCL, CUDA, and more. Strings provided are matched with the names of environmental variables using ``fnmatch`` - standard UNIX filename pattern matching. The variables are inserted into the agent environments, and then copied to workers' environments when they are spawned.
We provide some utilities to help:

:mod:`torchrunx.launch` also accepts the ``env_file`` argument, which is designed to expose more advanced environmental configuration to the user. When a file is provided as this argument, the launcher will source the file on each node before executing the agent. This allows for custom bash scripts to be provided in the environmental variables, and allows for node-specific environmental variables to be set.
.. autofunction:: torchrunx.file_handler

..
TODO: example env_file
.. autofunction:: torchrunx.stream_handler

Support for Numpy >= 2.0
------------------------
only supported if `torch>=2.3`
.. autofunction:: torchrunx.add_filter_to_handler
4 changes: 3 additions & 1 deletion docs/source/api.rst
@@ -6,4 +6,6 @@ API
.. autoclass:: torchrunx.LaunchResult
:members:

.. autoclass:: torchrunx.AgentKilledError
.. autoclass:: torchrunx.AgentFailedError

.. autoclass:: torchrunx.WorkerFailedError
107 changes: 76 additions & 31 deletions docs/source/conf.py
@@ -1,51 +1,96 @@
import os
import sys

sys.path.insert(0, os.path.abspath('../../src'))
sys.path.insert(0, os.path.abspath("../../src"))

# Configuration file for the Sphinx documentation builder.

# -- Project information

project = 'torchrunx'

# -- General configuration
project = "torchrunx"
github_username = "apoorvkh"
github_repository = "torchrunx"
html_theme = "furo"

extensions = [
'sphinx.ext.duration',
'sphinx.ext.doctest',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.intersphinx',
'myst_parser',
'sphinx_toolbox.sidebar_links',
'sphinx_toolbox.github',
'sphinx.ext.autodoc.typehints',
#"sphinx_autodoc_typehints",
"sphinx.ext.duration",
"sphinx.ext.autodoc",
"sphinx.ext.intersphinx",
"myst_parser",
"sphinx_toolbox.sidebar_links",
"sphinx_toolbox.github",
"sphinx.ext.napoleon",
"sphinx.ext.autodoc.typehints",
"sphinx.ext.linkcode",
]

autodoc_mock_imports = ["torch", "fabric", "cloudpickle", "sys", "logging", "typing_extensions"]
autodoc_typehints = "both"
#typehints_defaults = 'comma'

github_username = 'apoorvkh'
github_repository = 'torchrunx'
autodoc_typehints_description_target = "documented_params"

autodoc_mock_imports = ['torch', 'fabric', 'cloudpickle', 'typing_extensions']
maximum_signature_line_length = 100

intersphinx_mapping = {
'python': ('https://docs.python.org/3/', None),
'sphinx': ('https://www.sphinx-doc.org/en/master/', None),
"python": ("https://docs.python.org/3/", None),
}
intersphinx_disabled_domains = ['std']
intersphinx_disabled_domains = ["std"]


## Link code to Github source
# From: https://github.com/scikit-learn/scikit-learn/blob/main/doc/sphinxext/github_link.py

import inspect
import os
import subprocess
import sys
from operator import attrgetter

package = project

try:
revision = (
subprocess.check_output("git rev-parse --short HEAD".split()).strip().decode("utf-8")
)
except (subprocess.CalledProcessError, OSError):
print("Failed to execute git to get revision")
revision = None

url_fmt = (
f"https://github.com/{github_username}/{github_repository}/"
"blob/{revision}/src/{package}/{path}#L{lineno}"
)

def linkcode_resolve(domain, info):
if revision is None:
return
if domain not in ("py", "pyx"):
return
if not info.get("module") or not info.get("fullname"):
return

templates_path = ['_templates']
class_name = info["fullname"].split(".")[0]
module = __import__(info["module"], fromlist=[class_name])
obj = attrgetter(info["fullname"])(module)

# -- Options for HTML output
# Unwrap the object to get the correct source
# file in case that is wrapped by a decorator
obj = inspect.unwrap(obj)

html_theme = 'furo'
try:
fn = inspect.getsourcefile(obj)
except Exception:
fn = None
if not fn:
try:
fn = inspect.getsourcefile(sys.modules[obj.__module__])
except Exception:
fn = None
if not fn:
return

# -- Options for EPUB output
epub_show_urls = 'footnote'
fn = os.path.relpath(fn, start=os.path.dirname(__import__(package).__file__))
try:
lineno = inspect.getsourcelines(obj)[1]
except Exception:
lineno = ""
return url_fmt.format(revision=revision, package=package, path=fn, lineno=lineno)

# code block syntax highlighting
#pygments_style = 'sphinx'
## End of "link code to Github source"
18 changes: 0 additions & 18 deletions docs/source/contributing.rst
@@ -1,20 +1,2 @@
Contributing
============

.. include:: ../../CONTRIBUTING.md
:parser: myst_parser.sphinx_

.. Development environment
.. -----------------------

.. Ensure you have the latest development environment installed. After cloning our repository, `install pixi <https://pixi.sh/latest/#installation>`_ and run ``pixi shell`` in the repo's root directory. Additionally, we use `ruff <https://github.com/astral-sh/ruff>`_ for linting and formatting, `pyright <https://github.com/microsoft/pyright>`_ for type checking, and ``pytest`` for testing.

.. Testing
.. -------

.. ``tests/`` contains ``pytest``-style tests for validating that code changes do not break the core functionality of **torchrunx**. At the moment, we have a few simple CI tests powered by Github action, which are limited to single-agent CPU-only tests due to Github's infrastructure.

.. Contributing
.. ------------

.. Make a pull request with your changes and we'll try to look at soon! If addressing a specific issue, mention it in the PR, and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.