We use the [`pixi`](https://pixi.sh) package manager. Simply [install `pixi`](https://pixi.sh/latest/#installation) and run `pixi shell` in this repository to activate the environment.
We use `ruff check` for linting, `ruff format` for formatting, `pyright` for static type checking, and `pytest` for testing.
We build wheels with `python -m build` and upload them to [PyPI](https://pypi.org/project/torchrunx) with [twine](https://twine.readthedocs.io). Our release pipeline is powered by GitHub Actions.
## Pull Requests
Make a pull request with your changes on GitHub and we'll try to look at it soon! If addressing a specific issue, mention it in the PR and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.
## Testing
`tests/` contains `pytest`-style tests for validating that code changes do not break the core functionality of our library.
At the moment, we run `pytest tests/test_ci.py` (i.e. simple single-node, CPU-only tests) in our GitHub Actions CI pipeline (`.github/workflows/release.yml`). You can manually run our more involved tests (e.g. on GPUs, or across multiple machines via SLURM) on your own hardware.
## docs/source/advanced.rst

We could also launch multiple functions (e.g. train on many GPUs, test on one GPU):

.. code:: python

    trained_model = trx.launch(
        func=train,
        hostnames=["node1", "node2"],
        workers_per_host=8
    ).rank(0)

    accuracy = trx.launch(
        func=test,
        func_args=(trained_model,),
        hostnames=["localhost"],
        workers_per_host=1
    ).rank(0)

    print(f'Accuracy: {accuracy}')

:mod:`torchrunx.launch` is self-cleaning: all processes are terminated (and the used memory is completely released) before the subsequent invocation.
Launcher class
--------------

We provide the :mod:`torchrunx.Launcher` class as an alias to :mod:`torchrunx.launch`.

.. autoclass:: torchrunx.Launcher
   :members:
CLI integration
^^^^^^^^^^^^^^^

We can use :mod:`torchrunx.Launcher` to populate arguments from the CLI (e.g. with `tyro <https://brentyi.github.io/tyro/>`_):

.. code:: python

    import torchrunx as trx
    import tyro

    def distributed_function():
        pass

    if __name__ == "__main__":
        launcher = tyro.cli(trx.Launcher)
        launcher.run(distributed_function)
By default, the ``hostnames`` or ``workers_per_host`` arguments are populated from the current SLURM allocation. If no allocation is detected, we assume 1 machine (localhost) with N workers (num. GPUs or CPUs).
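To make this fallback concrete, here is a minimal sketch of the resolution logic. It is illustrative only (not the torchrunx implementation), and the naive ``SLURM_JOB_NODELIST`` split does not expand compressed ranges like ``node[1-4]``:

```python
import os

def resolve_auto(env=None):
    """Illustrative resolution of the hostnames/workers_per_host defaults.

    Assumption (not torchrunx's actual code): SLURM is detected via its
    standard environment variables; otherwise we fall back to localhost.
    """
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" in env:
        # Naive: only handles comma-separated nodelists like "node1,node2".
        hostnames = env["SLURM_JOB_NODELIST"].split(",")
        workers_per_host = int(env.get("SLURM_CPUS_ON_NODE", "1"))
    else:
        hostnames = ["localhost"]
        workers_per_host = os.cpu_count()  # or the local GPU count, if any
    return hostnames, workers_per_host

print(resolve_auto(env={}))  # no allocation detected: (['localhost'], <num CPUs>)
```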
A ``RuntimeError`` is raised if ``hostnames="slurm"`` or ``workers_per_host="slurm"`` but no allocation is detected.
Propagating exceptions
----------------------

Exceptions that are raised in workers will be raised by the launcher process.
A :mod:`torchrunx.AgentFailedError` or :mod:`torchrunx.WorkerFailedError` will be raised if any agent or worker dies unexpectedly (e.g. if sent a signal from the OS, due to segmentation faults or OOM).
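This propagation behavior mirrors the familiar stdlib pattern in which a worker's exception is re-raised at the point where the caller collects the result (a ``concurrent.futures`` analogy, not torchrunx code):

```python
from concurrent.futures import ThreadPoolExecutor

def failing_worker():
    raise ValueError("worker failed")

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(failing_worker)
    try:
        future.result()  # the worker's exception is re-raised here, in the caller
    except ValueError as error:
        print(f"caught on the launcher side: {error}")
# prints: caught on the launcher side: worker failed
```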
Environment variables
---------------------
Environment variables in the launcher process that match the ``default_env_vars`` argument are automatically copied to agents and workers. We set useful defaults for Python and PyTorch. Environment variables are pattern-matched with this list using ``fnmatch``.
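The matching behavior can be illustrated with the stdlib directly. The pattern list below is a made-up example, not torchrunx's actual defaults:

```python
import fnmatch

def copy_matching(env: dict, patterns: list) -> dict:
    """Return the subset of env whose names match any fnmatch pattern."""
    return {name: value for name, value in env.items()
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)}

env = {"PYTHONPATH": "/src", "CUDA_VISIBLE_DEVICES": "0,1", "SHELL": "/bin/bash"}
print(copy_matching(env, ["PYTHON*", "CUDA*", "NCCL*"]))
# {'PYTHONPATH': '/src', 'CUDA_VISIBLE_DEVICES': '0,1'}
```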
``default_env_vars`` can be overridden if desired. This list can be augmented using ``extra_env_vars``. Additional environment variables (and more custom bash logic) can be included via the ``env_file`` argument. Our agents ``source`` this file.
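Since agents literally ``source`` the file with bash, any bash-compatible content works. As a simplified illustration of the idea (not torchrunx code), a parser for plain ``KEY=VALUE`` lines might look like:

```python
def parse_env_file(text: str) -> dict:
    """Simplified illustration: only handles plain KEY=VALUE (and
    'export KEY=VALUE') lines; real source-ing supports arbitrary bash."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.removeprefix("export").strip()] = value.strip()
    return env

print(parse_env_file("export NCCL_DEBUG=INFO\n# a comment\nOMP_NUM_THREADS=8"))
# {'NCCL_DEBUG': 'INFO', 'OMP_NUM_THREADS': '8'}
```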
We also set the following environment variables in each worker: ``LOCAL_RANK``, ``RANK``, ``LOCAL_WORLD_SIZE``, ``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``.
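As a sketch of how these variables relate, assuming torchrunx follows the usual torchrun-style convention where the global ``RANK`` enumerates workers across hosts (illustrative, not the actual implementation):

```python
def worker_envs(hostnames, workers_per_host, master_addr, master_port):
    """Enumerate per-worker variables for a hypothetical launch."""
    world_size = len(hostnames) * workers_per_host
    envs = []
    for host_index, hostname in enumerate(hostnames):
        for local_rank in range(workers_per_host):
            envs.append({
                "LOCAL_RANK": str(local_rank),
                "RANK": str(host_index * workers_per_host + local_rank),
                "LOCAL_WORLD_SIZE": str(workers_per_host),
                "WORLD_SIZE": str(world_size),
                "MASTER_ADDR": master_addr,
                "MASTER_PORT": str(master_port),
            })
    return envs

envs = worker_envs(["node1", "node2"], 2, "node1", 29500)
print(envs[3]["RANK"], envs[3]["LOCAL_RANK"])
# 3 1
```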
Custom logging
--------------
We forward all logs (i.e. from :mod:`logging` and :mod:`sys.stdout`/:mod:`sys.stderr`) from workers and agents to the launcher. By default, the logs from the first agent and its first worker are printed into the launcher's ``stdout`` stream. Logs from all agents and workers are written to files in ``$TORCHRUNX_LOG_DIR`` (default: ``./torchrunx_logs``) and are named by timestamp, hostname, and local_rank.
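The naming scheme can be sketched as follows. The exact filename format here is an assumption based on an earlier version of these docs (``{timestamp}-{hostname}.log`` for agents, ``{timestamp}-{hostname}[{local_rank}].log`` for workers), and ``log_path`` is a hypothetical helper, not part of the torchrunx API:

```python
import os

def log_path(log_dir, timestamp, hostname, local_rank=None):
    """Hypothetical helper mirroring the naming scheme described above."""
    name = f"{timestamp}-{hostname}"
    if local_rank is not None:  # worker log; agent logs omit the rank suffix
        name += f"[{local_rank}]"
    return os.path.join(log_dir, name + ".log")

print(log_path("./torchrunx_logs", "2024-01-01_12-00-00", "node1", 0))
```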
:mod:`logging.Handler` objects can be passed via the ``log_handlers`` argument for further customization (e.g. mapping specific agents/workers to custom output streams).
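For example, an ordinary stdlib handler can be constructed and passed through. The formatter settings below are arbitrary, and the commented-out ``trx.launch(...)`` call only illustrates where the handler would go:

```python
import logging
import sys

# A standard stdlib handler: write records to stdout with timestamps.
stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.setLevel(logging.INFO)
stdout_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s: %(message)s"))

# e.g. trx.launch(..., log_handlers=[stdout_handler])
```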
We provide some utilities to help: