In addition to ``torchrunx.launch``, we provide the ``torchrunx.Launcher`` dataclass. This allows **torchrunx** arguments to be more easily populated by CLI packages like ``tyro``:
.. code:: python

    import torchrunx as trx
    import tyro

    def distributed_function():
        print("Hello world!")

    if __name__ == "__main__":
        launcher = tyro.cli(trx.Launcher)
        launcher.run(distributed_function, {})
.. autoclass:: torchrunx.Launcher
.. autofunction:: torchrunx.Launcher.run
Logging
-------
All logs are generated in the folder provided as the ``log_dir`` argument to :mod:`torchrunx.launch`. Each agent generates a log file, named using the current date and time followed by the agent's hostname. Each worker also has a log file, named identically to its agent's log file except that the worker's local rank is appended to the name. Each agent additionally includes the output of its local worker 0 in its own log. The launcher renders agent 0's log to ``stdout`` in real time.
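To make the naming scheme concrete, here is a small sketch. The exact timestamp layout and separators that **torchrunx** uses are assumptions here; only the overall structure (date and time, then hostname, with the worker's local rank appended for worker logs) comes from the description above.

```python
from datetime import datetime

def agent_log_name(hostname: str, now: datetime) -> str:
    # Agent log: current date and time, followed by the agent's hostname
    # (the "%Y-%m-%d_%H-%M-%S" layout is an assumption for illustration)
    return f"{now:%Y-%m-%d_%H-%M-%S}_{hostname}.log"

def worker_log_name(hostname: str, now: datetime, local_rank: int) -> str:
    # Worker log: identical to the agent's, plus the worker's local rank
    return f"{now:%Y-%m-%d_%H-%M-%S}_{hostname}_{local_rank}.log"

now = datetime(2024, 1, 1, 12, 0, 0)
print(agent_log_name("node1", now))      # 2024-01-01_12-00-00_node1.log
print(worker_log_name("node1", now, 0))  # 2024-01-01_12-00-00_node1_0.log
```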
..
   TODO: example log structure
Worker environment
------------------
The :mod:`torchrunx.launch` ``env_vars`` argument allows the user to specify which environment variables should be copied from the launcher environment to the agents. By default, it attempts to copy variables related to Python and to important packages/technologies that **torchrunx** uses, such as PyTorch, NCCL, and CUDA. The strings provided are matched against environment variable names using ``fnmatch``, i.e. standard UNIX filename pattern matching. Matching variables are inserted into the agent environments, and then copied to the workers' environments when they are spawned.
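The matching behavior can be illustrated in isolation (a standalone sketch, not **torchrunx** internals; the environment contents below are made up):

```python
import fnmatch

# Patterns in the style of the defaults: exact names or UNIX-style wildcards
patterns = ["PATH", "PYTHON*", "CUDA*", "NCCL*"]

# A made-up launcher environment
launcher_env = {
    "PATH": "/usr/bin:/bin",
    "PYTHONPATH": "/opt/project",
    "CUDA_VISIBLE_DEVICES": "0,1",
    "HOME": "/home/user",  # matches no pattern, so it is not copied
}

# Copy only the variables whose names match at least one pattern
copied = {
    name: value
    for name, value in launcher_env.items()
    if any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
}

print(sorted(copied))  # ['CUDA_VISIBLE_DEVICES', 'PATH', 'PYTHONPATH']
```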
:mod:`torchrunx.launch` also accepts the ``env_file`` argument, which exposes more advanced environment configuration to the user. When a file is provided as this argument, the launcher sources the file on each node before executing the agent. This makes it possible to run custom bash scripts during environment setup and to set node-specific environment variables.
Ensure you have the latest development environment installed. After cloning our repository, `install pixi <https://pixi.sh/latest/#installation>`_ and run ``pixi shell`` in the repo's root directory. Additionally, we use `ruff <https://github.com/astral-sh/ruff>`_ for linting and formatting, `pyright <https://github.com/microsoft/pyright>`_ for type checking, and ``pytest`` for testing.
Testing
-------
``tests/`` contains ``pytest``-style tests for validating that code changes do not break the core functionality of **torchrunx**. At the moment, we have a few simple CI tests powered by GitHub Actions, which are limited to single-agent, CPU-only tests due to GitHub's infrastructure.
Contributing
------------
Make a pull request with your changes and we'll try to look at it soon! If addressing a specific issue, mention it in the PR and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.
To organize processes across different nodes, **torchrunx** maintains the following hierarchy:
#. The launcher, the process in which ``torchrunx.Launcher.run`` is executed: Connects to remote hosts and initializes and configures "agents", passes errors and return values from agents to the caller, and is responsible for cleaning up.
#. The agents, initialized on machines where computation is to be performed: Responsible for starting and monitoring "workers".
#. The workers, spawned by agents: Responsible for initializing a ``torch.distributed`` process group, and running the distributed function provided by the user.
An example of how this hierarchy might look in practice is the following:

Suppose we wish to distribute a training function across four GPUs, and we have access to a cluster whose nodes each have two available GPUs. A single instance of our training function can itself leverage multiple GPUs. We choose two available nodes and use the launcher to launch our function on them, specifying one worker per node, since a single instance of our training function can use both GPUs on its node. The launcher starts an agent on each node and passes our configuration to the agents, after which each agent initializes one worker to begin executing the training function. We could instead run two workers per node with one GPU each, giving four workers in total, although this would be slower.
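The ``workers_per_host`` bookkeeping for the scenario above can be sketched as follows (the hostnames are hypothetical, and the normalization helper is ours for illustration, not **torchrunx** code):

```python
# Two chosen nodes, each with two available GPUs (hypothetical hostnames)
hostnames = ["node1", "node2"]

# One worker per node: an int applies to every host,
# while a list would give node i exactly workers_per_host[i] workers
workers_per_host = 1

counts = (
    [workers_per_host] * len(hostnames)
    if isinstance(workers_per_host, int)
    else list(workers_per_host)
)
world_size = sum(counts)

print(counts, world_size)  # [1, 1] 2 -> two workers, each using both GPUs of its node
```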
The launcher initializes the agents by SSHing into the provided hosts and executing the agent code there. It also provides key environment variables from the launch environment to the sessions where the agents are started, and tries to activate the same Python environment that was used to execute the launcher. This is one reason why all machines running a launcher or agent process should share a filesystem.
The launcher and agents perform exception handling such that any exception raised in a worker process is appropriately re-raised by the launcher process. The launcher and agents communicate using a ``torch.distributed`` process group that is separate from the group used by the workers.
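A toy sketch of this error-propagation pattern (purely illustrative; none of these names are **torchrunx** internals):

```python
class WorkerFailure(Exception):
    """Wraps an exception raised inside a worker (illustrative only)."""

def worker(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    return x * 2

def agent_run(x: int):
    # The agent catches worker errors and returns them as values
    # instead of crashing, so they can be shipped back to the launcher
    try:
        return worker(x)
    except Exception as exc:
        return WorkerFailure(f"{type(exc).__name__}: {exc}")

def launcher_collect(results: list):
    # The launcher re-raises the first worker failure it sees,
    # so the caller observes worker errors as ordinary exceptions
    for result in results:
        if isinstance(result, WorkerFailure):
            raise result
    return results

print(launcher_collect([agent_run(2), agent_run(3)]))  # [4, 6]
```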
Changes to ``src/torchrunx/launcher.py`` (28 additions, 17 deletions):

.. code:: diff

    @@ -103,30 +103,16 @@ def run(
             func_kwargs: dict[str, Any],
         ) -> dict[int, Any]:
             """
    -        Launch a distributed pytorch function on the specified nodes.
    +        Launch a distributed PyTorch function on the specified nodes. See :mod:`torchrunx.launch`

             :param func: The distributed function to call on all workers
             :type func: Callable
             :param func_kwargs: Any keyword arguments to be provided when calling ``func``
             :type func_kwargs: dict[str, Any]
    -        :param hostnames: A list of node hostnames to start workers on, defaults to ["localhost"]
    -        :type hostnames: list[str], optional
    -        :param workers_per_host: The number of workers per node. Providing an ``int`` implies all nodes should have ``workers_per_host`` workers, meanwhile providing a list causes node ``i`` to have ``worker_per_host[i]`` workers, defaults to 1
    -        :type workers_per_host: int | list[int], optional
    -        :param ssh_config_file: An SSH configuration file to use when connecting to nodes, defaults to None
    -        :param backend: A ``torch.distributed`` `backend string <https://pytorch.org/docs/stable/distributed.html#torch.distributed.Backend>`_, defaults to None
    -        :param log_dir: A directory in which logs should be written, defaults to "./logs"
    -        :type log_dir: os.PathLike | str, optional
    -        :param env_vars: A list of environmental variables to be copied from the launcher environment to workers. Allows for bash pattern matching syntax, defaults to ["PATH", "LD_LIBRARY", "LIBRARY_PATH", "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*"]
    -        :type env_vars: list[str], optional
    -        :param env_file: An additional environment file that will be sourced prior to executing ``func``, defaults to None
    -        :raises RuntimeError: May fail due to misconfiguration, or errors thrown by ``func``
             :return: A dictionary mapping worker ranks to their output
             :rtype: dict[int, Any]
    -        """  # noqa: E501
    +        """
             if not dist.is_available():
                 raise RuntimeError("The torch.distributed package is not available.")

    @@ -284,7 +270,32 @@ def launch(
                 "NCCL*",
             ],
             env_file: str | os.PathLike | None = None,
    -    ):
    +    ) -> dict[int, Any]:
    +        """
    +        Launch a distributed PyTorch function on the specified nodes.
    +
    +        :param func: The distributed function to call on all workers
    +        :type func: Callable
    +        :param func_kwargs: Any keyword arguments to be provided when calling ``func``
    +        :type func_kwargs: dict[str, Any]
    +        :param hostnames: A list of node hostnames to start workers on, defaults to ["localhost"]
    +        :type hostnames: list[str], optional
    +        :param workers_per_host: The number of workers per node. Providing an ``int`` implies all nodes should have ``workers_per_host`` workers, meanwhile providing a list causes node ``i`` to have ``worker_per_host[i]`` workers, defaults to 1
    +        :type workers_per_host: int | list[int], optional
    +        :param ssh_config_file: An SSH configuration file to use when connecting to nodes, defaults to None
    +        :param backend: A ``torch.distributed`` `backend string <https://pytorch.org/docs/stable/distributed.html#torch.distributed.Backend>`_, defaults to None
    +        :param log_dir: A directory in which logs should be written, defaults to "./logs"
    +        :type log_dir: os.PathLike | str, optional
    +        :param env_vars: A list of environmental variables to be copied from the launcher environment to workers. Allows for bash pattern matching syntax, defaults to ["PATH", "LD_LIBRARY", "LIBRARY_PATH", "PYTHON*", "CUDA*", "TORCH*", "PYTORCH*", "NCCL*"]
    +        :type env_vars: list[str], optional
    +        :param env_file: An additional environment file that will be sourced prior to executing ``func``, defaults to None