How it works
============

If you want to (e.g.) train your model on several machines with **N** GPUs each, you should run your training function in **N** parallel processes on each machine. During training, each of these processes runs the same training code (i.e. your function), and they communicate with each other (e.g. to synchronize gradients) using a `distributed process group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`_.
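
For intuition, here is a minimal sketch of what one of these worker processes typically runs: it joins the process group and wraps its model so that gradients are synchronized across all workers. This is illustrative only (it is not part of **torchrunx**), and it assumes the NCCL backend and that ``RANK``, ``WORLD_SIZE``, ``MASTER_ADDR``, ``MASTER_PORT``, and ``LOCAL_RANK`` are set in each worker's environment.

.. code-block:: python

    # Hypothetical per-worker code (illustrative only).
    import os

    import torch
    import torch.distributed as dist


    def train() -> None:
        # Every worker (N per machine) joins the same process group. With the
        # default "env://" init method, this reads RANK, WORLD_SIZE,
        # MASTER_ADDR, and MASTER_PORT from the environment.
        dist.init_process_group(backend="nccl")

        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 10).to(local_rank)
        # DDP all-reduces gradients across all workers after each backward pass.
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
        # ... forward / backward / optimizer steps go here ...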

Your script can call our library (via :func:`torchrunx.launch`) and specify a function to distribute. The main process running your script is henceforth known as the **launcher** process.
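
A launcher script might look roughly like the following sketch. Aside from :func:`torchrunx.launch` and its ``backend`` option, the argument names shown here (``func``, ``hostnames``, ``workers_per_host``) and the host names are illustrative assumptions; see the API reference for the actual signature.

.. code-block:: python

    # Sketch of a launcher script. Argument names other than ``backend`` are
    # assumptions for illustration; consult the torchrunx API reference.
    import torchrunx


    def train() -> None:
        ...  # your distributed training function (run by every worker)


    if __name__ == "__main__":
        # The process running this script is the "launcher" process.
        results = torchrunx.launch(
            func=train,
            hostnames=["gpu-node-1", "gpu-node-2"],  # hypothetical machine names
            workers_per_host=8,                      # N workers (e.g. one per GPU) per machine
            backend="nccl",                          # process-group backend for the workers
        )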

Our launcher process spawns an **agent** process (via SSH) on each machine. Each agent then spawns **N** processes (known as **workers**) on its machine. All workers form a process group (with the ``backend`` specified in :func:`torchrunx.launch`) and run your function in parallel.
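
Conceptually, each agent does something like the sketch below: it starts **N** worker processes on its machine and waits for them. This uses ``torch.multiprocessing.spawn`` as a simplified stand-in; it is not torchrunx's actual implementation.

.. code-block:: python

    # Simplified stand-in for an agent spawning its workers
    # (not torchrunx's actual implementation).
    import torch.multiprocessing as mp


    def worker(local_rank: int, train_fn) -> None:
        # A real worker would also set up its rank and environment here,
        # join the process group, and capture the function's return value.
        train_fn()


    def run_agent(train_fn, workers_per_host: int) -> None:
        # Start N worker processes on this machine and wait for them to finish.
        mp.spawn(worker, args=(train_fn,), nprocs=workers_per_host, join=True)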

**Agent–Worker Communication.** Our agents poll their workers every second and time out if unresponsive for 5 seconds. Upon polling, our agents receive ``None`` (if the workers are still running) or a `RunProcsResult <https://pytorch.org/docs/stable/elastic/multiprocessing.html#torch.distributed.elastic.multiprocessing.api.RunProcsResult>`_, indicating that the workers have either completed (providing the object returned by, or the exception raised by, our function) or failed (e.g. due to a segmentation fault or an OS signal).
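
In rough pseudocode, an agent's monitoring loop looks like the following. ``poll_workers`` is a hypothetical stand-in for polling the worker processes; the ``failures`` and ``return_values`` fields mirror those of ``RunProcsResult``.

.. code-block:: python

    # Rough sketch of an agent's monitoring loop (not torchrunx's actual code).
    # ``poll_workers`` is a hypothetical callable returning None while workers
    # are still running, or a RunProcsResult-like object once they finish/fail.
    # (The 5-second unresponsiveness timeout is omitted here for brevity.)
    import time

    POLL_INTERVAL = 1.0  # agents poll their workers every second


    def monitor(poll_workers) -> dict:
        while True:
            result = poll_workers()
            if result is None:
                time.sleep(POLL_INTERVAL)  # workers still running; check again later
                continue
            if result.failures:
                # Workers failed (e.g. segmentation fault or OS signal).
                raise RuntimeError(f"worker failure(s): {result.failures}")
            # Workers completed: return each worker's return value (an exception
            # raised by the function would be captured in the result).
            return result.return_values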

**Launcher–Agent Communication.** The launcher and agents form a distributed process group (with the CPU-based `GLOO backend <https://pytorch.org/docs/stable/distributed.html#backends>`_) for our library's own communication. Our agents synchronize their "statuses" with each other and with the launcher. An agent's status includes whether it is running, has failed, or has completed, along with the result of the function. If the launcher or any agent fails to synchronize, all of them raise a :exc:`torchrunx.AgentFailedError` and terminate. If any worker fails or raises an exception, the launcher raises a :exc:`torchrunx.WorkerFailedError` or that exception, respectively, and terminates along with all of the agents. If all agents succeed, the launcher returns the objects returned by each worker.
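
From the user's perspective, this error handling means a launch can be wrapped as in the sketch below. The :exc:`torchrunx.AgentFailedError` and :exc:`torchrunx.WorkerFailedError` exceptions are described above; the :func:`torchrunx.launch` arguments are the illustrative ones from the earlier sketch, and the exact shape of the returned results is documented in the API reference.

.. code-block:: python

    # Sketch of handling launch outcomes in the launcher script.
    import torchrunx


    def train():
        ...  # your distributed function; each worker's return value is sent back


    try:
        results = torchrunx.launch(
            func=train,
            hostnames=["gpu-node-1", "gpu-node-2"],  # illustrative, as above
            workers_per_host=8,
        )
    except torchrunx.AgentFailedError:
        # The launcher or an agent failed to synchronize; everything terminates.
        raise
    except torchrunx.WorkerFailedError:
        # A worker died without raising an exception (e.g. segmentation fault).
        raise
    else:
        # An exception raised *inside* ``train`` would have propagated as-is.
        # On success, ``results`` holds the objects returned by each worker.
        print(results)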