)["localhost"][0] # return from rank 0 (first worker on "localhost")
```
## Why should I use this?
[`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) is a hammer. `torchrunx` is a chisel.
Whether you have 1 GPU, 8 GPUs, or 8 machines:
Convenience:
- If you don't want to set up [`dist.init_process_group`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) yourself
- If you want to run `python myscript.py` instead of `torchrun myscript.py`
- If you don't want to manually SSH and run `torchrun --master-ip --master-port ...` on every machine (and if you don't want to babysit these machines for hanging failures)
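
For example, a single-script launch might look like the following sketch (the `train` function body is hypothetical; the `launch` arguments and result indexing follow the snippet above):

```python
# myscript.py - run with `python myscript.py`; no `torchrun` invocation needed
import os

import torchrunx as trx

def train():
    # torchrunx sets up the process group for you (no manual
    # dist.init_process_group); standard environment variables like
    # RANK are assumed to be available here, as they are under torchrun.
    rank = int(os.environ["RANK"])
    print(f"Hello from rank {rank}")
    return rank

if __name__ == "__main__":
    result = trx.launch(
        func=train,
        hostnames=["localhost"],
        workers_per_host=2,
    )["localhost"][0]  # value returned by rank 0, as in the snippet above
```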
Advanced Usage
==============
Multiple functions in one script
--------------------------------
We can also launch multiple functions from one script (e.g., train on many GPUs, then test on one GPU):

.. code-block:: python

    import torchrunx as trx

    trained_model = trx.launch(
        func=train,
        hostnames=["node1", "node2"],
        workers_per_host=8
    ).value(rank=0)

    accuracy = trx.launch(
        func=test,
        func_kwargs={'model': trained_model},
        hostnames=["localhost"],
        workers_per_host=1
    ).value(rank=0)

    print(f'Accuracy: {accuracy}')

``trx.launch()`` is self-cleaning: all processes are terminated (and the memory they used is fully released) after each invocation.
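
The ``train`` and ``test`` functions above are assumed to be defined in the same script. As a purely illustrative sketch (their bodies and signatures are hypothetical, not part of the ``torchrunx`` API):

.. code-block:: python

    import torch

    def train() -> torch.nn.Module:
        # Hypothetical distributed training loop; each worker returns its model.
        model = torch.nn.Linear(10, 10)
        # ... training steps using torch.distributed ...
        return model

    def test(model: torch.nn.Module) -> float:
        # Hypothetical single-process evaluation returning an accuracy.
        correct, total = 0, 1
        # ... evaluation loop ...
        return correct / total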
Environment Detection
---------------------
Custom Logging
--------------
Logs are generated at the worker and agent level and are configured via the ``log_spec`` argument of :mod:`torchrunx.launch`. By default, a :mod:`torchrunx.DefaultLogSpec` is instantiated, causing worker- and agent-level logs to be written to files under ``'./logs'``, while the rank 0 worker's output streams are also streamed to the launcher's ``stdout``. Logs are prefixed with a timestamp by default. Agent logs have the format ``{timestamp}-{agent hostname}.log`` and worker logs have the format ``{timestamp}-{agent hostname}[{worker local rank}].log``.
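
For example, a run on hosts ``node1`` and ``node2`` with two workers each might produce a directory like the following (the timestamp format here is purely illustrative):

.. code-block:: text

    logs/
        2024-01-01T12:00:00-node1.log        (agent on node1)
        2024-01-01T12:00:00-node1[0].log     (worker, local rank 0)
        2024-01-01T12:00:00-node1[1].log     (worker, local rank 1)
        2024-01-01T12:00:00-node2.log
        2024-01-01T12:00:00-node2[0].log
        2024-01-01T12:00:00-node2[1].log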
Custom logging behavior can be implemented by subclassing :mod:`torchrunx.LogSpec`. To be passed to :mod:`torchrunx.launch`, any subclass must have a ``get_map`` method returning a dictionary that maps logger names to lists of :mod:`logging.Handler` objects. The logger names have the format ``{agent hostname}`` for agents and ``{agent hostname}[{worker local rank}]`` for workers. The :mod:`torchrunx.DefaultLogSpec` maps each of these loggers to a :mod:`logging.FileHandler` pointing at the files mentioned in the previous paragraph. It additionally maps the global rank 0 worker to a :mod:`logging.StreamHandler`, which writes logs to the launcher's ``stdout`` stream.
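
As a rough sketch, a custom spec that streams every agent and worker log to the launcher's ``stdout`` might look like this (the constructor is hypothetical; only the ``get_map`` contract and the logger-name formats come from the paragraph above):

.. code-block:: python

    import logging

    import torchrunx as trx

    class StreamingLogSpec(trx.LogSpec):
        def __init__(self, hostnames, workers_per_host):
            # Hypothetical constructor: store whatever get_map() needs.
            self.hostnames = hostnames
            self.workers_per_host = workers_per_host

        def get_map(self) -> dict[str, list[logging.Handler]]:
            handler_map = {}
            for hostname in self.hostnames:
                # Agent logger, named "{agent hostname}"
                handler_map[hostname] = [logging.StreamHandler()]
                for local_rank in range(self.workers_per_host):
                    # Worker logger, named "{agent hostname}[{worker local rank}]"
                    handler_map[f"{hostname}[{local_rank}]"] = [logging.StreamHandler()]
            return handler_map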