
Commit 1035b75

final updates to docs
1 parent ff0b90e commit 1035b75

File tree: 12 files changed (+186, -171 lines)


CONTRIBUTING.md

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ We use `ruff check` for linting, `ruff format` for formatting, `pyright` for sta

 ## Pull Requests

-Make a pull request with your changes on Github and we'll try to look at it soon! If addressing a specific issue, mention it in the PR, and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in __torchrunx__.
+Make a pull request with your changes on Github and we'll try to look at it soon! If addressing a specific issue, mention it in the PR, and offer a short explanation of your fix. If adding a new feature, explain why it's meaningful and belongs in **torchrunx**.

 ## Testing

@@ -16,4 +16,4 @@ At the moment, we run `pytest tests/test_ci.py` (i.e. simple single-node CPU-onl

 ## Documentation

-Our documentation is hosted on Github Pages and is updated with every package release. We build our documentation with `sphinx` using the command: `uv run --group docs python -m sphinx --builder html --doctree-dir docs/_build/.doctrees --conf-dir docs --show-traceback docs/source docs/_build/html`. The documentation will then be generated at `docs/_build/html`.
+Our documentation is hosted on Github Pages and is updated with every package release. We build our documentation with [Sphinx](https://www.sphinx-doc.org): `source scripts/build_docs.sh`. The documentation will then be generated at `docs/_build/html` (and can be rendered with `python -m http.server --directory docs/_build/html`).

README.md

Lines changed: 14 additions & 17 deletions
@@ -21,20 +21,16 @@ It enables complex workflows within a single script and has useful features even
 pip install torchrunx
 ```

-Requires:
-- Linux
-- If using multiple machines: SSH & shared filesystem
+Requires: Linux. If using multiple machines: SSH & shared filesystem.

 ---

-**Dummy example: parallelizing training with `torchrunx`**
+<h4>Example: simple training loop</h4>
+
+Suppose we have some distributed training function (which needs to run on every GPU):

 ```python
-def distributed_training(model: nn.Module, num_steps: int) -> nn.Module:
-    # Environment variables: RANK, LOCAL_RANK, ...
-    # ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
-    ...
-    retun trained_model
+def distributed_training(model: nn.Module, num_steps: int) -> nn.Module: ...
 ```

 <details>
@@ -70,14 +66,14 @@ def distributed_training(model: nn.Module, num_steps: int = 10) -> nn.Module | N

 </details>

+We can distribute and run this function (e.g. on 2 machines x 2 GPUs) using **`torchrunx`**!
+
 ```python
 import torchrunx

-# Launch training on 2 machines x 2 GPUs
-
 launcher = torchrunx.Launcher(
-    hostnames = ["localhost", "second_machine"],
-    workers_per_host = 2
+    hostnames = ["localhost", "second_machine"],  # or IP addresses
+    workers_per_host = 2  # e.g. number of GPUs per host
 )

 results = launcher.run(
@@ -87,16 +83,17 @@ results = launcher.run(
 )
 ```

+Once completed, you can retrieve the results and process them as you wish.
+
 ```python
-# get the results
 trained_model: nn.Module = results.rank(0)
-# or: results.index(hostname="localhost", local_rank=0)
+# or: results.index(hostname="localhost", local_rank=0)

-# and continue your script — e.g. save model to checkpoint
+# and continue your script
 torch.save(trained_model.state_dict(), "output/model.pth")
 ```

-**See examples where we fine-tune LLMs using:**
+**See more examples where we fine-tune LLMs using:**
 - [Transformers](https://torchrun.xyz/examples/transformers.html)
 - [DeepSpeed](https://torchrun.xyz/examples/deepspeed.html)
 - [PyTorch Lightning](https://torchrun.xyz/examples/lightning.html)
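
The full `distributed_training` implementation is collapsed inside the README's `<details>` block and does not appear in this diff. As a rough sketch, assuming standard PyTorch DDP (the model input, loss, and backend below are placeholders), such a function might look like:

```python
from __future__ import annotations

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def distributed_training(model: nn.Module, num_steps: int = 10) -> nn.Module | None:
    # torchrunx sets the usual distributed environment variables (RANK, LOCAL_RANK, ...)
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(backend="nccl")  # assumes one NVIDIA GPU per worker
    model = model.to(local_rank)
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(ddp_model.parameters())
    for _ in range(num_steps):
        inputs = torch.randn(8, 128, device=local_rank)  # placeholder batch
        loss = ddp_model(inputs).sum()  # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # only rank 0 returns the model, matching the `results.rank(0)` call above
    return model if rank == 0 else None
```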

docs/source/examples/deepspeed.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Here's an example script that uses `torchrunx` with [DeepSpeed](https://www.deep

 ## Training GPT-2 on WikiText

-Deepspeed requires additional (non-Python) dependencies. Use the following commands to set up a project. Source: [Apoorv's Blog — Managing Project Dependencies](https://blog.apoorvkh.com/posts/project-dependencies.html)
+Deepspeed requires additional (non-Python) dependencies. Use the following commands to set up a project. [source: [Apoorv's Blog — Managing Project Dependencies](https://blog.apoorvkh.com/posts/project-dependencies.html)]

 Pre-requisite: [pixi](https://pixi.sh)

docs/source/how_it_works.md

Lines changed: 4 additions & 0 deletions
@@ -4,12 +4,16 @@ Suppose you want to run a script (`train.py`) on `N` machines (or "nodes") with

 You'll need to start a new process for each GPU. Each process will execute your script in parallel and select its GPU based on the process rank. Your script will also form a [distributed group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) so the processes may communicate with each other (e.g. passing tensors).

+## `torchrun`
+
 Normally, you'd do this by running the `torchrun --node-rank {i} ... train.py ...` command on every machine. In short, you'll end up with a topology like:

 ![torchrun diagram](./artifacts/torchrun.png)

 As a side effect of this structure, every process will run until (1) script completion or (2) another process stops communicating (e.g. if killed by the system for abnormal reasons). The status of other processes is not actively communicated: so if some process is indeed killed, it would take 10 minutes (by default) for the remaining processes to time-out. Also, since this approach parallelizes the entire script, we can't catch and handle these system-level issues as exceptions.

+## `torchrunx` 🔥
+
 `torchrunx` offers a functional interface, with a launcher–worker topology, instead.

 ![torchrunx diagram](./artifacts/torchrunx.png)
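
To make the contrast concrete, here is a minimal sketch of launching the same workload both ways (hostnames, worker counts, and the exact `torchrun` flags are illustrative):

```python
# torchrun (CLI): run this command on every machine; the entire script is parallelized.
#   torchrun --nnodes 2 --node-rank {i} --nproc-per-node 2 \
#            --master-addr main_machine --master-port 29500 train.py

# torchrunx (functional): run the launcher once; only this function is parallelized.
import torchrunx


def distributed_training():
    ...  # executed by every worker process


if __name__ == "__main__":
    torchrunx.Launcher(
        hostnames=["main_machine", "second_machine"],
        workers_per_host=2,
    ).run(distributed_training)
```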

docs/source/usage/general.md

Lines changed: 4 additions & 4 deletions
@@ -34,7 +34,7 @@ You can catch these errors and handle them as you wish!
 ```python
 for config in configs:  # e.g. hyper-parameter sweep
     try:
-        Launcher().run(train, config)
+        torchrunx.Launcher().run(train, config)
     except torch.cuda.OutOfMemoryError:
         print(f"{config} results in OOM... continuing...")
 ```
@@ -44,12 +44,12 @@ If you are expecting intermittent failures, you can catch errors and invoke retr
 ```python
 for retry in range(3):
     try:
-        Launcher().run(train, resume_from_checkpoint=True)
+        torchrunx.Launcher().run(train, resume_from_checkpoint=True)
     except torchrunx.WorkerFailedError as e:
         print(f"Error occurred: {e}")
         print(f"Retrying ({retry}) ...")
-    else:
-        break
+    else:  # if run() is successful
+        break
 ```

 ## Environment variables

docs/source/usage/logging.md

Lines changed: 43 additions & 18 deletions
@@ -1,37 +1,62 @@
 # Custom Logging

-We forward all worker and agent logs (i.e. from {mod}`logging`, {obj}`sys.stdout`, and {obj}`sys.stderr`) to the launcher for processing.
+We forward all agent and worker logs (i.e. from {mod}`logging`, {obj}`sys.stdout`, and {obj}`sys.stderr`) to the launcher process.

-By default, the logs from the rank 0 agent and worker are printed into the launcher's `stdout` stream. Logs from all agents and workers are written to a directory (by the current timestamp) in `$TORCHRUNX_LOG_DIR` (default: `./torchrunx_logs`).
+## Defaults

-You can fully customize how logs are processed using {func}`torchrunx.Launcher.set_logging_handlers`. You should provide it a function that constructs and returns a list of {obj}`logging.Handler` objects. Each {obj}`logging.Handler` controls where logs should be written.
+By default, the logs from the rank 0 agent and rank 0 worker are handled by loggers on the launcher process (and so they should be printed to `stdout`/`stderr`). You may control these logs like:

-We provide some handler utilities that direct a specified worker or agent's logs to a file or stream.
-
-```{eval-rst}
-.. autofunction:: torchrunx.utils.file_handler
+```python
+logging.basicConfig(level=logging.INFO)
+logging.getLogger("torchrunx").setLevel(logging.DEBUG)
+logging.getLogger("torchrunx.node1").setLevel(logging.INFO)
+logging.getLogger("torchrunx.node1.1").setLevel(logging.INFO)  # worker 1 (local rank) on node 1
 ```

-```{eval-rst}
-.. autofunction:: torchrunx.utils.stream_handler
-```
+Also, logs from all agents and workers are written to a directory (by the current timestamp) in `$TORCHRUNX_LOG_DIR` (default: `./torchrunx_logs`). These can be controlled using `$TORCHRUNX_LOG_LEVEL` (default: `INFO`).

-For example, we could construct and pass a handler factory that streams the rank 0 agent and worker logs to the launcher's `stdout`.
+## Customization
+
+You can fully customize how logs are processed using {func}`torchrunx.Launcher.set_logging_handlers`. You should provide it a factory function that constructs and returns a list of {obj}`logging.Handler` objects. Each {obj}`logging.Handler` controls where logs should be written. You can also add a filter to restrict the handler to the logs of a specific agent or worker.
+
+Here's an example:

 ```python
-def rank_0_handlers() -> list[logging.Handler]:
+from torchrunx.utils.log_handling import RedirectHandler, get_handler_filter
+
+def custom_handlers() -> list[logging.Handler]:
+
+    # Handler: redirect logs from (host 0, agent) to logger on launcher process
+    redirect_handler = RedirectHandler()
+    redirect_handler.addFilter(get_handler_filter(
+        hostname=hostnames[0], local_rank=None, log_level=logging.DEBUG
+    ))
+
+    # Handler: output logs from (host 0, worker 0) to "output.txt"
+    file_handler = logging.FileHandler("output.txt")
+    file_handler.addFilter(get_handler_filter(
+        hostname=hostnames[0], local_rank=0, log_level=logging.DEBUG
+    ))
+
     return [
-        stream_handler(hostname=hostnames[0], local_rank=None),  # agent 0
-        stream_handler(hostname=hostnames[0], local_rank=0),  # worker 0
+        redirect_handler,
+        file_handler,
     ]
 ```

 ```python
-torchrunx.Launcher(...).set_logging_handlers(rank_0_handlers).run(...)
+torchrunx.Launcher(...).set_logging_handlers(custom_handlers).run(...)
 ```

-You can also [provide your own ``logging.Handler``](https://docs.python.org/3.9/library/logging.handlers.html#module-logging.handlers) and apply {func}`torchrunx.utils.add_filter_to_handler` to constrain which worker or agent's logs it should process.
+Finally, you can control library-specific logging (within the worker processes) by modifying the distributed function:
+
+```python
+def distributed_function():
+    logging.getLogger("transformers").setLevel(logging.DEBUG)
+
+    logger = logging.getLogger("my_app")
+    logger.info("Hello world!")
+    ...

-```{eval-rst}
-.. autofunction:: torchrunx.utils.add_filter_to_handler
+torchrunx.Launcher(...).run(distributed_function)
 ```

docs/source/usage/slurm.md

Lines changed: 4 additions & 6 deletions
@@ -14,9 +14,8 @@ def distributed_training():

 if __name__ == "__main__":
     torchrunx.Launcher(
-        # optionally specify:
-        # hostnames = "slurm",
-        # workers_per_host = "gpu"
+        hostnames = "slurm",
+        workers_per_host = "gpu"
     ).run(distributed_training)
 ```

@@ -46,9 +45,8 @@ def distributed_training():

 def launch_training():
     torchrunx.Launcher(
-        # optionally specify:
-        # hostnames = "slurm",
-        # workers_per_host = "gpu"
+        hostnames = "slurm",
+        workers_per_host = "gpu"
     ).run(distributed_training)

 if __name__ == "__main__":

src/torchrunx/agent.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
     LauncherAgentGroup,
     get_open_port,
 )
-from .utils.logs import log_records_to_socket, redirect_stdio_to_logger
+from .utils.log_streaming import log_records_to_socket, redirect_stdio_to_logger
 from .worker import WorkerArgs, worker_entrypoint


src/torchrunx/launcher.py

Lines changed: 3 additions & 3 deletions
@@ -29,7 +29,8 @@
     resolve_environment,
 )
 from .utils.errors import ExceptionFromWorker, WorkerFailedError
-from .utils.logs import LoggingServerArgs, default_handlers, start_logging_server
+from .utils.log_handling import default_handlers
+from .utils.log_streaming import LoggingServerArgs, start_logging_server

 DEFAULT_ENV_VARS_FOR_COPY = (
     "PATH",
@@ -80,10 +81,9 @@ def set_logging_handlers(
     ) -> Self:
         """Provide a ``handler_factory`` function to customize processing of agent/worker logs.

-        See `Custom Logging <https://torchrun.xyz/features/logging.html>`_.
-
         Parameters:
             handler_factory: Function that constructs and returns :obj:`logging.Handler` objects.
+                See `Custom Logging <https://torchrun.xyz/usage/logging.html>`_ for more details.
         """
         self.handler_factory = handler_factory
         return self

src/torchrunx/utils/log_handling.py

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+"""Utilities for intercepting logs in worker processes and handling these in the Launcher."""
+
+from __future__ import annotations
+
+__all__ = [
+    "RedirectHandler",
+    "default_handlers",
+    "file_handlers",
+    "get_handler_filter",
+]
+
+import datetime
+import logging
+import os
+from logging import LogRecord
+from pathlib import Path
+from typing import Callable
+
+
+def get_handler_filter(
+    hostname: str,
+    local_rank: int | None,  # None indicates agent
+    log_level: int = logging.NOTSET,
+) -> Callable[[LogRecord], bool]:
+    """Get an agent- or worker- specific filter to apply to :obj:`logging.Handler`."""
+    return lambda record: (
+        record.hostname == hostname  # pyright: ignore [reportAttributeAccessIssue]
+        and record.local_rank == local_rank  # pyright: ignore [reportAttributeAccessIssue]
+        and record.levelno >= log_level
+    )
+
+
+class RedirectHandler(logging.Handler):
+    """For handling logs from hostname/rank with a corresponding logger in the launcher process."""
+
+    def emit(self, record: LogRecord) -> None:
+        """Handle log record using corresponding logger."""
+        logger = logging.getLogger(record.name)
+        if logger.isEnabledFor(record.levelno):
+            logger.handle(record)
+
+
+def file_handlers(
+    hostnames: list[str],
+    workers_per_host: list[int],
+    log_dir: str | os.PathLike = Path("torchrunx_logs"),
+    log_level: int = logging.NOTSET,
+) -> list[logging.Handler]:
+    """Handler builder function for writing logs for all workers/agents to a directory.
+
+    Files are named with hostname and the local_rank (for workers).
+    """
+    handlers = []
+
+    timestamp = datetime.datetime.now().isoformat(timespec="seconds")
+    log_dir = Path(log_dir) / timestamp
+    log_dir.mkdir(parents=True, exist_ok=True)
+
+    formatter = logging.Formatter(
+        "%(asctime)s:%(levelname)s: %(message)s", datefmt="%Y-%m-%d %H:%M:%S"
+    )
+
+    for hostname, num_workers in zip(hostnames, workers_per_host):
+        for local_rank in [None, *range(num_workers)]:
+            local_rank_str = f"[{local_rank}]" if local_rank is not None else ""
+            file_path = log_dir / f"{hostname}{local_rank_str}.log"
+
+            h = logging.FileHandler(file_path)
+            h.addFilter(get_handler_filter(hostname, local_rank, log_level=log_level))
+            h.setFormatter(formatter)
+
+            handlers.append(h)
+
+    return handlers
+
+
+def default_handlers(hostnames: list[str], workers_per_host: list[int]) -> list[logging.Handler]:
+    """Constructs default :obj:`logging.Handler` objects.
+
+    Logs for the rank 0 agent and rank 0 worker are redirected to loggers in the launcher process.
+    Logs for all hosts/workers are written to files in ``$TORCHRUNX_LOG_DIR`` (named by timestamp,
+    hostname, local_rank).
+    """
+    log_dir = Path(os.environ.get("TORCHRUNX_LOG_DIR", "torchrunx_logs"))
+
+    file_log_level = os.environ.get("TORCHRUNX_LOG_LEVEL", "INFO")
+    if file_log_level.isdigit():
+        file_log_level = int(file_log_level)
+    elif file_log_level in logging._nameToLevel:  # noqa: SLF001
+        file_log_level = logging._nameToLevel[file_log_level]  # noqa: SLF001
+    else:
+        msg = (
+            f"Invalid value for $TORCHRUNX_LOG_LEVEL: {file_log_level}. "
+            f"Should be a positive integer or any of: {', '.join(logging._nameToLevel.keys())}."  # noqa: SLF001
+        )
+        raise ValueError(msg)
+
+    redirect_agent_0_handler = RedirectHandler()
+    redirect_agent_0_handler.addFilter(get_handler_filter(hostnames[0], None))
+
+    redirect_worker_0_handler = RedirectHandler()
+    redirect_worker_0_handler.addFilter(get_handler_filter(hostnames[0], 0))
+
+    return [
+        redirect_agent_0_handler,
+        redirect_worker_0_handler,
+        *file_handlers(hostnames, workers_per_host, log_dir=log_dir, log_level=file_log_level),
+    ]
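
For example, the new builders can be combined with `Launcher.set_logging_handlers`. Here is a small usage sketch (the hostnames, log directory, and `distributed_training` function are illustrative):

```python
import logging

import torchrunx
from torchrunx.utils.log_handling import file_handlers, get_handler_filter


def distributed_training():
    ...  # runs on every worker, as in the README example


def my_handlers() -> list[logging.Handler]:
    hostnames = ["localhost"]

    # per-host and per-worker log files under ./my_logs/<timestamp>/
    handlers = file_handlers(hostnames, workers_per_host=[2], log_dir="my_logs")

    # additionally, echo worker 0's logs to the launcher's console
    console = logging.StreamHandler()
    console.addFilter(get_handler_filter(hostname=hostnames[0], local_rank=0))
    handlers.append(console)

    return handlers


launcher = torchrunx.Launcher(hostnames=["localhost"], workers_per_host=2)
launcher.set_logging_handlers(my_handlers).run(distributed_training)
```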
