``doc/source/train/user-guides.rst`` (1 addition: registers the new page in the Ray Train User Guides toctree):

.. code-block:: text

    user-guides/data-loading-preprocessing
    user-guides/using-gpus
    user-guides/local_mode
    user-guides/persistent-storage
    user-guides/monitoring-logging
    user-guides/checkpoints

``doc/source/train/user-guides/local_mode.rst`` (new file, 338 additions):
.. _train-local-mode:

Local Mode
==========

.. important::

    This user guide covers local mode for Ray Train V2 only. **It assumes that you're
    running Ray Train V2 with the environment variable ``RAY_TRAIN_V2_ENABLED=1`` set.**

Local mode in Ray Train allows you to run your training function directly in the local process.
This is useful for testing, development, and quick iterations.

When to use local mode
----------------------

Local mode is particularly useful in the following scenarios:

* **Isolated testing**: Test your training function without the Ray Train framework overhead, making it easier to locate issues in your training logic.
* **Unit testing**: Test your training logic quickly without the overhead of launching a Ray cluster.
* **Development**: Iterate rapidly on your training function before scaling to distributed training.
* **Multi-worker training with torchrun**: Launch multi-GPU training jobs with torchrun directly, making it easier to migrate from other frameworks.

.. note::

    In local mode, Ray Train doesn't launch Ray Train worker actors, but your code may still
    launch other Ray actors if needed. Local mode works with both single-process and
    multi-process training (through torchrun).

Enabling local mode
-------------------

To enable local mode, set ``num_workers=0`` in your :class:`~ray.train.ScalingConfig`:

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Your training logic
        pass

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=0),
    )
    result = trainer.fit()

This runs your training function in the main process without launching Ray Train worker actors.

Single-process local mode
-------------------------

The following example demonstrates single-process local mode with PyTorch:

.. code-block:: python

    import torch
    from torch import nn
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        model = nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])

        # Training loop
        for epoch in range(config["epochs"]):
            # Reset gradients before each backward pass so they don't accumulate
            optimizer.zero_grad()
            loss = model(torch.randn(32, 10)).sum()
            loss.backward()
            optimizer.step()

            # Report metrics
            ray.train.report({"loss": loss.item()})

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config={"lr": 0.01, "epochs": 3},
        scaling_config=ScalingConfig(num_workers=0),
    )
    result = trainer.fit()
    print(f"Final loss: {result.metrics['loss']}")

.. note::

    Local mode works with all Ray Train framework integrations, including PyTorch Lightning,
    Hugging Face Transformers, LightGBM, XGBoost, TensorFlow, and others.

Multi-process local mode with torchrun
---------------------------------------

Local mode supports multi-GPU training through torchrun. This lets you launch distributed training
across multiple GPUs or nodes while still using the Ray Train APIs, but without the Ray Train
worker actors, isolating your training logic from the framework overhead.

Single-node multi-GPU training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following example shows how to use torchrun with local mode for multi-GPU training on a single node.
This example uses standard PyTorch DataLoader for data loading, making it easy to adapt existing PyTorch training code.

First, create your training script (``train_script.py``):

.. code-block:: python

    import os
    import tempfile
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision.datasets import FashionMNIST
    from torchvision.transforms import ToTensor, Normalize, Compose
    from filelock import FileLock
    import ray
    from ray.train import Checkpoint, ScalingConfig, get_context
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Load dataset with file locking to avoid multiple downloads
        transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
        data_dir = "./data"
        os.makedirs(data_dir, exist_ok=True)

        # Only local rank 0 downloads the dataset
        local_rank = get_context().get_local_rank()
        if local_rank == 0:
            with FileLock(os.path.join(data_dir, "fashionmnist.lock")):
                train_dataset = FashionMNIST(
                    root=data_dir, train=True, download=True, transform=transform
                )

        # Wait for rank 0 to finish downloading
        dist.barrier()

        # Now all ranks can safely load the dataset
        train_dataset = FashionMNIST(
            root=data_dir, train=True, download=False, transform=transform
        )
        train_loader = DataLoader(
            train_dataset, batch_size=config["batch_size"], shuffle=True
        )

        # Prepare dataloader for distributed training
        train_loader = ray.train.torch.prepare_data_loader(train_loader)

        # Prepare model for distributed training
        model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )
        model = ray.train.torch.prepare_model(model)

        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

        # Training loop
        for epoch in range(config["epochs"]):
            # Set the epoch for the distributed sampler so shuffling differs per epoch
            if ray.train.get_context().get_world_size() > 1:
                train_loader.sampler.set_epoch(epoch)

            epoch_loss = 0.0
            for batch_idx, (images, labels) in enumerate(train_loader):
                outputs = model(images)
                loss = criterion(outputs, labels)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                epoch_loss += loss.item()

            avg_loss = epoch_loss / len(train_loader)

            # Report metrics and checkpoint
            with tempfile.TemporaryDirectory() as temp_dir:
                torch.save(model.state_dict(), os.path.join(temp_dir, "model.pt"))
                ray.train.report(
                    {"loss": avg_loss, "epoch": epoch},
                    checkpoint=Checkpoint.from_directory(temp_dir),
                )

    # Configure trainer for local mode
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config={"lr": 0.001, "epochs": 10, "batch_size": 32},
        scaling_config=ScalingConfig(num_workers=0, use_gpu=True),
    )
    result = trainer.fit()

Then, launch the training with torchrun:

.. code-block:: bash

    # Train on 4 GPUs on a single node
    RAY_TRAIN_V2_ENABLED=1 torchrun --nproc-per-node=4 train_script.py

Ray Train automatically detects the torchrun environment variables and configures the distributed
training accordingly. You can access distributed training information through :func:`ray.train.get_context()`:

.. code-block:: python

from ray.train import get_context

context = get_context()
print(f"World size: {context.get_world_size()}")
print(f"World rank: {context.get_world_rank()}")
print(f"Local rank: {context.get_local_rank()}")
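Torchrun communicates rank information through environment variables such as ``RANK``, ``WORLD_SIZE``, and ``LOCAL_RANK``. The following sketch is only an illustration of that mapping, not Ray Train's actual detection code:

```python
import os

# Illustrative only: torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for
# every process it launches. Ray Train's real detection logic may differ.
def parse_torchrun_env(environ=os.environ):
    return {
        "world_rank": int(environ.get("RANK", "0")),
        "world_size": int(environ.get("WORLD_SIZE", "1")),
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
    }

# Example: values torchrun would set for global rank 5 of an 8-process job
# spread over 2 nodes with 4 processes each (second process on its node).
info = parse_torchrun_env({"RANK": "5", "WORLD_SIZE": "8", "LOCAL_RANK": "1"})
print(info)  # {'world_rank': 5, 'world_size': 8, 'local_rank': 1}
```

Outside a torchrun launch, none of these variables are set and the defaults correspond to single-process local mode.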

Multi-node multi-GPU training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also use torchrun to launch multi-node training with local mode. The following example shows
how to launch training across 2 nodes with 4 GPUs each:

On the master node (``192.168.1.1``):

.. code-block:: bash

    RAY_TRAIN_V2_ENABLED=1 torchrun \
        --nnodes=2 \
        --nproc-per-node=4 \
        --node-rank=0 \
        --rdzv-backend=c10d \
        --rdzv-endpoint=192.168.1.1:29500 \
        --rdzv-id=job_id \
        train_script.py

On the worker node:

.. code-block:: bash

    RAY_TRAIN_V2_ENABLED=1 torchrun \
        --nnodes=2 \
        --nproc-per-node=4 \
        --node-rank=1 \
        --rdzv-backend=c10d \
        --rdzv-endpoint=192.168.1.1:29500 \
        --rdzv-id=job_id \
        train_script.py

.. note::

    When using torchrun with local mode, Ray Data isn't supported. The training relies on standard
    PyTorch data loading mechanisms.

Using local mode with Ray Data
-------------------------------

Single-process local mode works seamlessly with Ray Data for data loading and preprocessing.
The data is processed and provided to your training function without distributed workers.

The following example shows how to use Ray Data with single-process local mode:

.. code-block:: python

    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Get the dataset shard
        train_dataset = ray.train.get_dataset_shard("train")

        # Iterate over batches
        for batch in train_dataset.iter_batches(batch_size=32):
            # Training logic
            pass

    # Create a Ray Dataset
    dataset = ray.data.read_csv("s3://bucket/data.csv")

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=0),
        datasets={"train": dataset},
    )
    result = trainer.fit()

.. note::

    Ray Data isn't supported when using torchrun with local mode for multi-process training.

Testing with local mode
-----------------------

Local mode is excellent for unit testing your training logic. The following example shows how to write
a unit test with local mode:

.. code-block:: python

    import unittest
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    class TestTraining(unittest.TestCase):
        def test_training_runs(self):
            def train_func(config):
                # Minimal training logic
                ray.train.report({"loss": 0.5})

            trainer = TorchTrainer(
                train_loop_per_worker=train_func,
                scaling_config=ScalingConfig(num_workers=0),
            )
            result = trainer.fit()

            self.assertIsNone(result.error)
            self.assertEqual(result.metrics["loss"], 0.5)

    if __name__ == "__main__":
        unittest.main()

Limitations
-----------

Local mode has the following limitations:

* **No Ray Train fault tolerance**: Worker-level fault tolerance doesn't apply in local mode.
* **Ray Data limitations**: Ray Data isn't supported when using torchrun with local mode for multi-process training.
* **Different behavior**: Some framework-specific behaviors might differ slightly from Ray Train's distributed mode.

Transitioning from local mode to distributed training
-----------------------------------------------------

When you're ready to scale from local mode to distributed training, change ``num_workers``
to a value greater than 0:

.. code-block:: diff

     trainer = TorchTrainer(
         train_loop_per_worker=train_func,
         train_loop_config=config,
    -    scaling_config=ScalingConfig(num_workers=0),
    +    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
     )

Your training function code remains the same, and Ray Train handles the distributed coordination automatically.
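One common pattern, shown here as a hedged sketch, is to drive the scaling decision from an environment variable so the same script runs in local mode during development and distributed mode in production. ``NUM_WORKERS`` and ``scaling_kwargs`` are illustrative names, not Ray conventions:

```python
import os

# Hypothetical helper: build ScalingConfig keyword arguments from an
# environment variable. NUM_WORKERS is an arbitrary, illustrative name.
def scaling_kwargs(environ=os.environ):
    num_workers = int(environ.get("NUM_WORKERS", "0"))
    return {
        "num_workers": num_workers,
        # Only request GPUs once you actually launch remote workers.
        "use_gpu": num_workers > 0,
    }

# NUM_WORKERS unset -> local mode; NUM_WORKERS=4 -> 4 distributed workers.
print(scaling_kwargs({}))                    # {'num_workers': 0, 'use_gpu': False}
print(scaling_kwargs({"NUM_WORKERS": "4"}))  # {'num_workers': 4, 'use_gpu': True}
```

You would then pass ``ScalingConfig(**scaling_kwargs())`` to the trainer, leaving the training function itself untouched.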