[Train] [doc] Local mode user guide #57751

.. _train-local-mode:

Local Mode
==========

.. important::
    This user guide shows how to use local mode for Ray Train V2 only.

**This user guide assumes that you're using Ray Train V2 with the environment variable ``RAY_TRAIN_V2_ENABLED=1`` set.**

Local mode in Ray Train allows you to run your training function directly in the local process.
This is useful for testing, development, and quick iterations.

When to use local mode
----------------------

Local mode is particularly useful in the following scenarios:

* **Isolated testing**: Test your training function without the Ray Train framework overhead, making it easier to locate issues in your training logic.
* **Unit testing**: Test your training logic quickly without the overhead of launching a Ray cluster.
* **Development**: Iterate rapidly on your training function before scaling to distributed training.
* **Multi-worker training with torchrun**: Launch multi-GPU training jobs with torchrun directly, making it easier to migrate from other frameworks.

.. note::
    In local mode, Ray Train doesn't launch Ray Train worker actors, but your code may still launch
    other Ray actors if needed. Local mode works with both single-process and multi-process training
    (through torchrun).

Enabling local mode
-------------------

To enable local mode, set ``num_workers=0`` in your :class:`~ray.train.ScalingConfig`:

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Your training logic
        pass

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=0),
    )
    result = trainer.fit()

This runs your training function in the main process without launching Ray Train worker actors.

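Because the training function runs in the driver process, you can confirm local mode with a quick process ID check. The following is a minimal sketch, not part of the original example; in local mode, both print statements show the same PID:

.. code-block:: python

    import os

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # In local mode, this runs in the same process as the driver code below.
        print(f"PID inside train_func: {os.getpid()}")

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=0),
    )
    trainer.fit()
    print(f"PID in the driver process: {os.getpid()}")
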
Single-process local mode
-------------------------

The following example demonstrates single-process local mode with PyTorch:

.. code-block:: python

    import torch
    from torch import nn
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        model = nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])

        for epoch in range(config["epochs"]):
            # Training loop
            optimizer.zero_grad()
            loss = model(torch.randn(32, 10)).sum()
            loss.backward()
            optimizer.step()

            # Report metrics
            ray.train.report({"loss": loss.item()})

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config={"lr": 0.01, "epochs": 3},
        scaling_config=ScalingConfig(num_workers=0),
    )
    result = trainer.fit()
    print(f"Final loss: {result.metrics['loss']}")

.. note::
    Local mode works with all Ray Train framework integrations, including PyTorch Lightning,
    Hugging Face Transformers, LightGBM, XGBoost, TensorFlow, and others.

Multi-process local mode with torchrun
--------------------------------------

Local mode supports multi-GPU training through torchrun. This allows you to launch distributed training
across multiple GPUs or nodes while still using the Ray Train APIs but without the Ray Train worker
orchestration, isolating your training logic from the framework overhead.

Single-node multi-GPU training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following example shows how to use torchrun with local mode for multi-GPU training on a single node.
This example uses a standard PyTorch DataLoader for data loading, making it easy to adapt existing PyTorch training code.

First, create your training script (``train_script.py``):

.. code-block:: python

    import os
    import tempfile
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision.datasets import FashionMNIST
    from torchvision.transforms import ToTensor, Normalize, Compose
    from filelock import FileLock
    import ray
    from ray.train import Checkpoint, ScalingConfig, get_context
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Load dataset with file locking to avoid multiple downloads
        transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
        data_dir = "./data"
        # Create the data directory before acquiring the file lock
        os.makedirs(data_dir, exist_ok=True)

        # Only local rank 0 downloads the dataset
        local_rank = get_context().get_local_rank()
        if local_rank == 0:
            with FileLock(os.path.join(data_dir, "fashionmnist.lock")):
                train_dataset = FashionMNIST(
                    root=data_dir, train=True, download=True, transform=transform
                )

        # Wait for rank 0 to finish downloading
        dist.barrier()

        # Now all ranks can safely load the dataset
        train_dataset = FashionMNIST(
            root=data_dir, train=True, download=False, transform=transform
        )
        train_loader = DataLoader(
            train_dataset, batch_size=config["batch_size"], shuffle=True
        )

        # Prepare dataloader for distributed training
        train_loader = ray.train.torch.prepare_data_loader(train_loader)

        # Prepare model for distributed training
        model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )
        model = ray.train.torch.prepare_model(model)

        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

        # Training loop
        for epoch in range(config["epochs"]):
            # Set epoch for the distributed sampler
            if ray.train.get_context().get_world_size() > 1:
                train_loader.sampler.set_epoch(epoch)

            epoch_loss = 0.0
            for batch_idx, (images, labels) in enumerate(train_loader):
                outputs = model(images)
                loss = criterion(outputs, labels)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                epoch_loss += loss.item()

            avg_loss = epoch_loss / len(train_loader)

            # Report metrics and checkpoint
            with tempfile.TemporaryDirectory() as temp_dir:
                torch.save(model.state_dict(), os.path.join(temp_dir, "model.pt"))
                ray.train.report(
                    {"loss": avg_loss, "epoch": epoch},
                    checkpoint=Checkpoint.from_directory(temp_dir)
                )

    # Configure trainer for local mode
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config={"lr": 0.001, "epochs": 10, "batch_size": 32},
        scaling_config=ScalingConfig(num_workers=0, use_gpu=True),
    )
    result = trainer.fit()

Then, launch the training with torchrun:

.. code-block:: bash

    # Train on 4 GPUs on a single node
    RAY_TRAIN_V2_ENABLED=1 torchrun --nproc-per-node=4 train_script.py

Ray Train automatically detects the torchrun environment variables and configures the distributed
training accordingly. You can access distributed training information through :func:`ray.train.get_context`:

.. code-block:: python

    from ray.train import get_context

    context = get_context()
    print(f"World size: {context.get_world_size()}")
    print(f"World rank: {context.get_world_rank()}")
    print(f"Local rank: {context.get_local_rank()}")

Multi-node multi-GPU training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also use torchrun to launch multi-node training with local mode. The following example shows
how to launch training across 2 nodes with 4 GPUs each.

On the master node (``192.168.1.1``):

.. code-block:: bash

    RAY_TRAIN_V2_ENABLED=1 torchrun \
        --nnodes=2 \
        --nproc-per-node=4 \
        --node_rank=0 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=192.168.1.1:29500 \
        --rdzv_id=job_id \
        train_script.py

On the worker node:

.. code-block:: bash

    RAY_TRAIN_V2_ENABLED=1 torchrun \
        --nnodes=2 \
        --nproc-per-node=4 \
        --node_rank=1 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=192.168.1.1:29500 \
        --rdzv_id=job_id \
        train_script.py

.. note::
    When using torchrun with local mode, Ray Data isn't supported. The training relies on standard
    PyTorch data loading mechanisms.

Using local mode with Ray Data
------------------------------

Single-process local mode works seamlessly with Ray Data for data loading and preprocessing.
The data is processed and provided to your training function without distributed workers.

The following example shows how to use Ray Data with single-process local mode:

.. code-block:: python

    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        # Get the dataset shard
        train_dataset = ray.train.get_dataset_shard("train")

        # Iterate over batches
        for batch in train_dataset.iter_batches(batch_size=32):
            # Training logic
            pass

    # Create a Ray Dataset
    dataset = ray.data.read_csv("s3://bucket/data.csv")

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=0),
        datasets={"train": dataset},
    )
    result = trainer.fit()

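For quick local iteration, you can also swap the remote CSV for a small in-memory dataset. The following is a minimal sketch, not part of the original example: the ``"x"`` and ``"y"`` column names are placeholders, and ``iter_torch_batches`` yields batches as dictionaries of tensors:

.. code-block:: python

    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func(config):
        train_dataset = ray.train.get_dataset_shard("train")

        # Each batch is a dict mapping column names to torch.Tensor objects.
        for batch in train_dataset.iter_torch_batches(batch_size=8):
            features, labels = batch["x"], batch["y"]  # placeholder column names
            # Training logic goes here.

    # A small in-memory dataset keeps local iteration fast and self-contained.
    dataset = ray.data.from_items([{"x": float(i), "y": 2.0 * i} for i in range(64)])

    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=0),
        datasets={"train": dataset},
    )
    result = trainer.fit()
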
.. note::
    Ray Data isn't supported when using torchrun with local mode for multi-process training.

Testing with local mode
-----------------------

Local mode is excellent for unit testing your training logic. The following example shows how to write
a unit test with local mode:

.. code-block:: python

    import unittest
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    class TestTraining(unittest.TestCase):

        def test_training_runs(self):
            def train_func(config):
                # Minimal training logic
                ray.train.report({"loss": 0.5})

            trainer = TorchTrainer(
                train_loop_per_worker=train_func,
                scaling_config=ScalingConfig(num_workers=0),
            )
            result = trainer.fit()

            self.assertIsNone(result.error)
            self.assertEqual(result.metrics["loss"], 0.5)

    if __name__ == "__main__":
        unittest.main()

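If your test suite uses pytest rather than unittest, the same pattern works as a plain test function. This is an equivalent sketch of the test above:

.. code-block:: python

    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def test_training_runs():
        def train_func(config):
            # Minimal training logic
            ray.train.report({"loss": 0.5})

        trainer = TorchTrainer(
            train_loop_per_worker=train_func,
            scaling_config=ScalingConfig(num_workers=0),
        )
        result = trainer.fit()

        assert result.error is None
        assert result.metrics["loss"] == 0.5
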
Limitations
-----------

Local mode has the following limitations:

* **No Ray Train fault tolerance**: Worker-level fault tolerance doesn't apply in local mode.
* **Ray Data limitations**: Ray Data isn't supported when using torchrun with local mode for multi-process training.
* **Different behavior**: Some framework-specific behaviors might differ slightly from Ray Train's distributed mode.

Transitioning from local mode to distributed training
-----------------------------------------------------

When you're ready to scale from local mode to distributed training, change ``num_workers``
to a value greater than 0:

.. code-block:: diff

     trainer = TorchTrainer(
         train_loop_per_worker=train_func,
         train_loop_config=config,
    -    scaling_config=ScalingConfig(num_workers=0),
    +    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
     )

Your training function code remains the same, and Ray Train handles the distributed coordination automatically.