Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
Copyright (c) 2025 Oracle and/or its affiliates.

The Universal Permissive License (UPL), Version 1.0

Subject to the condition set forth below, permission is hereby granted to any
person obtaining a copy of this software, associated documentation and/or data
(collectively the "Software"), free of charge and under any and all copyright
rights in the Software, and any and all patent rights owned or freely
licensable by each licensor hereunder covering either (i) the unmodified
Software as contributed to or provided by such licensor, or (ii) the Larger
Works (as defined below), to deal in both

(a) the Software, and
(b) any piece of software and/or hardware listed in the lrgrwrks.txt file if
one is included with the Software (each a "Larger Work" to which the Software
is contributed by such licensors),

without restriction, including without limitation the rights to copy, create
derivative works of, display, perform, and distribute the Software and make,
use, sell, offer for sale, import, export, have made, and have sold the
Software and the Larger Work(s), and to sublicense the foregoing rights on
either these or other terms.

This license is subject to the following condition:
The above copyright notice and either this complete permission notice or at
a minimum a reference to the UPL must be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# Overview

This repository provides a step-by-step deployment of DeepSpeed training for Large Language Models (LLMs) on Oracle Cloud Infrastructure (OCI), using H100 GPU clusters with RDMA and SLURM.

This setup includes a tuned DeepSpeed configuration (`tuned_ds_config.json`) that provides up to **13% performance improvement** over standard configurations.

Reviewed: 06.06.2025
# When to use this asset?

Use this asset when you need to:
- Train large-scale language models on OCI with H100 hardware.
- Utilize RDMA-enabled SLURM clusters for distributed multi-node DeepSpeed training.
- Achieve improved throughput via custom-tuned DeepSpeed JSON configs.

# How to use this asset?
- deploy OCI HPC stack with H100 multiple instances.
- Improve training performance by using a tuned configuration for the deepspeed LLM model.

## Prerequisites & Docs

### Prerequisites

* An OCI tenancy with H100 GPU quota (shape: BM.GPU.H100.8).
* A [Huggingface](https://huggingface.co/) account with a valid Auth Token.
* SSH access to the deployed head node of your SLURM cluster.

### Documentation & Resources

* [DeepSpeed Documentation](https://www.deepspeed.ai/docs/)
* [TinyLlama Model (HF)](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
* [Mistral LLMs](https://mistral.ai/technology/#models)
* [OCI HPC Stack](https://github.com/oracle-quickstart/oci-hpc)

## Model Training Workflow
- please refer files/README.md for more details

### Instance Configuration

The deployment uses a cluster of `BM.GPU.H100.8` bare metal instances, provisioned with cluster networking and RDMA.

The DeepSpeed job is submitted via SLURM using the `run_deepspeed.slurm` script. The environment includes a shared OCI File Storage System mounted on all nodes.

### DeepSpeed Tuned Configuration

The `tuned_ds_config.json` applies the following optimizations:
- Switched from fp16 to bf16 (optimal for H100)
- Enabled overlap_comm, contiguous_gradients, and increased bucket sizes
- Used gradient_accumulation_steps=8 to balance memory use and throughput
- Tweaked aio settings for better I/O performance during training
- Removed optimizer/parameter offloading to fully utilize GPU RA

These optimizations are benchmarked to deliver up to **13% faster training throughput** on OCI H100 clusters.

### Launch Training Job

Submit your training job using SLURM:

```bash
sbatch $HOME$/scripts/run_deepspeed.slurm
```

The job script uses:
- `train.py`: your LLM training script
- `tuned_ds_config.json`: DeepSpeed configuration file
- Local datasets and Hugging Face model/tokenizer

### Example curl Test (after model fine-tuning)

To serve the trained model via OpenAI-compatible API:

```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"prompt": "A GPU is a",
"max_tokens": 128,
"temperature": 0.7
}'
```

## Notes

To train larger models like Mixtral or Mistral 7B on H100, make sure to:
- Scale the number of nodes appropriately
- Use quantization or tensor parallelism when needed
- Ensure models and datasets fit into GPU memory with DeepSpeed ZeRO optimization

# Acknowledgments

- **Author** - Deepak Soni (GPU Black Belt)

# License

Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# DeepSpeed LLM Training on OCI H100 SLURM Cluster

This repository automates deployment of a multi-node SLURM cluster with RDMA-enabled H100 GPUs on OCI for training large language models using DeepSpeed.

## Tuned Configuration

We developed a custom-tuned deepspeed_config.json tailored for:
- Multi-node training
- RDMA-aware NCCL backend
- H100’s bfloat16-optimized tensor cores
- DeepSpeed ZeRO Stage 2 with communication overlap

The `tuned_ds_config.json` includes:
- Switched from fp16 to bf16 (optimal for H100)
- Enabled overlap_comm, contiguous_gradients, and increased bucket sizes
- Used gradient_accumulation_steps=8 to balance memory use and throughput
- Tweaked aio settings for better I/O performance during training
- Removed optimizer/parameter offloading to fully utilize GPU RAM


This configuration delivers up to **13% more training throughput** versus default settings on OCI H100 infrastructure.

## With this updated configuration:
- Training throughput improved by ~13%
- GPU utilization increased more consistently across all 8 nodes
- Communication latency reduced on RDMA fabric
- No stability or memory issues observed with ZeRO Stage 2

## 📂 Contents

- `scripts/tuned_ds_config.json` – optimized DeepSpeed configuration
- `scripts/run_deepspeed.slurm` – job script for SLURM
- `README.md` – usage overview and tuning explanation

## Usage

1. Deploy SLURM H100 cluster on OCI
2. SSH to master node
3. Submit the job:

```bash
sbatch run_deepspeed.slurm
```

Model output and logs will be written to `$HOME/output`.

## Conclusion
- NCCL tuning alone isn’t always sufficient — framework-level configuration (DeepSpeed) must align with hardware.
- H100 GPUs benefit significantly from bfloat16 and increased comm overlap.
- ZeRO Stage 2 provided a solid balance of memory efficiency and speed. ZeRO-3 is reserved for future scaling.
- System-aware configuration (bucket sizes, threading, and memory layout) is essential for reaching peak performance.

## Next Steps
- Benchmark with ZeRO Stage 3 for models approaching GPU memory limits.
- Test pipeline parallelism on >16 node jobs.
- Evaluate DeepSpeed 0.13+ features such as NVMe offloading and optimizer fusion on upcoming jobs.
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
#!/bin/bash

set -ex

source myenv/bin/activate

## NCCL parameters configuration based on OCI H100 GPU Instance deployment
export NCCL_TIMEOUT=1800

export NCCL_IGNORE_CPU_AFFINITY=1
export OMPI_MCA_coll_hcol_enable=0
export NCCL_CROSS_NIC=2
export NCCL_SOCKET_NTHREADS=16
export NCCL_DEBUG=DEBUG
export NCCL_CUMEM_ENABLE=0
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=16
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA="mlx5_0,mlx5_1,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17"
export NCCL_IB_TC=41
export NCCL_IB_SL=0
export NCCL_IB_TIMEOUT=22
export HCOLL_ENABLE_MCAST_ALL=0
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export RX_QUEUE_LEN=8192
export NCCL_SOCKET_IFNAME=eth0

export OMP_NUM_THREADS=16 # should be optimally number of CPU cores / number of GPUs per node

export GPUS_PER_NODE=8
MASTER_NODE=$(scontrol show hostname | head -n 1)
export MASTER_ADDR=$(scontrol show node=$MASTER_NODE | awk -F= '/NodeAddr=/{print $2}' | awk '{print $1}')
export NNODES=$SLURM_NTASKS
export NODE_RANK=$SLURM_NODEID
export MASTER_PORT=9001
export WORLD_SIZE_JOB=$SLURM_NTASKS
export DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT "

torchrun $DISTRIBUTED_ARGS \
train.py \
--model_config tuned_ds_config.json \
--tokenizer_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--dataset_mixer data_mixer.json \
--dataset_name mix \
--dataset_type local \
--dataset_packed \
--batch_size 12 \
--gradient_checkpointing \
--max_train_steps 1000000 \
--val_after_steps 10000 \
--num_warmup_steps 10000 \
--learning_rate 1e-4 \
--num_gpus_node $GPUS_PER_NODE \
--gradient_clipping 1 \
--gradient_accumulation_steps 2 \
--dataset_cache "./hf-cache"

Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
#!/bin/bash

#SBATCH --nodes=4
#SBATCH --job-name=deepspeed-performance-test
#SBATCH --exclusive
srun -l exec_torchrun.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
{
"train_batch_size": 2048,
"train_micro_batch_size_per_gpu": 32,
"gradient_accumulation_steps": 8,
"steps_per_print": 100,
"wall_clock_breakdown": false,

"bf16": {
"enabled": true
},

"optimizer": {
"type": "AdamW",
"params": {
"lr": 1e-4,
"betas": [0.9, 0.999],
"eps": 1e-8,
"weight_decay": 0.01
}
},

"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 1e-4,
"warmup_num_steps": 10000
}
},

"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"overlap_comm": true,
"contiguous_gradients": true
},

"gradient_clipping": 1.0,

"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": true,
"cpu_checkpointing": false,
"number_checkpoints": null
},

"aio": {
"block_size": 1048576,
"queue_depth": 16,
"single_submit": false,
"overlap_events": true
},

"flops_profiler": {
"enabled": false,
"profile_step": 10,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
},

"elasticity": {
"enabled": false
},

"gradient_accumulation_plugin": {
"enabled": true
}
}