End-to-end examples for running AI/ML workloads on Azure Kubernetes Service (AKS) with GPU acceleration, powered by KubeRay and Karpenter.
This repository demonstrates a hybrid multi-cloud architecture where an AKS control plane manages GPU nodes across both Azure and Nebius Cloud, connected via VPN. All examples support both cloud providers through Kustomize overlays.
Key infrastructure components:
- Flex Karpenter for automatic node provisioning and autoscaling
- Kubernetes DRA (Dynamic Resource Allocation) with NVIDIA DRA driver for topology-aware GPU scheduling
- NVIDIA H100 80GB GPU instances on both Azure and Nebius
- KubeRay operator for managing Ray clusters on Kubernetes
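To give a feel for how DRA-based GPU requests are expressed, a ResourceClaimTemplate for a single GPU might look roughly like the sketch below. This is illustrative only (the metadata name is hypothetical); the actual templates live under each example's `base/` directory.

```yaml
# Illustrative sketch - the real manifests live in each example's base/ directory.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu-claim-template   # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          # Device class advertised by the NVIDIA DRA driver.
          deviceClassName: gpu.nvidia.com
```

A RayJob's pod template then references this claim template instead of the classic `nvidia.com/gpu` resource limit, letting the scheduler make topology-aware placement decisions.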
LLM examples:

| Example | Description |

|---|---|
| Distributed Inference | Benchmark LLM inference throughput and latency using Ray Data LLM with vLLM. Default model: Qwen2.5-7B-Instruct. |
| Fine-Tuning | LoRA SFT on Qwen2.5-7B-Instruct using Ray Train and LLaMA-Factory for entity recognition on the Viggo dataset. |
Vision examples:

| Example | Description |
|---|---|
| Batch Inference | Generate CLIP image embeddings at scale using Ray Data with GPU actors, with cosine similarity search. |
| Distributed Training | Train an image classifier on CLIP embeddings using Ray Train with PyTorch DDP and MLflow tracking. |
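The cosine-similarity search mentioned in the Batch Inference example can be sketched in plain NumPy. This is a standalone illustration, not the repository's implementation — in the example itself the embeddings come from CLIP via Ray Data GPU actors:

```python
import numpy as np

def cosine_similarity_search(query: np.ndarray, embeddings: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of the top_k rows of `embeddings` most similar to `query`."""
    # Normalize so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = embeddings @ query
    # argsort is ascending; reverse for descending similarity, then truncate.
    return np.argsort(scores)[::-1][:top_k]

# Toy usage with random vectors standing in for CLIP embeddings.
rng = np.random.default_rng(0)
db = rng.normal(size=(100, 512))
q = db[42] + 0.01 * rng.normal(size=512)  # near-duplicate of row 42
print(cosine_similarity_search(q, db, top_k=3)[0])  # → 42
```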
Infrastructure examples:

| Example | Description |
|---|---|
| Autoscaling | Karpenter node pool configurations for automatic CPU and GPU node provisioning on Azure and Nebius. |
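For orientation, a GPU node pool in the Autoscaling example might look roughly like the sketch below. All field values here are illustrative, not copied from the repository; see the Autoscaling example itself for the real manifests.

```yaml
# Illustrative sketch of a Karpenter GPU NodePool - not the repository's actual manifest.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-h100   # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com   # Azure provider's node class API group
        kind: AKSNodeClass
        name: default                # hypothetical node class
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: "8"   # cap total GPUs this pool may provision
```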
Each example follows a consistent layout:
- `main.py` - Application entry point
- `run.sh` - One-command launcher
- `base/` - Kustomize base manifests (RayJob + DRA ResourceClaimTemplate)
- `overlays/{azure,nebius}/` - Cloud-specific Kustomize patches
- An AKS cluster with GPU node pools (NVIDIA H100 recommended)
- KubeRay operator v1.5.1+ installed
- NVIDIA DRA driver deployed for GPU scheduling
- Karpenter enabled (for autoscaling examples)
- `kubectl` and `kustomize` CLI tools
- Ensure your AKS cluster and prerequisites are configured.
- Navigate to the example you want to run (e.g., `examples/llm/distributed-inferencing/`).
- Review the example's README for specific configuration details.
- Run the example:
```shell
# Example: LLM Inference on Azure
./examples/llm/distributed-inferencing/run.sh azure
```

Each example's `run.sh` script handles applying the Kustomize manifests and submitting the RayJob to the cluster.
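If you prefer to drive the steps by hand, the equivalent of what `run.sh` automates is roughly the following. Paths and overlay names are assumed from the layout described above; the scripts themselves are authoritative.

```shell
# Render the cloud-specific overlay and apply it, which creates the
# RayJob and its DRA ResourceClaimTemplate in the cluster.
kustomize build examples/llm/distributed-inferencing/overlays/azure | kubectl apply -f -

# Watch the RayJob until KubeRay reports it complete.
kubectl get rayjob -w
```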
| Category | Technologies |
|---|---|
| Cloud Platforms | Azure (AKS), Nebius Cloud |
| Distributed Computing | Ray 2.48–2.53, KubeRay, Ray Data, Ray Train |
| LLM Inference | vLLM, Ray Data LLM |
| LLM Fine-Tuning | LLaMA-Factory (LoRA SFT) |
| Models | Qwen2.5-7B-Instruct, OpenAI CLIP |
| ML Frameworks | PyTorch, HuggingFace Transformers |
| Experiment Tracking | MLflow |
| GPU Scheduling | Kubernetes DRA, NVIDIA DRA Driver |
| Autoscaling | Flex Karpenter |
| GPU Hardware | NVIDIA H100 80GB HBM3 |