This repository is a hands-on MLOps learning exercise that deploys a complete ML training pipeline. You can run the entire stack on AWS (single GPU EC2 instance) or on your own GPU workstation with no cloud account needed. Either way, you will install a lightweight Kubernetes distribution, deploy experiment tracking and artifact storage, build a GPU-accelerated training container, and run a YOLOv5 object detection model -- all automated through shell scripts and fully reproducible from scratch.
- Infrastructure as Code -- Provisioning cloud GPU resources with scripted, repeatable deploys and teardowns.
- Containerized Training -- Building multi-stage Docker images with CUDA, PyTorch, and pinned dependencies.
- Pipeline Orchestration -- Defining and submitting ML workflows with Kubeflow Pipelines.
- Experiment Tracking -- Recording hyperparameters, metrics, and artifacts with MLflow.
- Artifact Management -- Storing model checkpoints and training outputs in MinIO (S3-compatible object storage).
- GPU Scheduling -- Configuring the NVIDIA device plugin so Kubernetes can allocate GPU resources to training pods.
- SSH Tunneling -- Accessing remote dashboards securely without opening extra ports.
- Cloud Cost Management -- Running on a single spot-eligible instance with idempotent teardown to avoid surprise bills.
The entire stack runs on one machine (an EC2 instance or your own GPU workstation). Your local machine connects via SSH, and all services are orchestrated by K3s (a lightweight Kubernetes distribution) on the remote host.
```mermaid
graph LR
    A[Local Machine] -->|SSH Tunnel| B[EC2 g4dn.xlarge]
    B --> C[K3s Cluster]
    C --> D[Kubeflow Pipelines]
    C --> E[MLflow Server]
    C --> F[MinIO Object Store]
    D -->|Launches| G[Training Pod]
    G -->|Logs metrics| E
    G -->|Stores artifacts| F
    G -->|GPU via NVIDIA Plugin| H[Tesla T4 GPU]
```
For a detailed walkthrough of every component and how they interact, see Architecture Deep Dive.
| Technology | Role | Version |
|---|---|---|
| AWS EC2 (g4dn.xlarge) | GPU compute instance (Tesla T4, 16 GB VRAM) | -- |
| K3s | Lightweight Kubernetes distribution | Latest stable |
| Kubeflow Pipelines | ML pipeline orchestration | 2.4.1 (standalone) |
| MLflow | Experiment tracking and model registry | 2.14.3 |
| MinIO | S3-compatible artifact storage (from KFP manifests) | Bundled with KFP |
| Docker | Container image builds | Provided by AWS Deep Learning AMI |
| PyTorch | Deep learning framework (CUDA 12.1) | 2.5.1 |
| YOLOv5 | Object detection model | v7.0 |
| KFP SDK | Pipeline compilation and submission | 2.4.0 |
For the AWS path:
- AWS account with permissions to manage EC2 instances, key pairs, security groups, and EBS volumes in `us-east-1`. The `g4dn.xlarge` instance costs approximately $0.526/hr on-demand.
- AWS CLI installed (the setup script will install it into a virtual environment, but you can also use a system-wide installation).
For the local/on-premises path:
- A Linux GPU workstation with NVIDIA drivers, Docker, and NVIDIA Container Toolkit installed. See Module 03: System Setup Guide.
- No AWS account or cloud costs required.
Both paths:
- Local machine running macOS, Linux, or Windows with WSL2 (all shell scripts assume a Unix-like environment; Windows users must run everything inside WSL2).
- Python 3.10+ installed and available as
python3. - SSH client available on the command line.
```
mlops-engineering-101/
|-- local_setup.sh # Deploys full stack on a local GPU workstation (no cloud)
|-- local_connect.sh # SSH tunnel from laptop to workstation dashboards
|-- setup_env.sh # Creates the local Python venv for AWS CLI tooling
|-- deploy.sh # Provisions the EC2 instance and verifies GPU access
|-- step2.sh # Copies files to EC2, runs remote setup, retrieves pipeline
|-- remote_setup.sh # EC2-side: installs K3s, KFP, MLflow, builds training image
|-- run_pipeline.sh # Submits the compiled pipeline to Kubeflow and monitors it
|-- teardown.sh # Terminates all tagged AWS resources and verifies cleanup
|-- pipeline.py # Defines the Kubeflow pipeline (compiled to YAML)
|-- train_wrapper.py # MLflow-integrated wrapper around YOLOv5 train.py
|-- submit_run.py # Python helper that submits a pipeline run via the KFP SDK
|-- Dockerfile # Multi-stage build: CUDA 12.1 + PyTorch 2.5.1 + YOLOv5 v7.0
|-- mlflow.yaml # Kubernetes manifests for the MLflow deployment
|-- requirements-local.txt # Local-side Python deps (awscli, boto3)
|-- requirements-remote.txt # Remote-side Python deps (kfp, kfp-kubernetes)
|-- lib/
| +-- common.sh # Shared config, defaults, AWS helpers, SSH wrappers
|-- k8s/
| |-- nvidia-device-plugin.yaml # NVIDIA device plugin DaemonSet manifest
| +-- gpu-smoke-test.yaml # One-shot pod to validate GPU access in K3s
|-- docs/
| |-- architecture.md # Full architecture deep dive
| |-- pipeline-deep-dive.md # Pipeline compilation and execution explained
| |-- docker-explained.md # Dockerfile walkthrough and container design
| |-- mlflow-explained.md # MLflow setup, tracking, and artifact flow
| |-- troubleshooting.md # Common issues and solutions
| +-- glossary.md # MLOps terminology reference
|-- training/ # MLOps 101 self-paced curriculum (10 modules)
| |-- 00-foundations/ # Python venvs, shell scripting, Git
| |-- 01-docker-deep-dive/ # Containers, Dockerfiles, GPU support
| |-- 02-kubernetes-essentials/ # Pods, Deployments, K3s, GPU scheduling
| |-- 03-ml-workstation-setup/ # Docker dev workflow, GPU sharing, team images, local stack
| |-- 04-mlflow-fundamentals/ # Experiment tracking, local/K8s setup
| |-- 05-kubeflow-pipelines/ # Pipeline authoring, execution, debugging
| |-- 06-remote-build-infrastructure/ # AWS EC2, SSH, cost management
| |-- 07-putting-it-all-together/ # End-to-end, production path, scaling
| |-- 08-data-and-experiment-management/ # DVC, datasets, MLflow team practices
| |-- 09-ai-assisted-development/ # Coding assistants, agents, tools landscape
| +-- 10-local-onprem-setup/ # Deploy on your own GPU workstation (no cloud)
|-- generated/ # Created at runtime; holds compiled pipeline.yaml
+-- .state/                     # Created at runtime; holds deploy.env with instance details
```
```bash
./setup_env.sh                                # 1. Create local Python venv
source .venv/bin/activate && aws configure    # 2. Activate venv, set AWS credentials
./deploy.sh                                   # 3. Provision EC2 GPU instance (~5 min)
./step2.sh                                    # 4. Install K3s, KFP, MLflow, build image (~15 min)
./run_pipeline.sh                             # 5. Submit the training pipeline
```
Windows users (WSL2): Run all commands inside a WSL2 terminal. Ensure `python3`, `ssh`, and `aws` are available inside WSL2, not from the Windows host.
If you have a Linux GPU workstation and do not want to use AWS, you can deploy the entire stack locally. No AWS account, no cloud costs.
```bash
git clone <your-repo-url> mlops-engineering-101
cd mlops-engineering-101
./local_setup.sh    # Deploy K3s + KFP + MLflow (~15 min)
```
After setup completes, the dashboards are available directly on the workstation:
- Kubeflow Pipelines: http://localhost:3000
- MLflow: http://localhost:5000
To submit a training run:
```bash
source .venv/bin/activate
python submit_run.py --host http://127.0.0.1:3000 --pipeline-file generated/pipeline.yaml
```
To connect from a remote laptop via SSH tunnel:
```bash
./local_connect.sh <workstation-ip>
# Then open http://localhost:8080 (KFP) and http://localhost:5000 (MLflow)
```
The same `pipeline.py`, `Dockerfile`, `train_wrapper.py`, and Kubernetes manifests are used in both the AWS and local flows. For the full guide, see Module 10: Local / On-Premises Setup.
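Under the hood, `submit_run.py` talks to the Kubeflow Pipelines API through the KFP SDK. A minimal sketch of that kind of submission is shown below -- illustrative only, with an assumed experiment and run name; the real script also handles arguments and error cases:

```python
# Minimal sketch of a KFP 2.x run submission; illustrative, not the actual submit_run.py.
from kfp import Client

# Port-forwarded KFP API endpoint, as used in this README.
client = Client(host="http://127.0.0.1:3000")

run = client.create_run_from_pipeline_package(
    pipeline_file="generated/pipeline.yaml",
    experiment_name="yolov5-mlops-demo",   # assumed experiment name
    run_name="yolov5-local-run",           # assumed run name
)
print("Submitted run:", run.run_id)
```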
```bash
./setup_env.sh
source .venv/bin/activate
aws configure
```
This creates an isolated Python virtual environment at `.venv/` and installs only the local-side dependencies (`awscli`, `boto3`). It does not install training libraries -- those live inside the Docker container on the remote instance. After activating the venv, configure your AWS credentials with `aws configure` (set the default region to `us-east-1`).
```bash
./deploy.sh
```
What happens:
- Resolves the latest Ubuntu 22.04 Deep Learning GPU AMI in `us-east-1`.
- Creates a tagged key pair (`mlops-101-key`) and security group (`mlops-101-sg`).
- Launches exactly one `g4dn.xlarge` instance with a 100 GB root volume.
- Waits for the instance to pass status checks and accept SSH connections, then confirms `nvidia-smi` works.
- Saves all connection details (instance ID, public IP, key path) to `.state/deploy.env`.
Expect this step to take about 5 minutes. When it completes, you will see the instance's public IP printed to the console.
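To demystify the provisioning step, here is a rough boto3 equivalent of the launch that `deploy.sh` performs with the AWS CLI. It is a simplified sketch under assumptions (the AMI name filter is guessed, and key pair, security group, and readiness checks are omitted), not the script itself:

```python
# Simplified boto3 sketch of the launch deploy.sh performs via the AWS CLI.
# The AMI name filter is an assumption; key pair/security group creation and
# readiness checks are omitted.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Pick the newest Ubuntu 22.04 Deep Learning GPU AMI.
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning*GPU*Ubuntu 22.04*"]}],
)["Images"]
ami_id = max(images, key=lambda img: img["CreationDate"])["ImageId"]

# Launch exactly one g4dn.xlarge with a 100 GB gp3 root volume.
resp = ec2.run_instances(
    ImageId=ami_id,
    InstanceType="g4dn.xlarge",
    KeyName="mlops-101-key",
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"}}
    ],
    TagSpecifications=[
        {"ResourceType": "instance",
         "Tags": [{"Key": "Project", "Value": "mlops-engineering-101"}]}
    ],
)
print("Launched", resp["Instances"][0]["InstanceId"])
```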
```bash
./step2.sh
```
What happens:
- Copies all project files to the EC2 instance via `scp`.
- Runs `remote_setup.sh` on the instance, which:
  - Installs K3s configured with NVIDIA as the default container runtime.
  - Applies the NVIDIA device plugin DaemonSet and runs a GPU smoke test.
  - Builds the training Docker image on the host and imports it into K3s containerd.
  - Deploys Kubeflow Pipelines (standalone) and waits for all components to become ready.
  - Creates the MinIO `mlflow` bucket and deploys the MLflow server into the `kubeflow` namespace.
  - Compiles `pipeline.py` into `generated/pipeline.yaml`.
  - Starts `kubectl port-forward` processes for Kubeflow (port 3000) and MLflow (port 5000).
- Copies the compiled `generated/pipeline.yaml` back to your local machine.
Expect this step to take about 15 minutes on first run. The script prints progress at each stage.
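The compilation step is plain KFP SDK usage: a pipeline function is compiled into a YAML package that the Kubeflow backend can execute. A minimal sketch of the idea (not the actual `pipeline.py` -- the image tag, command, and omitted GPU settings here are assumptions) looks like:

```python
# Minimal KFP v2 compilation sketch; illustrative, not the real pipeline.py.
from kfp import dsl, compiler

@dsl.container_component
def train_yolov5():
    # Image tag and command are assumed; the real component points at the
    # training image built by remote_setup.sh and requests a GPU.
    return dsl.ContainerSpec(
        image="yolov5-train:latest",
        command=["python", "train_wrapper.py"],
    )

@dsl.pipeline(name="yolov5-training-pipeline")
def yolov5_pipeline():
    train_yolov5()

if __name__ == "__main__":
    compiler.Compiler().compile(yolov5_pipeline, "generated/pipeline.yaml")
```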
After `step2.sh` completes, it prints an SSH tunnel command. Run it in a separate terminal:
```bash
source .state/deploy.env
ssh -i "$KEY_PEM_PATH" \
    -L 8080:127.0.0.1:3000 \
    -L 5000:127.0.0.1:5000 \
    "ubuntu@$PUBLIC_IP"
```
Then open in your browser:
- Kubeflow Pipelines UI: http://localhost:8080
- MLflow UI: http://localhost:5000
WSL2 users: The tunnel binds to `localhost` inside WSL2. If your browser runs on the Windows host, this should work automatically. If not, bind to `0.0.0.0` by using `-L 0.0.0.0:8080:127.0.0.1:3000` instead.
Option A -- Scripted (recommended):
```bash
./run_pipeline.sh
```
This submits the compiled pipeline to Kubeflow, creates a run, and polls until completion. Optional flags:
```bash
./run_pipeline.sh --experiment yolov5-mlops-demo --run-name my-run
./run_pipeline.sh --no-wait    # Submit and return immediately
```
Option B -- Manual via the Kubeflow UI:
- Open http://localhost:8080.
- Click Upload Pipeline and select `generated/pipeline.yaml`.
- Create a run with the default parameters.
Once the training run completes, open the MLflow UI at http://localhost:5000. You will find:
- Experiment: The run is logged under the experiment name configured in the pipeline.
- Parameters: Model weights, dataset, epochs, image size, batch size, device.
- Metrics: Training loss, precision, recall, and mAP values parsed from YOLOv5's `results.csv`.
- Artifacts: Training logs, best/last model weights, and any plots generated by YOLOv5.
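The same results can also be pulled programmatically through the tunneled tracking server. A small sketch with the MLflow Python API (the experiment name is an assumption -- use whatever name the pipeline configured):

```python
# Query the MLflow server (via the SSH tunnel) for the latest run; illustrative sketch.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

runs = mlflow.search_runs(
    experiment_names=["yolov5-mlops-demo"],   # assumed experiment name
    order_by=["start_time DESC"],
    max_results=1,
)
if runs.empty:
    print("No runs found yet")
else:
    latest = runs.iloc[0]
    print("Run ID:", latest["run_id"])
    # search_runs returns a DataFrame; metric columns are prefixed with "metrics."
    for col in runs.columns:
        if col.startswith("metrics."):
            print(f"{col} = {latest[col]}")
```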
Always tear down when you are done to stop incurring charges.
```bash
./teardown.sh
```
This terminates the EC2 instance, deletes the key pair, removes the security group, and cleans up any tagged EBS volumes or Elastic IPs. The script verifies each resource is gone before declaring success.
To double-check manually:
```bash
source .state/deploy.env 2>/dev/null || true
aws --region us-east-1 ec2 describe-instances \
  --filters Name=tag:Project,Values=mlops-engineering-101 \
            Name=instance-state-name,Values=pending,running,stopping,stopped
```
Expected result: no non-terminated instances.
Both dashboards require an SSH tunnel because the services bind to 127.0.0.1 on the EC2 instance (no public ports are exposed beyond SSH).
```bash
source .state/deploy.env
ssh -i "$KEY_PEM_PATH" \
    -L 8080:127.0.0.1:3000 \
    -L 5000:127.0.0.1:5000 \
    "ubuntu@$PUBLIC_IP"
```

| Dashboard | Local URL | Remote Port | Purpose |
|---|---|---|---|
| Kubeflow Pipelines | http://localhost:8080 | 3000 | Pipeline runs, DAG visualization, logs |
| MLflow | http://localhost:5000 | 5000 | Experiment tracking, metrics, artifacts |
Leave the SSH session open for as long as you need dashboard access. Closing it will drop the tunnels.
The training container uses the following defaults (hardcoded in `train_wrapper.py` and `pipeline.py`):

| Parameter | Value | Notes |
|---|---|---|
| `weights` | `yolov5s.pt` | YOLOv5 small model (pretrained on COCO) |
| `dataset` | `coco128.yaml` | 128-image subset of COCO (auto-downloaded) |
| `epochs` | `1` | Minimal run for validation; increase for real training |
| `imgsz` | `640` | Input image resolution |
| `batch` | `8` | Batch size tuned for the T4's 16 GB VRAM |
| `workers` | `0` | Dataloader workers; set to 0 for container compatibility |
| `device` | auto | Uses GPU if `torch.cuda.is_available()`, otherwise CPU |
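To make the flow of these defaults concrete, the wrapper's job is essentially to forward them to YOLOv5's `train.py` and record the run in MLflow. The sketch below is illustrative only -- the tracking URI is an assumption, and the real `train_wrapper.py` also parses `results.csv` and uploads the weight files:

```python
# Illustrative sketch of forwarding these defaults to YOLOv5 and logging to MLflow;
# not the actual train_wrapper.py.
import subprocess
import mlflow
import torch

params = {
    "weights": "yolov5s.pt",
    "data": "coco128.yaml",      # the "dataset" default, passed as YOLOv5's --data flag
    "epochs": 1,
    "imgsz": 640,
    "batch": 8,
    "workers": 0,
    "device": "0" if torch.cuda.is_available() else "cpu",
}

# In-cluster MLflow service URL is an assumption.
mlflow.set_tracking_uri("http://mlflow.kubeflow.svc.cluster.local:5000")

with mlflow.start_run():
    mlflow.log_params(params)
    cmd = ["python", "train.py"] + [f"--{key}={value}" for key, value in params.items()]
    subprocess.run(cmd, check=True)
    # A fuller wrapper would then parse YOLOv5's results.csv (e.g. into
    # mlflow.log_metrics) and upload best.pt/last.pt as run artifacts.
```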
Detailed explanations of every component are available in the docs/ directory:
- Architecture Deep Dive -- How all the pieces fit together, from EC2 to training pod.
- Pipeline Deep Dive -- How the Kubeflow pipeline is compiled, submitted, and executed.
- Docker Explained -- Walkthrough of the multi-stage Dockerfile and container design decisions.
- MLflow Explained -- MLflow server setup, tracking integration, and artifact storage via MinIO.
- Troubleshooting -- Comprehensive guide to diagnosing and fixing common issues.
- Glossary -- Definitions of MLOps, Kubernetes, and AWS terminology used in this project.
The training/ directory contains a structured, self-paced curriculum that teaches every concept used in this pipeline from the ground up. It is designed for engineers who can write Python but are new to containers, Kubernetes, and ML pipeline orchestration.
| Module | Topic | Time Estimate |
|---|---|---|
| 00 -- Foundations | Python venvs, shell scripting, Git for ML | 3-4 hr |
| 01 -- Docker Deep Dive | Containers, Dockerfiles, GPU support, best practices | 4-5 hr |
| 02 -- Kubernetes Essentials | Pods, Deployments, Services, K3s, GPU scheduling | 4-5 hr |
| 03 -- ML Workstation Setup | Docker dev workflow, GPU sharing, team images, local MLOps stack | 3-4 hr |
| 04 -- MLflow Fundamentals | Experiment tracking, local setup, K8s deployment, artifacts | 3-4 hr |
| 05 -- Kubeflow Pipelines | Pipeline authoring, compilation, execution, debugging | 4-5 hr |
| 06 -- Remote Build Infrastructure | AWS EC2, SSH tunneling, remote Docker builds, cost control | 3-4 hr |
| 07 -- Putting It All Together | End-to-end walkthrough, architecture decisions, production path, scaling | 3-4 hr |
| 08 -- Data & Experiment Management | DVC, dataset organization, MLflow team practices, model versioning | 2-3 hr |
| 09 -- AI-Assisted Development | Coding assistants, agent basics, modern tools landscape | 1.5-2 hr |
| 10 -- Local / On-Premises Setup | Deploy on your own GPU workstation, multi-node clusters | 4-5 hr |
Start with the Learning Path Overview for recommended order and prerequisites. Modules 0-5 are sufficient for the local setup path (Module 10). AWS modules (6-7) are only needed if you plan to use cloud infrastructure. Each module follows the pattern: Analogy -> Concept -> Hands-On -> Connect to the Pipeline.
| Resource | Cost | Notes |
|---|---|---|
| `g4dn.xlarge` (on-demand) | ~$0.526/hr | Tesla T4 GPU, 4 vCPU, 16 GB RAM |
| 100 GB gp3 EBS volume | ~$0.08/GB/month | Included with the instance |
| Data transfer | Minimal | Only small model files and logs |
Practical guidance:
- A full setup-train-teardown cycle takes roughly 25-30 minutes, costing under $0.30.
- Always run `./teardown.sh` when you are finished. A forgotten instance costs ~$12.60/day.
- Set up an AWS Billing Alarm to alert you if charges exceed a threshold (e.g., $5).
- Consider using AWS Free Tier or credits if available through your course or organization.
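As a quick sanity check on those figures, the arithmetic works out as follows:

```python
# Back-of-envelope cost for one ~30-minute setup-train-teardown cycle.
on_demand_per_hr = 0.526                 # g4dn.xlarge on-demand, us-east-1
cycle_hr = 0.5                           # ~25-30 minutes
ebs_per_hr = 100 * 0.08 / (30 * 24)      # 100 GB gp3 at ~$0.08/GB-month

print(round((on_demand_per_hr + ebs_per_hr) * cycle_hr, 2))   # ~0.27 per cycle
print(round(on_demand_per_hr * 24, 2))                        # ~12.62/day if left running
```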
For the full troubleshooting guide, see docs/troubleshooting.md. The three most common issues are:
1. deploy.sh times out waiting for the instance
The Deep Learning AMI sometimes takes longer to initialize. Re-run ./deploy.sh -- it is idempotent and will detect the existing instance. If the instance is stuck, run ./teardown.sh and start fresh.
2. step2.sh fails during K3s or KFP installation
SSH into the instance and check the logs:
```bash
source .state/deploy.env
ssh -i "$KEY_PEM_PATH" "ubuntu@$PUBLIC_IP"
sudo systemctl status k3s
sudo journalctl -u k3s --no-pager | tail -50
```
Common causes: K3s not yet ready (retry), or the NVIDIA runtime not configured (the AMI must include NVIDIA drivers).
3. Training pod fails with OOMKilled or GPU errors
Check pod status and logs:
```bash
source .state/deploy.env
ssh -i "$KEY_PEM_PATH" "ubuntu@$PUBLIC_IP" '
  export KUBECONFIG=/etc/rancher/k3s/k3s.yaml &&
  kubectl -n kubeflow get pods --sort-by=.metadata.creationTimestamp &&
  kubectl -n kubeflow logs $(kubectl -n kubeflow get pods --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
'
```
If the pod is OOMKilled, reduce the batch size. If the GPU is not detected, verify the NVIDIA device plugin is running: `kubectl -n kube-system get pods | grep nvidia`.
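If the device plugin looks healthy but training still reports no GPU, it can also help to check CUDA visibility with the training image's own Python interpreter (for example via a one-off pod or `docker run`). This is an illustrative snippet, not a file in the repo:

```python
# Quick CUDA visibility check to run inside the training container; illustrative only.
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))   # expect "Tesla T4" on g4dn.xlarge
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print("vram (GB):", round(vram_gb, 1))
else:
    print("No GPU visible -- check the NVIDIA device plugin and the pod's GPU resource request")
```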
- Single-node only. This is a learning setup, not a production cluster. There is no high availability, auto-scaling, or multi-node scheduling.
- MLflow uses SQLite. The MLflow backend store is a SQLite database on a single PVC. It is not designed for concurrent access or high availability.
- MinIO is bundled with KFP. The MinIO instance comes from the Kubeflow Pipelines standalone manifests and is not independently configured or scaled.
- AWS path depends on the Deep Learning AMI. The remote setup assumes NVIDIA drivers and Docker GPU support are pre-installed by the AMI. The local path (`local_setup.sh`) requires you to install these prerequisites yourself (see Module 03).
- No authentication. The Kubeflow and MLflow dashboards have no login or access control. They are only accessible through the SSH tunnel.
This project is provided as a learning exercise. See the repository for any applicable license terms.
Contributions, bug reports, and suggestions are welcome. If you find an issue or have an improvement:
- Open an issue describing the problem or idea.
- Fork the repository and make your changes.
- Submit a pull request with a clear description of what changed and why.
Please keep changes focused and test them against a fresh deploy cycle (setup_env.sh through teardown.sh) before submitting.