MLOps 101: End-to-End ML Pipeline

This repository is a hands-on MLOps learning exercise in which you deploy a complete ML training pipeline. You can run the entire stack on AWS (a single GPU EC2 instance) or on your own GPU workstation with no cloud account needed. Either way, you will install a lightweight Kubernetes distribution, deploy experiment tracking and artifact storage, build a GPU-accelerated training container, and train a YOLOv5 object detection model -- all automated through shell scripts and fully reproducible from scratch.


What You Will Learn

  • Infrastructure as Code -- Provisioning cloud GPU resources with scripted, repeatable deploys and teardowns.
  • Containerized Training -- Building multi-stage Docker images with CUDA, PyTorch, and pinned dependencies.
  • Pipeline Orchestration -- Defining and submitting ML workflows with Kubeflow Pipelines.
  • Experiment Tracking -- Recording hyperparameters, metrics, and artifacts with MLflow.
  • Artifact Management -- Storing model checkpoints and training outputs in MinIO (S3-compatible object storage).
  • GPU Scheduling -- Configuring the NVIDIA device plugin so Kubernetes can allocate GPU resources to training pods.
  • SSH Tunneling -- Accessing remote dashboards securely without opening extra ports.
  • Cloud Cost Management -- Running on a single spot-eligible instance with idempotent teardown to avoid surprise bills.

Architecture Overview

The entire stack runs on one machine (an EC2 instance or your own GPU workstation). Your local machine connects via SSH, and all services are orchestrated by K3s (a lightweight Kubernetes distribution) on the remote host.

graph LR
    A[Local Machine] -->|SSH Tunnel| B[EC2 g4dn.xlarge]
    B --> C[K3s Cluster]
    C --> D[Kubeflow Pipelines]
    C --> E[MLflow Server]
    C --> F[MinIO Object Store]
    D -->|Launches| G[Training Pod]
    G -->|Logs metrics| E
    G -->|Stores artifacts| F
    G -->|GPU via NVIDIA Plugin| H[Tesla T4 GPU]

For a detailed walkthrough of every component and how they interact, see Architecture Deep Dive.


Tech Stack

| Technology | Role | Version |
| --- | --- | --- |
| AWS EC2 (g4dn.xlarge) | GPU compute instance (Tesla T4, 16 GB VRAM) | -- |
| K3s | Lightweight Kubernetes distribution | Latest stable |
| Kubeflow Pipelines | ML pipeline orchestration | 2.4.1 (standalone) |
| MLflow | Experiment tracking and model registry | 2.14.3 |
| MinIO | S3-compatible artifact storage (from KFP manifests) | Bundled with KFP |
| Docker | Container image builds | Provided by AWS Deep Learning AMI |
| PyTorch | Deep learning framework (CUDA 12.1) | 2.5.1 |
| YOLOv5 | Object detection model | v7.0 |
| KFP SDK | Pipeline compilation and submission | 2.4.0 |

Prerequisites

For the AWS path:

  • AWS account with permissions to manage EC2 instances, key pairs, security groups, and EBS volumes in us-east-1. The g4dn.xlarge instance costs approximately $0.526/hr on-demand.
  • AWS CLI installed (the setup script will install it into a virtual environment, but you can also use a system-wide installation).

For the local/on-premises path:

  • A Linux GPU workstation with NVIDIA drivers, Docker, and NVIDIA Container Toolkit installed. See Module 03: System Setup Guide.
  • No AWS account or cloud costs required.

Both paths:

  • Local machine running macOS, Linux, or Windows with WSL2 (all shell scripts assume a Unix-like environment; Windows users must run everything inside WSL2).
  • Python 3.10+ installed and available as python3.
  • SSH client available on the command line.

Repository Structure

mlops-engineering-101/
|-- local_setup.sh            # Deploys full stack on a local GPU workstation (no cloud)
|-- local_connect.sh          # SSH tunnel from laptop to workstation dashboards
|-- setup_env.sh              # Creates the local Python venv for AWS CLI tooling
|-- deploy.sh                 # Provisions the EC2 instance and verifies GPU access
|-- step2.sh                  # Copies files to EC2, runs remote setup, retrieves pipeline
|-- remote_setup.sh           # EC2-side: installs K3s, KFP, MLflow, builds training image
|-- run_pipeline.sh           # Submits the compiled pipeline to Kubeflow and monitors it
|-- teardown.sh               # Terminates all tagged AWS resources and verifies cleanup
|-- pipeline.py               # Defines the Kubeflow pipeline (compiled to YAML)
|-- train_wrapper.py          # MLflow-integrated wrapper around YOLOv5 train.py
|-- submit_run.py             # Python helper that submits a pipeline run via the KFP SDK
|-- Dockerfile                # Multi-stage build: CUDA 12.1 + PyTorch 2.5.1 + YOLOv5 v7.0
|-- mlflow.yaml               # Kubernetes manifests for the MLflow deployment
|-- requirements-local.txt    # Local-side Python deps (awscli, boto3)
|-- requirements-remote.txt   # Remote-side Python deps (kfp, kfp-kubernetes)
|-- lib/
|   +-- common.sh             # Shared config, defaults, AWS helpers, SSH wrappers
|-- k8s/
|   |-- nvidia-device-plugin.yaml  # NVIDIA device plugin DaemonSet manifest
|   +-- gpu-smoke-test.yaml        # One-shot pod to validate GPU access in K3s
|-- docs/
|   |-- architecture.md       # Full architecture deep dive
|   |-- pipeline-deep-dive.md # Pipeline compilation and execution explained
|   |-- docker-explained.md   # Dockerfile walkthrough and container design
|   |-- mlflow-explained.md   # MLflow setup, tracking, and artifact flow
|   |-- troubleshooting.md    # Common issues and solutions
|   +-- glossary.md           # MLOps terminology reference
|-- training/                 # MLOps 101 self-paced curriculum (10 modules)
|   |-- 00-foundations/       # Python venvs, shell scripting, Git
|   |-- 01-docker-deep-dive/  # Containers, Dockerfiles, GPU support
|   |-- 02-kubernetes-essentials/  # Pods, Deployments, K3s, GPU scheduling
|   |-- 03-ml-workstation-setup/   # Docker dev workflow, GPU sharing, team images, local stack
|   |-- 04-mlflow-fundamentals/    # Experiment tracking, local/K8s setup
|   |-- 05-kubeflow-pipelines/     # Pipeline authoring, execution, debugging
|   |-- 06-remote-build-infrastructure/  # AWS EC2, SSH, cost management
|   |-- 07-putting-it-all-together/      # End-to-end, production path, scaling
|   |-- 08-data-and-experiment-management/  # DVC, datasets, MLflow team practices
|   |-- 09-ai-assisted-development/  # Coding assistants, agents, tools landscape
|   +-- 10-local-onprem-setup/  # Deploy on your own GPU workstation (no cloud)
|-- generated/                # Created at runtime; holds compiled pipeline.yaml
+-- .state/                   # Created at runtime; holds deploy.env with instance details

Quick Start: AWS (5 Commands)

./setup_env.sh                                    # 1. Create local Python venv
source .venv/bin/activate && aws configure         # 2. Activate venv, set AWS credentials
./deploy.sh                                        # 3. Provision EC2 GPU instance (~5 min)
./step2.sh                                         # 4. Install K3s, KFP, MLflow, build image (~15 min)
./run_pipeline.sh                                  # 5. Submit the training pipeline

Windows users (WSL2): Run all commands inside a WSL2 terminal. Ensure python3, ssh, and aws are available inside WSL2, not from the Windows host.


Quick Start: Local Workstation (No Cloud Required)

If you have a Linux GPU workstation and do not want to use AWS, you can deploy the entire stack locally. No AWS account, no cloud costs.

git clone <your-repo-url> mlops-engineering-101
cd mlops-engineering-101
./local_setup.sh                                   # Deploy K3s + KFP + MLflow (~15 min)

After setup completes, the dashboards are available directly on the workstation: Kubeflow Pipelines at http://localhost:3000 and MLflow at http://localhost:5000.

To submit a training run:

source .venv/bin/activate
python submit_run.py --host http://127.0.0.1:3000 --pipeline-file generated/pipeline.yaml

To connect from a remote laptop via SSH tunnel:

./local_connect.sh <workstation-ip>
# Then open http://localhost:8080 (KFP) and http://localhost:5000 (MLflow)

The same pipeline.py, Dockerfile, train_wrapper.py, and Kubernetes manifests are used in both the AWS and local flows. For the full guide, see Module 10: Local / On-Premises Setup.


Step-by-Step Walkthrough (AWS Path)

Step 1: Local Environment Setup

./setup_env.sh
source .venv/bin/activate
aws configure

This creates an isolated Python virtual environment at .venv/ and installs only the local-side dependencies (awscli, boto3). It does not install training libraries -- those live inside the Docker container on the remote instance. After activating the venv, configure your AWS credentials with aws configure (set the default region to us-east-1).

Step 2: Provision the EC2 GPU Instance

./deploy.sh

What happens:

  • Resolves the latest Ubuntu 22.04 Deep Learning GPU AMI in us-east-1.
  • Creates a tagged key pair (mlops-101-key) and security group (mlops-101-sg).
  • Launches exactly one g4dn.xlarge instance with a 100 GB root volume.
  • Waits for the instance to pass status checks and accept SSH connections, then confirms that nvidia-smi works.
  • Saves all connection details (instance ID, public IP, key path) to .state/deploy.env.

Expect this step to take about 5 minutes. When it completes, you will see the instance's public IP printed to the console.
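
If you want to see what the provisioning logic boils down to, here is a minimal boto3 sketch of the same flow -- resolve an AMI, launch one tagged g4dn.xlarge, and wait for status checks. The AMI name filter and exact parameters are illustrative assumptions; deploy.sh is the authoritative implementation.

# Hypothetical sketch of the deploy.sh flow using boto3; filters and parameters are assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Resolve a recent Deep Learning GPU AMI (name pattern is illustrative only).
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning*Ubuntu 22.04*"]}],
)["Images"]
ami_id = max(images, key=lambda img: img["CreationDate"])["ImageId"]

# Launch exactly one tagged g4dn.xlarge with a 100 GB gp3 root volume.
instance_id = ec2.run_instances(
    ImageId=ami_id,
    InstanceType="g4dn.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="mlops-101-key",
    BlockDeviceMappings=[{"DeviceName": "/dev/sda1",
                          "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"}}],
    TagSpecifications=[{"ResourceType": "instance",
                        "Tags": [{"Key": "Project", "Value": "mlops-engineering-101"}]}],
)["Instances"][0]["InstanceId"]

# Block until the instance passes both EC2 status checks, then print its public IP.
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
print(reservation["Instances"][0]["PublicIpAddress"])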

Step 3: Build the Remote Stack

./step2.sh

What happens:

  • Copies all project files to the EC2 instance via scp.
  • Runs remote_setup.sh on the instance, which:
    • Installs K3s configured with NVIDIA as the default container runtime.
    • Applies the NVIDIA device plugin DaemonSet and runs a GPU smoke test.
    • Builds the training Docker image on the host and imports it into K3s containerd.
    • Deploys Kubeflow Pipelines (standalone) and waits for all components to become ready.
    • Creates the MinIO mlflow bucket and deploys the MLflow server into the kubeflow namespace.
    • Compiles pipeline.py into generated/pipeline.yaml.
    • Starts kubectl port-forward processes for Kubeflow (port 3000) and MLflow (port 5000).
  • Copies the compiled generated/pipeline.yaml back to your local machine.

Expect this step to take about 15 minutes on first run. The script prints progress at each stage.
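
The final compile step uses the KFP 2.x SDK. The snippet below is a rough sketch of what a pipeline.py of this shape looks like -- the component body, image tag, and GPU request are illustrative assumptions, not the repository's actual definition:

# Illustrative KFP 2.x pipeline definition and compile call (image tag and component body are assumptions).
from kfp import compiler, dsl

@dsl.container_component
def train_yolov5():
    # Runs the MLflow-integrated wrapper inside the locally built training image.
    return dsl.ContainerSpec(image="yolov5-train:latest",
                             command=["python", "train_wrapper.py"])

@dsl.pipeline(name="yolov5-training")
def training_pipeline():
    task = train_yolov5()
    # Request one GPU so the NVIDIA device plugin places the pod on the T4.
    task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)

compiler.Compiler().compile(training_pipeline, package_path="generated/pipeline.yaml")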

Step 4: Open Dashboards via SSH Tunnel

After step2.sh completes, it prints an SSH tunnel command. Run it in a separate terminal:

source .state/deploy.env
ssh -i "$KEY_PEM_PATH" \
  -L 8080:127.0.0.1:3000 \
  -L 5000:127.0.0.1:5000 \
  "ubuntu@$PUBLIC_IP"

Then open http://localhost:8080 (Kubeflow Pipelines) and http://localhost:5000 (MLflow) in your browser.

WSL2 users: The tunnel binds to localhost inside WSL2. If your browser runs on the Windows host, this should work automatically. If not, bind to 0.0.0.0 by adding -L 0.0.0.0:8080:127.0.0.1:3000 instead.

Step 5: Run the Pipeline

Option A -- Scripted (recommended):

./run_pipeline.sh

This submits the compiled pipeline to Kubeflow, creates a run, and polls until completion. Optional flags:

./run_pipeline.sh --experiment yolov5-mlops-demo --run-name my-run
./run_pipeline.sh --no-wait    # Submit and return immediately
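
Under the hood, both run_pipeline.sh and submit_run.py drive the KFP SDK. A minimal sketch of that submission path (the host, experiment, and run names come from the examples above; everything else is assumed):

# Sketch of a KFP SDK submission, mirroring what submit_run.py does (details assumed).
import kfp

# KFP API endpoint: the workstation port-forward; on the AWS path, use the tunnelled http://localhost:8080.
client = kfp.Client(host="http://127.0.0.1:3000")

run = client.create_run_from_pipeline_package(
    pipeline_file="generated/pipeline.yaml",
    experiment_name="yolov5-mlops-demo",
    run_name="my-run",
)

# Poll until the run finishes; skipping this call is the equivalent of --no-wait.
client.wait_for_run_completion(run.run_id, timeout=3600)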

Option B -- Manual via the Kubeflow UI:

  1. Open http://localhost:8080.
  2. Click Upload Pipeline and select generated/pipeline.yaml.
  3. Create a run with the default parameters.

Step 6: Inspect Results in MLflow

Once the training run completes, open the MLflow UI at http://localhost:5000. You will find:

  • Experiment: The run is logged under the experiment name configured in the pipeline.
  • Parameters: Model weights, dataset, epochs, image size, batch size, device.
  • Metrics: Training loss, precision, recall, mAP values parsed from YOLOv5's results.csv.
  • Artifacts: Training logs, best/last model weights, and any plots generated by YOLOv5.
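
All of this is recorded by train_wrapper.py through the standard MLflow API. A hedged sketch of that integration (tracking URI, MinIO endpoint, and metric names are assumptions based on the setup described above):

# Rough sketch of the MLflow logging flow in train_wrapper.py (endpoints and names are assumptions).
import os
import mlflow

# Artifacts go to MinIO through its S3-compatible API; this is the usual MLflow/MinIO pattern.
os.environ.setdefault("MLFLOW_S3_ENDPOINT_URL", "http://minio-service.kubeflow:9000")
mlflow.set_tracking_uri("http://mlflow.kubeflow:5000")
mlflow.set_experiment("yolov5-mlops-demo")

with mlflow.start_run():
    # Hyperparameters shown in the Training Defaults table below.
    mlflow.log_params({"weights": "yolov5s.pt", "data": "coco128.yaml",
                       "epochs": 1, "imgsz": 640, "batch": 8})

    # ... launch YOLOv5 training, then parse results.csv for the final metrics ...
    mlflow.log_metric("mAP_0.5", 0.0)   # placeholder; the real value comes from results.csv

    # Upload weights, logs, and plots produced by the YOLOv5 run.
    mlflow.log_artifacts("runs/train/exp", artifact_path="yolov5")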

Step 7: Tear Down Resources

Always tear down when you are done to stop incurring charges.

./teardown.sh

This terminates the EC2 instance, deletes the key pair, removes the security group, and cleans up any tagged EBS volumes or Elastic IPs. The script verifies each resource is gone before declaring success.

To double-check manually:

source .state/deploy.env 2>/dev/null || true
aws --region us-east-1 ec2 describe-instances \
  --filters Name=tag:Project,Values=mlops-engineering-101 \
            Name=instance-state-name,Values=pending,running,stopping,stopped

Expected result: no non-terminated instances.


Accessing the Dashboards

Both dashboards require an SSH tunnel because the services bind to 127.0.0.1 on the EC2 instance (no public ports are exposed beyond SSH).

source .state/deploy.env
ssh -i "$KEY_PEM_PATH" \
  -L 8080:127.0.0.1:3000 \
  -L 5000:127.0.0.1:5000 \
  "ubuntu@$PUBLIC_IP"

| Dashboard | Local URL | Remote Port | Purpose |
| --- | --- | --- | --- |
| Kubeflow Pipelines | http://localhost:8080 | 3000 | Pipeline runs, DAG visualization, logs |
| MLflow | http://localhost:5000 | 5000 | Experiment tracking, metrics, artifacts |

Leave the SSH session open for as long as you need dashboard access. Closing it will drop the tunnels.
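
With the tunnel up, you can also confirm the MLflow API is reachable from your laptop without opening the browser (a quick sketch; it only assumes the tunnel above is running):

# List experiments through the tunnelled MLflow endpoint as a connectivity check.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")
for experiment in client.search_experiments():
    print(experiment.experiment_id, experiment.name)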


Training Defaults

The training container uses the following defaults (hardcoded in train_wrapper.py and pipeline.py):

| Parameter | Value | Notes |
| --- | --- | --- |
| weights | yolov5s.pt | YOLOv5 small model (pretrained on COCO) |
| dataset | coco128.yaml | 128-image subset of COCO (auto-downloaded) |
| epochs | 1 | Minimal run for validation; increase for real training |
| imgsz | 640 | Input image resolution |
| batch | 8 | Batch size tuned for T4 16 GB VRAM |
| workers | 0 | Dataloader workers; set to 0 for container compatibility |
| device | auto | Uses GPU if torch.cuda.is_available(), otherwise CPU |
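
For orientation, this is roughly how a wrapper like train_wrapper.py can expose those defaults and hand them to YOLOv5's train.py. The argument plumbing and paths here are assumptions; only the default values come from the table:

# Illustrative wrapper around YOLOv5 train.py; only the defaults mirror the table above.
import argparse
import subprocess
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--weights", default="yolov5s.pt")
parser.add_argument("--data", default="coco128.yaml")
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--imgsz", type=int, default=640)
parser.add_argument("--batch", type=int, default=8)
parser.add_argument("--workers", type=int, default=0)
args = parser.parse_args()

# "auto" device selection: first GPU when CUDA is available, otherwise CPU.
device = "0" if torch.cuda.is_available() else "cpu"

subprocess.run([
    "python", "train.py",
    "--weights", args.weights, "--data", args.data,
    "--epochs", str(args.epochs), "--imgsz", str(args.imgsz),
    "--batch-size", str(args.batch), "--workers", str(args.workers),
    "--device", device,
], check=True, cwd="/yolov5")   # the /yolov5 checkout path is an assumption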

Documentation

Detailed explanations of every component are available in the docs/ directory:

  • Architecture Deep Dive -- How all the pieces fit together, from EC2 to training pod.
  • Pipeline Deep Dive -- How the Kubeflow pipeline is compiled, submitted, and executed.
  • Docker Explained -- Walkthrough of the multi-stage Dockerfile and container design decisions.
  • MLflow Explained -- MLflow server setup, tracking integration, and artifact storage via MinIO.
  • Troubleshooting -- Comprehensive guide to diagnosing and fixing common issues.
  • Glossary -- Definitions of MLOps, Kubernetes, and AWS terminology used in this project.

Learning Path (MLOps 101 Curriculum)

The training/ directory contains a structured, self-paced curriculum that teaches every concept used in this pipeline from the ground up. It is designed for engineers who can write Python but are new to containers, Kubernetes, and ML pipeline orchestration.

| Module | Topic | Time Estimate |
| --- | --- | --- |
| 00 -- Foundations | Python venvs, shell scripting, Git for ML | 3-4 hr |
| 01 -- Docker Deep Dive | Containers, Dockerfiles, GPU support, best practices | 4-5 hr |
| 02 -- Kubernetes Essentials | Pods, Deployments, Services, K3s, GPU scheduling | 4-5 hr |
| 03 -- ML Workstation Setup | Docker dev workflow, GPU sharing, team images, local MLOps stack | 3-4 hr |
| 04 -- MLflow Fundamentals | Experiment tracking, local setup, K8s deployment, artifacts | 3-4 hr |
| 05 -- Kubeflow Pipelines | Pipeline authoring, compilation, execution, debugging | 4-5 hr |
| 06 -- Remote Build Infrastructure | AWS EC2, SSH tunneling, remote Docker builds, cost control | 3-4 hr |
| 07 -- Putting It All Together | End-to-end walkthrough, architecture decisions, production path, scaling | 3-4 hr |
| 08 -- Data & Experiment Management | DVC, dataset organization, MLflow team practices, model versioning | 2-3 hr |
| 09 -- AI-Assisted Development | Coding assistants, agent basics, modern tools landscape | 1.5-2 hr |
| 10 -- Local / On-Premises Setup | Deploy on your own GPU workstation, multi-node clusters | 4-5 hr |

Start with the Learning Path Overview for recommended order and prerequisites. Modules 0-5 are sufficient for the local setup path (Module 10). AWS modules (6-7) are only needed if you plan to use cloud infrastructure. Each module follows the pattern: Analogy -> Concept -> Hands-On -> Connect to the Pipeline.


Cost Awareness

| Resource | Cost | Notes |
| --- | --- | --- |
| g4dn.xlarge (on-demand) | ~$0.526/hr | Tesla T4 GPU, 4 vCPU, 16 GB RAM |
| 100 GB gp3 EBS volume | ~$0.08/GB/month | Included with the instance |
| Data transfer | Minimal | Only small model files and logs |

Practical guidance:

  • A full setup-train-teardown cycle takes roughly 25-30 minutes, costing under $0.30.
  • Always run ./teardown.sh when you are finished. A forgotten instance costs ~$12.60/day.
  • Set up an AWS Billing Alarm to alert you if charges exceed a threshold (e.g., $5).
  • Consider using AWS Free Tier or credits if available through your course or organization.
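
Those figures follow directly from the hourly rate:

# Back-of-the-envelope cost check using the on-demand rate above.
HOURLY_RATE = 0.526                                    # USD/hr for g4dn.xlarge on-demand
print(f"30-minute cycle: ${HOURLY_RATE * 0.5:.2f}")    # ~$0.26, i.e. under $0.30
print(f"Forgotten for 24 h: ${HOURLY_RATE * 24:.2f}")  # ~$12.62/day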

Troubleshooting

For the full troubleshooting guide, see docs/troubleshooting.md. The three most common issues are:

1. deploy.sh times out waiting for the instance

The Deep Learning AMI sometimes takes longer to initialize. Re-run ./deploy.sh -- it is idempotent and will detect the existing instance. If the instance is stuck, run ./teardown.sh and start fresh.

2. step2.sh fails during K3s or KFP installation

SSH into the instance and check the logs:

source .state/deploy.env
ssh -i "$KEY_PEM_PATH" "ubuntu@$PUBLIC_IP"
sudo systemctl status k3s
sudo journalctl -u k3s --no-pager | tail -50

Common causes: K3s not yet ready (retry), or the NVIDIA runtime not configured (the AMI must include NVIDIA drivers).

3. Training pod fails with OOMKilled or GPU errors

Check pod status and logs:

source .state/deploy.env
ssh -i "$KEY_PEM_PATH" "ubuntu@$PUBLIC_IP" '
  export KUBECONFIG=/etc/rancher/k3s/k3s.yaml &&
  kubectl -n kubeflow get pods --sort-by=.metadata.creationTimestamp &&
  kubectl -n kubeflow logs $(kubectl -n kubeflow get pods --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
'

If the pod is OOMKilled, reduce batch size. If the GPU is not detected, verify the NVIDIA device plugin is running: kubectl -n kube-system get pods | grep nvidia.
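
Before digging further, it can help to confirm that PyTorch inside the training container actually sees the T4. A quick generic check you can run with the training image's Python interpreter (the snippet assumes nothing beyond PyTorch being installed):

# Generic GPU sanity check for the training container.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free VRAM: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")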


Known Limitations

  • Single-node only. This is a learning setup, not a production cluster. There is no high availability, auto-scaling, or multi-node scheduling.
  • MLflow uses SQLite. The MLflow backend store is a SQLite database on a single PVC. It is not designed for concurrent access or high availability.
  • MinIO is bundled with KFP. The MinIO instance comes from the Kubeflow Pipelines standalone manifests and is not independently configured or scaled.
  • AWS path depends on the Deep Learning AMI. The remote setup assumes NVIDIA drivers and Docker GPU support are pre-installed by the AMI. The local path (local_setup.sh) requires you to install these prerequisites yourself (see Module 03).
  • No authentication. The Kubeflow and MLflow dashboards have no login or access control. They are only accessible through the SSH tunnel.

License

This project is provided as a learning exercise. See the repository for any applicable license terms.


Contributing

Contributions, bug reports, and suggestions are welcome. If you find an issue or have an improvement:

  1. Open an issue describing the problem or idea.
  2. Fork the repository and make your changes.
  3. Submit a pull request with a clear description of what changed and why.

Please keep changes focused and test them against a fresh deploy cycle (setup_env.sh through teardown.sh) before submitting.
