This document explains how MLflow is used in this project for experiment tracking. It is written for ML practitioners who currently track experiments in spreadsheets, notebooks, or ad-hoc notes and want to understand a production-grade approach.
If you have ever trained a model, tweaked a hyperparameter, and then asked yourself "wait, which run actually had the better mAP?", you already understand the problem. Manual tracking -- whether in a spreadsheet, a lab notebook, or a folder of screenshots -- breaks down quickly:
- The "which run was it?" problem. After dozens of experiments you cannot reliably recall which combination of learning rate, batch size, image size, and augmentation settings produced the best result.
- Reproducibility. Rerunning a past experiment requires knowing the exact code version, data version, and hyperparameters. A spreadsheet rarely captures all of these.
- Collaboration. When multiple people work on the same model, there is no single source of truth. Results live on different laptops and Slack threads.
- Artifact management. Training produces many files -- weight checkpoints, logs, confusion matrices, sample predictions. These need to be stored alongside the metadata that produced them.
MLflow solves these problems by providing a centralized tracking server where every training run is recorded automatically with its parameters, metrics, artifacts, and metadata.
Before looking at the project code, here are the key MLflow abstractions:
| Concept | What it is |
|---|---|
| Tracking URI | The address of the MLflow server. All API calls (logging params, metrics, artifacts) go here. In this project it is http://mlflow-service.kubeflow.svc.cluster.local:5000. |
| Experiment | A named group of runs that belong together. Think of it as a project folder. This project uses yolov5-coco128-demo. |
| Run | A single execution of training code. Each run gets a unique ID and records everything that happened during that execution. |
| Parameters | Key-value pairs that describe the configuration of a run (e.g., epochs=1, batch=8, imgsz=640). Logged once at the start. |
| Metrics | Numeric measurements that describe the outcome of a run (e.g., metrics/mAP_0.5, loss values). Can be logged at each step or once at the end. |
| Artifacts | Files produced by the run -- model weights, training logs, plots, CSV files. Stored in an artifact backend (MinIO in this project). |
| Tags | Free-form key-value metadata attached to a run (e.g., device=cpu, dataset=coco128.yaml). Useful for filtering and searching. |
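To make these abstractions concrete, here is a minimal sketch of how they map onto the client API, using this project's tracking URI and experiment name (the run name, parameter, metric, and file are illustrative):

```python
from pathlib import Path

import mlflow

mlflow.set_tracking_uri("http://mlflow-service.kubeflow.svc.cluster.local:5000")
mlflow.set_experiment("yolov5-coco128-demo")          # experiment: a folder of runs

with mlflow.start_run(run_name="concepts-demo"):      # run: one training execution
    mlflow.set_tags({"device": "cpu"})                # tag: searchable metadata
    mlflow.log_param("epochs", 1)                     # parameter: the configuration
    mlflow.log_metric("metrics/mAP_0.5", 0.42)        # metric: the outcome
    Path("note.txt").write_text("sample artifact\n")  # create a file to upload
    mlflow.log_artifact("note.txt", artifact_path="logs")  # artifact: a produced file
```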
The MLflow server is defined in mlflow.yaml and runs inside the Kubernetes cluster. The deployment has three main components: a Deployment that runs the server process, a PersistentVolumeClaim that holds experiment metadata, and a Service that exposes the server to other pods.
The Deployment uses the python:3.10-slim base image and installs MLflow at
container startup:

```yaml
command:
  - /bin/bash
  - -lc
args:
  - >-
    pip install --no-cache-dir mlflow==2.14.3 boto3 &&
    exec mlflow server
    --host 0.0.0.0
    --port 5000
    --backend-store-uri sqlite:////mlflow-data/mlflow.db
    --serve-artifacts
    --artifacts-destination s3://mlflow
```

This approach avoids building a custom Docker image -- the trade-off is a slightly slower pod startup while pip runs.
Experiment metadata (run IDs, parameters, metrics, tags, run status) is stored
in a SQLite database at /mlflow-data/mlflow.db. This path lives on a
Kubernetes PersistentVolumeClaim:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-data
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```

The PVC ensures that metadata survives pod restarts. SQLite is sufficient for a single-user or small-team setup. For production workloads with concurrent writes, you would switch to PostgreSQL or MySQL (for example, --backend-store-uri postgresql://user:password@host:5432/mlflow).
Artifacts (model weights, logs, plots) are stored in a MinIO bucket named
mlflow. MLflow is configured to proxy artifact uploads via --serve-artifacts
and route them to s3://mlflow via --artifacts-destination.
The connection credentials come from a Kubernetes Secret:

```yaml
envFrom:
  - secretRef:
      name: mlflow-minio-credentials
```

This Secret provides the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
MLFLOW_S3_ENDPOINT_URL environment variables that MLflow (and its internal
boto3 client) uses to communicate with MinIO.
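If you want to confirm the credentials actually work, you can talk to MinIO directly with boto3 from any pod that has the same Secret injected; a minimal sketch (the environment variables come from the Secret above):

```python
import os

import boto3

# Build an S3 client with exactly the credentials MLflow will use.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["MLFLOW_S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # should include 'mlflow'
```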
A ClusterIP Service exposes the MLflow server to other pods in the cluster:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: kubeflow
spec:
  type: ClusterIP
  selector:
    app: mlflow
  ports:
    - name: http
      port: 5000
      targetPort: 5000
```

Training pods reach MLflow at http://mlflow-service.kubeflow.svc.cluster.local:5000.
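A quick way to verify that a pod can reach the server is to list experiments through the client API; a minimal sketch:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(
    tracking_uri="http://mlflow-service.kubeflow.svc.cluster.local:5000"
)
# Prints at least the Default experiment on a fresh server.
for exp in client.search_experiments():
    print(exp.experiment_id, exp.name)
```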
The file train_wrapper.py is the entry point for training. It wraps YOLOv5's
train.py and handles all MLflow integration. Here is the lifecycle, step by
step.
The --mlflow-uri argument is passed in from the Kubernetes Job spec. The
wrapper uses it to point the MLflow client at the in-cluster server:

```python
mlflow.set_tracking_uri(args.mlflow_uri)
```

All runs are grouped under a single experiment. If the experiment does not exist yet, MLflow creates it automatically:

```python
mlflow.set_experiment(EXPERIMENT_NAME)
```

where EXPERIMENT_NAME = "yolov5-coco128-demo".
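If you are curious where the experiment's artifacts will end up, you can inspect it after the call; a small sketch (the printed artifact location depends on the server's --serve-artifacts configuration):

```python
import mlflow

exp = mlflow.get_experiment_by_name("yolov5-coco128-demo")
if exp is not None:
    # With --serve-artifacts, this is typically an mlflow-artifacts:/ URI
    # that the server resolves to s3://mlflow behind the scenes.
    print(exp.experiment_id, exp.artifact_location)
```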
Each training invocation starts a new MLflow run, named with a timestamp for easy identification:

```python
timestamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
run_name = f"yolov5-coco128-{timestamp}"
# ...
mlflow.start_run(run_name=run_name)
```

Tags capture metadata that is not a hyperparameter -- things like which device was used and whether a GPU was available:
```python
mlflow.set_tags(
    {
        "project": "mlops-engineering-101",
        "dataset": DEFAULTS["dataset"],
        "weights": DEFAULTS["weights"],
        "device": device,
        "gpu_available": str(device != "cpu").lower(),
    }
)
```

Hyperparameters are logged once, before training starts:
```python
mlflow.log_params(
    {
        "epochs": DEFAULTS["epochs"],
        "imgsz": DEFAULTS["imgsz"],
        "batch": DEFAULTS["batch"],
        "dataset": DEFAULTS["dataset"],
        "weights": DEFAULTS["weights"],
        "device": device,
    }
)
```

YOLOv5's train.py is launched as a subprocess. Output is streamed to both
stdout (for kubectl logs) and a log file (for artifact upload):
```python
stream_command(cmd, log_path)
```

After training completes, YOLOv5 writes per-epoch metrics to results.csv.
The wrapper parses the final row and logs all numeric values:
```python
log_results_csv(results_csv)
```

Inside log_results_csv, metrics are sanitized (see the metric name sanitization below) and logged in one call:

```python
mlflow.log_metrics(metrics)
```

Two types of artifacts are uploaded:
- The training log -- a single file placed under the logs/ artifact path:

  ```python
  mlflow.log_artifact(str(log_path), artifact_path="logs")
  ```

- All training outputs -- the entire run directory (weights, plots, results.csv, etc.) placed under training-output/:

  ```python
  mlflow.log_artifacts(str(run_dir), artifact_path="training-output")
  ```
Note the difference: log_artifact uploads a single file, while
log_artifacts (plural) uploads an entire directory.
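The reverse direction also exists: MLflow can pull a run's artifacts back to a local directory, which is handy for grabbing trained weights. A sketch, with the run ID left as a placeholder:

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow-service.kubeflow.svc.cluster.local:5000")

# Download just the weights directory of a finished run.
local_dir = mlflow.artifacts.download_artifacts(
    run_id="<run-id>",                        # placeholder: copy from the MLflow UI
    artifact_path="training-output/weights",
    dst_path="./downloaded",
)
print(local_dir)  # local path containing best.pt and last.pt
```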
The run is marked as either FINISHED or FAILED:

```python
mlflow.end_run(status="FINISHED")
```

Importantly, even when training fails, the wrapper still uploads whatever
artifacts exist (the log file, any partial outputs) before marking the run as
FAILED. This makes debugging easier because you can inspect the logs from
the MLflow UI rather than hunting through Kubernetes pod logs.
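The control flow behind this is the classic try/finally pattern; a sketch of how it might look (illustrative, not the wrapper's exact code -- it reuses names like stream_command, run_dir, and log_path introduced above):

```python
import sys

import mlflow

try:
    stream_command(cmd, log_path)   # assumed to raise on a non-zero exit code
    log_results_csv(results_csv)
    status = "FINISHED"
except Exception:
    status = "FAILED"
finally:
    # Upload whatever exists, even after a failure, so the logs are
    # inspectable from the MLflow UI.
    if log_path.exists():
        mlflow.log_artifact(str(log_path), artifact_path="logs")
    if run_dir.exists():
        mlflow.log_artifacts(str(run_dir), artifact_path="training-output")
    mlflow.end_run(status=status)

if status == "FAILED":
    sys.exit(1)  # propagate the failure to the Kubernetes Job
```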
YOLOv5 writes metric names with colons in results.csv, for example:

```
metrics/mAP_0.5:0.95
```

MLflow rejects colons in metric names. The wrapper handles this by replacing
: with _:

```python
sanitized_key = key.strip().replace(":", "_")
```

So metrics/mAP_0.5:0.95 becomes metrics/mAP_0.5_0.95 in the MLflow UI.
This is a simple but necessary transformation -- without it, mlflow.log_metrics
raises an error and no metrics are recorded.
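Putting the parsing and sanitization together, a helper in the spirit of log_results_csv might look like this (illustrative; the project's actual implementation may differ):

```python
import csv

import mlflow

def log_results_csv(results_csv):
    """Log the final row of a YOLOv5 results.csv as MLflow metrics."""
    with open(results_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return
    metrics = {}
    for key, value in rows[-1].items():
        try:
            # YOLOv5 pads column names with spaces; strip them, then
            # replace the colons MLflow rejects.
            metrics[key.strip().replace(":", "_")] = float(value)
        except (TypeError, ValueError):
            continue  # skip anything non-numeric
    mlflow.log_metrics(metrics)
```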
The MLflow web UI lets you browse experiments, compare runs, and inspect artifacts. Since the Service is ClusterIP (not exposed externally), you access it through an SSH tunnel.
From your local machine, forward port 5000 through your EC2 bastion host:

```bash
ssh -L 5000:localhost:5000 -N <your-ec2-user>@<your-ec2-ip> &
```

Then, on the EC2 instance, forward from the Kubernetes cluster:

```bash
kubectl port-forward svc/mlflow-service 5000:5000 -n kubeflow
```

Navigate to http://localhost:5000 in your browser. You will see:
- Experiments panel (left sidebar): Lists all experiments. Click yolov5-coco128-demo to see its runs.
- Runs table: Each row is a training run. Columns show parameters, metrics, tags, and timestamps. Click a run name to see its detail page.
- Run detail page: Shows all logged parameters, metrics, and tags. The "Artifacts" tab lets you browse and download uploaded files (logs, weights, plots).
- Compare runs: Select multiple runs with the checkboxes, then click "Compare" to see parameters and metrics side by side.
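The same comparison can be done programmatically, which is useful in scripts; a sketch using mlflow.search_runs over the tunnel (assuming your runs have the sanitized mAP metric described earlier):

```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # via the SSH tunnel above

# Returns a pandas DataFrame, best mAP first. Backticks quote the
# metric name because it contains '/' and '.'.
runs = mlflow.search_runs(
    experiment_names=["yolov5-coco128-demo"],
    order_by=["metrics.`metrics/mAP_0.5_0.95` DESC"],
)
print(runs[["run_id", "metrics.metrics/mAP_0.5_0.95", "params.epochs"]].head())
```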
All artifacts uploaded via mlflow.log_artifact and mlflow.log_artifacts
are stored in the MinIO bucket named mlflow. The directory structure inside
the bucket mirrors the MLflow artifact paths:
```
s3://mlflow/
  <experiment-id>/
    <run-id>/
      artifacts/
        logs/
          train.log
        training-output/
          weights/
            best.pt
            last.pt
          results.csv
          results.png
          confusion_matrix.png
          ...
```
You can also browse these artifacts directly in the MinIO console if you have access, but the MLflow UI is the preferred interface since it links artifacts to their corresponding run metadata.