
< Back to main README | Docs Index


MLflow Experiment Tracking in This Project

This document explains how MLflow is used in this project for experiment tracking. It is written for ML practitioners who currently track experiments in spreadsheets, notebooks, or ad-hoc notes and want to understand a production-grade approach.


1. Why Experiment Tracking?

If you have ever trained a model, tweaked a hyperparameter, and then asked yourself "wait, which run actually had the better mAP?", you already understand the problem. Manual tracking -- whether in a spreadsheet, a lab notebook, or a folder of screenshots -- breaks down quickly:

  • The "which run was it?" problem. After dozens of experiments you cannot reliably recall which combination of learning rate, batch size, image size, and augmentation settings produced the best result.
  • Reproducibility. Rerunning a past experiment requires knowing the exact code version, data version, and hyperparameters. A spreadsheet rarely captures all of these.
  • Collaboration. When multiple people work on the same model, there is no single source of truth. Results live on different laptops and Slack threads.
  • Artifact management. Training produces many files -- weight checkpoints, logs, confusion matrices, sample predictions. These need to be stored alongside the metadata that produced them.

MLflow solves these problems by providing a centralized tracking server where every training run is recorded automatically with its parameters, metrics, artifacts, and metadata.


2. MLflow Core Concepts

Before looking at the project code, here are the key MLflow abstractions:

| Concept | What it is |
| --- | --- |
| Tracking URI | The address of the MLflow server. All API calls (logging params, metrics, artifacts) go here. In this project it is `http://mlflow-service.kubeflow.svc.cluster.local:5000`. |
| Experiment | A named group of runs that belong together. Think of it as a project folder. This project uses `yolov5-coco128-demo`. |
| Run | A single execution of training code. Each run gets a unique ID and records everything that happened during that execution. |
| Parameters | Key-value pairs that describe the configuration of a run (e.g., `epochs=1`, `batch=8`, `imgsz=640`). Logged once at the start. |
| Metrics | Numeric measurements that describe the outcome of a run (e.g., `metrics/mAP_0.5`, loss values). Can be logged at each step or once at the end. |
| Artifacts | Files produced by the run -- model weights, training logs, plots, CSV files. Stored in an artifact backend (MinIO in this project). |
| Tags | Free-form key-value metadata attached to a run (e.g., `device=cpu`, `dataset=coco128.yaml`). Useful for filtering and searching. |
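To make the abstractions concrete, a single run is essentially a structured record combining all of the above. The sketch below is illustrative only (the `run_id` value is made up; MLflow assigns real unique IDs), with values mirroring this project's defaults:

```python
# Illustrative data model: what one MLflow run records.
run = {
    "experiment": "yolov5-coco128-demo",
    "run_id": "a1b2c3d4",  # placeholder; MLflow generates a real unique ID
    "params": {"epochs": 1, "batch": 8, "imgsz": 640},     # config, logged once at start
    "metrics": {"metrics/mAP_0.5": 0.45},                  # numeric outcomes
    "tags": {"device": "cpu", "dataset": "coco128.yaml"},  # free-form metadata
    "artifacts": ["logs/train.log", "training-output/weights/best.pt"],
}

# Parameters describe configuration; metrics must be numeric outcomes.
assert all(isinstance(v, (int, float)) for v in run["metrics"].values())
```

The distinction worth internalizing: parameters and tags are strings you filter on, metrics are numbers you compare and plot, and artifacts are files you download.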

3. How MLflow Is Deployed in This Project

The MLflow server is defined in mlflow.yaml and runs inside the Kubernetes cluster. The deployment has three main components:

3.1 Container Image

The Deployment uses the python:3.10-slim base image and installs MLflow at container startup:

```yaml
command:
- /bin/bash
- -lc
args:
- >-
  pip install --no-cache-dir mlflow==2.14.3 boto3 &&
  exec mlflow server
  --host 0.0.0.0
  --port 5000
  --backend-store-uri sqlite:////mlflow-data/mlflow.db
  --serve-artifacts
  --artifacts-destination s3://mlflow
```

This approach avoids building a custom Docker image -- the trade-off is a slightly slower pod startup while pip runs.

3.2 SQLite Backend Store

Experiment metadata (run IDs, parameters, metrics, tags, run status) is stored in a SQLite database at /mlflow-data/mlflow.db. This path lives on a Kubernetes PersistentVolumeClaim:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-data
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```

The PVC ensures that metadata survives pod restarts. SQLite is sufficient for a single-user or small-team setup. For production workloads with concurrent writes, you would switch to PostgreSQL or MySQL.

3.3 MinIO as S3-Compatible Artifact Storage

Artifacts (model weights, logs, plots) are stored in a MinIO bucket named mlflow. MLflow is configured to proxy artifact uploads with --serve-artifacts and to route them to s3://mlflow via --artifacts-destination.

The connection credentials come from a Kubernetes Secret:

```yaml
envFrom:
- secretRef:
    name: mlflow-minio-credentials
```

This secret provides the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and MLFLOW_S3_ENDPOINT_URL environment variables that MLflow (and its internal boto3 client) use to communicate with MinIO.

3.4 In-Cluster Service

A ClusterIP Service exposes the MLflow server to other pods in the cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  namespace: kubeflow
spec:
  type: ClusterIP
  selector:
    app: mlflow
  ports:
  - name: http
    port: 5000
    targetPort: 5000
```

Training pods reach MLflow at http://mlflow-service.kubeflow.svc.cluster.local:5000.
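That hostname follows the standard Kubernetes DNS pattern for ClusterIP Services: `<service>.<namespace>.svc.cluster.local`. A small sketch (function name is ours, not from the project) of how the tracking URI is assembled from the names in the manifests above:

```python
def in_cluster_uri(service: str, namespace: str, port: int) -> str:
    """Build a Service URL using Kubernetes' cluster DNS naming scheme."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"

uri = in_cluster_uri("mlflow-service", "kubeflow", 5000)
# This is the value passed to training pods as --mlflow-uri.
```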


4. How train_wrapper.py Logs to MLflow

The file train_wrapper.py is the entry point for training. It wraps YOLOv5's train.py and handles all MLflow integration. Here is the lifecycle, step by step.

4.1 Connect to the Tracking Server

The --mlflow-uri argument is passed in from the Kubernetes Job spec. The wrapper uses it to point the MLflow client at the in-cluster server:

```python
mlflow.set_tracking_uri(args.mlflow_uri)
```

4.2 Select the Experiment

All runs are grouped under a single experiment. If the experiment does not exist yet, MLflow creates it automatically:

```python
mlflow.set_experiment(EXPERIMENT_NAME)
```

where EXPERIMENT_NAME = "yolov5-coco128-demo".

4.3 Start a Run

Each training invocation starts a new MLflow run, named with a timestamp for easy identification:

```python
timestamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
run_name = f"yolov5-coco128-{timestamp}"
# ...
mlflow.start_run(run_name=run_name)
```

4.4 Log Tags

Tags capture metadata that is not a hyperparameter -- things like which device was used and whether a GPU was available:

```python
mlflow.set_tags(
    {
        "project": "mlops-engineering-101",
        "dataset": DEFAULTS["dataset"],
        "weights": DEFAULTS["weights"],
        "device": device,
        "gpu_available": str(device != "cpu").lower(),
    }
)
```

4.5 Log Parameters

Hyperparameters are logged once, before training starts:

```python
mlflow.log_params(
    {
        "epochs": DEFAULTS["epochs"],
        "imgsz": DEFAULTS["imgsz"],
        "batch": DEFAULTS["batch"],
        "dataset": DEFAULTS["dataset"],
        "weights": DEFAULTS["weights"],
        "device": device,
    }
)
```

4.6 Run Training

YOLOv5's train.py is launched as a subprocess. Output is streamed to both stdout (for kubectl logs) and a log file (for artifact upload):

```python
stream_command(cmd, log_path)
```

4.7 Log Metrics from results.csv

After training completes, YOLOv5 writes per-epoch metrics to results.csv. The wrapper parses the final row and logs all numeric values:

```python
log_results_csv(results_csv)
```

Inside log_results_csv, metrics are sanitized (see section 5 below) and logged in one call:

```python
mlflow.log_metrics(metrics)
```
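The wrapper's exact implementation is not reproduced here, but the parsing step can be sketched as pure Python (the function name is a hypothetical stand-in for what log_results_csv does internally; the real code may differ in detail). YOLOv5 writes its CSV headers with leading spaces, hence the strip():

```python
import csv
import io

def parse_final_metrics(csv_text: str) -> dict:
    """Take the last row of a YOLOv5 results.csv, keep numeric columns,
    and sanitize names (MLflow rejects ':' in metric keys)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, last = rows[0], rows[-1]
    metrics = {}
    for key, value in zip(header, last):
        try:
            metrics[key.strip().replace(":", "_")] = float(value)
        except ValueError:
            continue  # skip any non-numeric columns
    return metrics

# Abbreviated sample in the shape YOLOv5 produces.
sample = "epoch, metrics/mAP_0.5, metrics/mAP_0.5:0.95\n0, 0.41, 0.27\n"
final = parse_final_metrics(sample)
```

The resulting dict is exactly what gets handed to mlflow.log_metrics in a single call.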

4.8 Log Artifacts

Two types of artifacts are uploaded:

  1. The training log -- a single file placed under the logs/ artifact path:

     ```python
     mlflow.log_artifact(str(log_path), artifact_path="logs")
     ```

  2. All training outputs -- the entire run directory (weights, plots, results.csv, etc.) placed under training-output/:

     ```python
     mlflow.log_artifacts(str(run_dir), artifact_path="training-output")
     ```

Note the difference: log_artifact uploads a single file, while log_artifacts (plural) uploads an entire directory.

4.9 End the Run

The run is marked as either FINISHED or FAILED:

```python
mlflow.end_run(status="FINISHED")
```

Importantly, even when training fails, the wrapper still uploads whatever artifacts exist (the log file, any partial outputs) before marking the run as FAILED. This makes debugging easier because you can inspect the logs from the MLflow UI rather than hunting through Kubernetes pod logs.
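The control flow described above boils down to a try/except/finally pattern. The sketch below is a simplified stand-in with injected callables, not the wrapper's literal code, but it captures the guarantee: artifacts are uploaded and the run is closed with the correct status whether or not training raises.

```python
def tracked_run(train, upload_artifacts, end_run):
    """Run training; always upload whatever artifacts exist and close
    the run with FINISHED or FAILED, even when training raises."""
    status = "FINISHED"
    try:
        train()
    except Exception:
        status = "FAILED"
        raise  # re-raise so the Job still reports failure
    finally:
        upload_artifacts()       # logs and partial outputs survive failures
        end_run(status=status)   # stands in for mlflow.end_run(status=...)
```

Because the upload happens in the finally block, a crash mid-training still leaves the log file browsable in the MLflow UI.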


5. Metric Name Sanitization

YOLOv5 writes metric names with colons in results.csv, for example:

```
metrics/mAP_0.5:0.95
```

MLflow rejects colons in metric names. The wrapper handles this by replacing : with _:

```python
sanitized_key = key.strip().replace(":", "_")
```

So metrics/mAP_0.5:0.95 becomes metrics/mAP_0.5_0.95 in the MLflow UI. This is a simple but necessary transformation -- without it, mlflow.log_metrics raises an error and no metrics are recorded.
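As a concrete check (the helper name here is ours, for illustration; the wrapper applies the same expression inline):

```python
def sanitize_metric_name(key: str) -> str:
    """Strip whitespace and replace ':' , which MLflow's metric-name
    validation rejects, with '_'."""
    return key.strip().replace(":", "_")

assert sanitize_metric_name("metrics/mAP_0.5:0.95") == "metrics/mAP_0.5_0.95"
assert sanitize_metric_name("train/box_loss") == "train/box_loss"  # unchanged
```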


6. Accessing the MLflow UI

The MLflow web UI lets you browse experiments, compare runs, and inspect artifacts. Since the Service is ClusterIP (not exposed externally), you access it through an SSH tunnel.

6.1 Set Up the SSH Tunnel

From your local machine, forward port 5000 through your EC2 bastion host:

```shell
ssh -L 5000:localhost:5000 -N <your-ec2-user>@<your-ec2-ip> &
```

Then, on the EC2 instance, forward from the Kubernetes cluster:

```shell
kubectl port-forward svc/mlflow-service 5000:5000 -n kubeflow
```

6.2 Open the UI

Navigate to http://localhost:5000 in your browser. You will see:

  • Experiments panel (left sidebar): Lists all experiments. Click yolov5-coco128-demo to see its runs.
  • Runs table: Each row is a training run. Columns show parameters, metrics, tags, and timestamps. Click a run name to see its detail page.
  • Run detail page: Shows all logged parameters, metrics, and tags. The "Artifacts" tab lets you browse and download uploaded files (logs, weights, plots).
  • Compare runs: Select multiple runs with the checkboxes, then click "Compare" to see parameters and metrics side by side.

7. Artifacts in MinIO

All artifacts uploaded via mlflow.log_artifact and mlflow.log_artifacts are stored in the MinIO bucket named mlflow. The directory structure inside the bucket mirrors the MLflow artifact paths:

```
s3://mlflow/
  <experiment-id>/
    <run-id>/
      artifacts/
        logs/
          train.log
        training-output/
          weights/
            best.pt
            last.pt
          results.csv
          results.png
          confusion_matrix.png
          ...
```
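Because the layout is deterministic, a run's S3 key prefix can be reconstructed from its experiment and run IDs. A sketch (function name is ours; it assumes the default artifacts/ suffix shown in the tree above), useful when scripting direct access to the bucket:

```python
def artifact_prefix(experiment_id: str, run_id: str) -> str:
    """S3 key prefix under the mlflow bucket where a run's files land."""
    return f"{experiment_id}/{run_id}/artifacts"

prefix = artifact_prefix("1", "a1b2c3d4")
# e.g. pass this as the Prefix= argument to an S3/MinIO list-objects call
```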

You can also browse these artifacts directly in the MinIO console if you have access, but the MLflow UI is the preferred interface since it links artifacts to their corresponding run metadata.


< Back to main README