Docker Development Workflow for ML/DL

This guide covers using Docker as your daily development environment for machine learning and deep learning work. This is not about building images for production -- it is about using Docker containers as your workspace every single day.

If you follow this workflow, you are far less likely to corrupt a shared workstation, you avoid most dependency conflicts, and you rarely hear "it works on my machine."

Why Docker for Daily Development

Most people think of Docker as a deployment tool: build an image, push it, run it in production. But Docker is equally powerful as a development tool, especially on shared GPU workstations.

The problems Docker solves for ML development

Problem 1: Dependency hell. Your project needs PyTorch 2.1 with CUDA 12.1. Your teammate's project needs PyTorch 1.13 with CUDA 11.8. Both of you are on the same machine. Without Docker, one of you loses.

Problem 2: System corruption. Someone runs sudo pip install numpy and overwrites the system numpy. Now the system package manager is broken. With Docker, what you install inside a container stays inside it; the only host paths a container can modify are the ones you explicitly mount.

Problem 3: "Works on my machine." You trained a model, got 94% accuracy, and handed off the code. Your teammate gets 87%. Why? Different package versions. Docker gives you an exact, reproducible environment.

Problem 4: Cleanup. You tried 15 different packages for an experiment. With Docker, you remove the container and everything is gone. Without Docker, those packages are scattered across your home directory or, worse, the system.

Problem 5: Isolation on shared machines. Three people sharing a workstation. Each needs different CUDA toolkit versions, different Python versions, different library versions. Docker gives each person their own isolated environment with zero interference.

Docker vs virtual environments

Virtual environments (venv, conda) solve only the Python package problem. Docker solves:

  • Python packages (like venv)
  • System libraries (apt packages, shared objects)
  • CUDA toolkit version (independent of any toolkit on the host; only the host driver must be new enough)
  • Environment variables
  • Configuration files
  • The entire filesystem layout

Virtual environments are fine for pure Python work. For ML/DL work involving CUDA, system libraries, and complex dependencies, Docker covers what they cannot and is usually the better choice.

The Development Container Pattern

A development container is different from a production container. It is designed for interactive use, fast iteration, and developer comfort.

Dockerfile.dev vs production Dockerfile

Your production Dockerfile copies code into the image and runs a specific command. Your development Dockerfile sets up the environment and lets you work interactively.

# Dockerfile.dev -- development container
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# Prevent interactive prompts during apt install
ENV DEBIAN_FRONTEND=noninteractive

# System packages
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    python3-venv \
    git \
    vim \
    tmux \
    curl \
    wget \
    htop \
    tree \
    && rm -rf /var/lib/apt/lists/*

# Python ML packages (the common ones everyone needs)
RUN pip3 install --no-cache-dir \
    torch==2.1.0 \
    torchvision==0.16.0 \
    numpy \
    pandas \
    scikit-learn \
    matplotlib \
    jupyterlab \
    mlflow \
    tensorboard \
    tqdm \
    pyyaml \
    pillow \
    opencv-python-headless

# Working directory
WORKDIR /workspace

# Default command: bash shell
CMD ["/bin/bash"]

Compare this with a production Dockerfile:

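# Dockerfile -- production container
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# The CUDA runtime image ships without Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r /app/requirements.txt
COPY . /app
WORKDIR /app
CMD ["python3", "train.py"]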

Key differences:

  • Dev uses devel base (includes compilers), production uses runtime (smaller)
  • Dev installs developer tools (vim, tmux, htop), production does not
  • Dev mounts code as a volume, production copies code into the image
  • Dev drops you into bash, production runs a specific command
  • Dev includes JupyterLab, production does not

Core Docker flags for ML development

Every flag you need to know:

# GPU access
--gpus all                    # Pass through all GPUs
--gpus '"device=0"'           # Pass through only GPU 0
--gpus '"device=0,1"'         # Pass through GPUs 0 and 1

# Volume mounts (your code stays on the host, shared into the container)
-v $(pwd):/workspace          # Mount current directory as /workspace
-v /data/datasets:/data       # Mount shared dataset directory (read-write by default)
-v /data/datasets:/data:ro    # Read-only mount for datasets

# Port mapping (access services inside the container from outside)
-p 8888:8888                  # Jupyter
-p 6006:6006                  # TensorBoard
-p 5000:5000                  # MLflow

# Interactive mode
-it                           # Interactive terminal (required for bash)

# Detached mode
-d                            # Run in background

# Container naming
--name alice-yolo-exp1        # Name your container (easier than container IDs)

# Shared memory (CRITICAL for PyTorch DataLoader)
--shm-size=8g                 # Increase shared memory to 8 GB

# User mapping (match your host user ID)
--user $(id -u):$(id -g)      # Run as your host user inside the container

# Environment variables
-e MLFLOW_TRACKING_URI=http://mlflow-server:5000
-e WANDB_API_KEY=your_key_here

# Network
--network host                # Use host network (simplest, no port mapping needed)
--network my-bridge           # Use a custom bridge network

# Resource limits
--cpus 8                      # Limit to 8 CPU cores
--memory 32g                  # Limit to 32 GB RAM

# Hostname
--hostname dev-alice          # Set container hostname
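
If you launch containers from a script rather than by hand, the flags above can be assembled programmatically. The following is a minimal sketch, not part of the original workflow: the helper name docker_run_args and its parameters are hypothetical, and it only builds the argument list (it does not start anything).

```python
import shlex
from typing import List, Sequence


def docker_run_args(
    image: str,
    name: str,
    gpus: str = "all",
    shm_size: str = "8g",
    mounts: Sequence[str] = (),
    ports: Sequence[str] = (),
    interactive: bool = True,
) -> List[str]:
    """Build a `docker run` argument list (hypothetical helper, for illustration)."""
    args = ["docker", "run", "--gpus", gpus, f"--shm-size={shm_size}", "--name", name]
    # -it for an interactive shell, -d for a detached (background) container
    args.append("-it" if interactive else "-d")
    for mount in mounts:  # each entry is "host_path:container_path[:ro]"
        args += ["-v", mount]
    for port in ports:    # each entry is "host_port:container_port"
        args += ["-p", port]
    args.append(image)    # the image name must come after all options
    return args


cmd = docker_run_args(
    image="myteam/ml-base:cuda12.1-py3.10",
    name="alice-yolo-exp1",
    mounts=["/data/datasets:/data:ro"],
    ports=["8888:8888", "6006:6006"],
)
# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

When selecting specific GPUs this way, the inner double quotes are part of the value, so the list entry would be '"device=0,1"' rather than 'device=0,1'.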

Complete Example Workflow

This is the workflow you should follow every day.

Step 1: Write your Dockerfile.dev

Create this file once per project (or once per team and share it):

# Dockerfile.dev
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
    python3 python3-pip python3-venv \
    git vim tmux curl wget htop \
    libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --no-cache-dir \
    torch==2.1.0 torchvision==0.16.0 \
    numpy pandas scikit-learn matplotlib \
    jupyterlab mlflow tensorboard \
    tqdm pyyaml pillow opencv-python-headless

WORKDIR /workspace
CMD ["/bin/bash"]

Step 2: Build the image

docker build -t myteam/ml-base:cuda12.1-py3.10 -f Dockerfile.dev .

This takes 10-20 minutes the first time. After that, Docker caches each layer, so a rebuild reruns only the instruction you changed and the instructions after it.

Step 3: Run the container

docker run \
    --gpus all \
    -it \
    --shm-size=8g \
    -v $(pwd):/workspace \
    -v /data/datasets:/data:ro \
    -p 8888:8888 \
    -p 6006:6006 \
    --name alice-experiment-1 \
    myteam/ml-base:cuda12.1-py3.10

You are now inside the container with a bash prompt. Everything in your current directory is visible at /workspace. All shared datasets are at /data.

Step 4: Work inside the container

# Verify GPU access
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"

# Install experiment-specific packages (only inside this container)
pip3 install ultralytics albumentations

# Run your training
python3 train.py --epochs 50 --batch-size 16

# Or start Jupyter
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root

Any packages you install here exist only in this container. They do not affect the host or any other container.
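
The two GPU checks at the top of this step can be folded into one script that you keep in the project and run at the start of each session. This is a sketch under the assumption that PyTorch may or may not be installed; the function name gpu_report is hypothetical.

```python
import shutil
from typing import Any, Dict


def gpu_report() -> Dict[str, Any]:
    """Collect basic GPU visibility facts without failing if torch is absent."""
    report: Dict[str, Any] = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": False,
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch  # imported lazily so the script also runs in images without it
    except ImportError:
        return report
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


print(gpu_report())
```

If cuda_available is False while nvidia_smi_on_path is True, the container can see the driver but the installed PyTorch build cannot use it, which usually points at a CPU-only wheel or a version mismatch.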

Step 5: When you are done

# Inside the container:
exit

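The container stops. Your code changes are preserved (they are on the host via the volume mount). Packages you installed inside the container stay in the stopped container's filesystem, so docker start -ai alice-experiment-1 would bring them back. Nothing was installed on the host, which is exactly as clean as before.

To remove the stopped container and discard everything installed inside it (add --rm to docker run if you want this to happen automatically on exit):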

docker rm alice-experiment-1

Persistent Development Containers

Sometimes you want a container that persists across sessions. You install extra packages, configure things, and want to come back to the same state tomorrow.

Creating a persistent container

# Start in detached mode (-d instead of -it)
docker run \
    --gpus all \
    -d \
    --shm-size=8g \
    -v $(pwd):/workspace \
    -v /data/datasets:/data:ro \
    -p 8888:8888 \
    -p 6006:6006 \
    --name dev-alice \
    --restart unless-stopped \
    myteam/ml-base:cuda12.1-py3.10 \
    sleep infinity

The sleep infinity command keeps the container running. The --restart unless-stopped flag restarts it automatically if the machine reboots.

Attaching to your persistent container

# Open a shell in the running container
docker exec -it dev-alice bash

# You can open multiple shells at once
# Terminal 1:
docker exec -it dev-alice bash
# Terminal 2:
docker exec -it dev-alice bash

Managing your persistent container

# Stop the container (preserves state)
docker stop dev-alice

# Start it again (state is preserved)
docker start dev-alice

# Check if it is running
docker ps | grep dev-alice

# View logs
docker logs dev-alice

# Remove it permanently (state is lost)
docker stop dev-alice && docker rm dev-alice

When to use persistent vs ephemeral containers

Use persistent containers when:

  • You have a multi-day experiment
  • You installed many packages and do not want to reinstall
  • You are running a long training job in the background
  • You want the same container to always be available

Use ephemeral containers when:

  • You are trying something quick
  • You want a clean environment every time
  • You are testing whether your setup is reproducible
  • You are done with an experiment and want to clean up

Committing container state to an image

If you installed many packages in a persistent container and want to save that state as a new image:

docker commit dev-alice myteam/ml-alice-exp:v1

Now you can create new containers from that snapshot. But prefer putting everything in a Dockerfile instead -- it is more reproducible.

docker-compose for ML Development

For complex setups with multiple services (training + MLflow + Jupyter + TensorBoard), docker-compose is the way to go.

Complete docker-compose.yml for ML development

# docker-compose.yml
version: "3.8"

services:
  # ----- Training / Development container -----
  train:
    build:
      context: .
      dockerfile: Dockerfile.dev
    container_name: ml-train
    command: sleep infinity
    volumes:
      - .:/workspace
      - /data/datasets:/data:ro
      - train-pip-cache:/root/.cache/pip
    ports:
      - "8888:8888"   # Jupyter
      - "6006:6006"   # TensorBoard
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - CUDA_VISIBLE_DEVICES=0
    shm_size: "8g"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    depends_on:
      - mlflow

  # ----- MLflow tracking server -----
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.9.2
    container_name: ml-mlflow
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri sqlite:////mlflow/mlflow.db
      --default-artifact-root /mlflow/artifacts
    volumes:
      - mlflow-data:/mlflow
    ports:
      - "5000:5000"

  # ----- TensorBoard (optional, if you prefer separate) -----
  tensorboard:
    image: tensorflow/tensorflow:latest
    container_name: ml-tensorboard
    command: tensorboard --logdir /logs --host 0.0.0.0 --port 6007
    volumes:
      - ./runs:/logs
    ports:
      - "6007:6007"

  # ----- MinIO (S3-compatible storage for artifacts) -----
  minio:
    image: minio/minio:latest
    container_name: ml-minio
    command: server /data --console-address ":9001"
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin

volumes:
  mlflow-data:
  minio-data:
  train-pip-cache:

Using docker-compose

# Start everything
docker compose up -d

# Check status
docker compose ps

# Attach to the training container
docker exec -it ml-train bash

# View logs from all services
docker compose logs -f

# View logs from one service
docker compose logs -f mlflow

# Stop everything (preserves data)
docker compose stop

# Stop and remove everything (preserves named volumes)
docker compose down

# Stop, remove, and delete all data
docker compose down -v
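
Once the stack is up, it can help to confirm from inside the training container that the tracking server is reachable before starting a long run. The sketch below uses only the standard library. It assumes the MLflow server answers on a /health path and that MLFLOW_TRACKING_URI is set as in the compose file; the function name is hypothetical.

```python
import os
import urllib.error
import urllib.request
from typing import Optional


def tracking_server_reachable(uri: Optional[str] = None, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on <uri>/health succeeds with status 200."""
    # Fall back to the service name used in docker-compose.yml (assumption)
    uri = uri or os.environ.get("MLFLOW_TRACKING_URI", "http://mlflow:5000")
    try:
        with urllib.request.urlopen(uri.rstrip("/") + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error, or malformed URI
        return False
```

Inside the train container, tracking_server_reachable() with no arguments checks the URI from the environment. From the host you would pass "http://localhost:5000" instead, since the mlflow service name resolves only on the compose network.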

Sharing the compose file

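Commit the docker-compose.yml and Dockerfile.dev to your project repo. Every team member clones the repo and runs docker compose up -d. Everyone gets the same environment, as long as you pin package versions and image tags (unpinned packages and :latest tags can drift between builds).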

VS Code Dev Containers Integration

VS Code can connect directly to a Docker container and give you a full IDE experience inside it. This is the most comfortable development workflow.

Setup

  1. Install VS Code extensions:

    • "Remote - SSH" (for connecting to the workstation)
    • "Dev Containers" (for connecting to Docker containers)
  2. Create a .devcontainer/devcontainer.json in your project:

{
    "name": "ML Development",
    "dockerFile": "../Dockerfile.dev",
    "runArgs": [
        "--gpus", "all",
        "--shm-size=8g",
        "-v", "/data/datasets:/data:ro"
    ],
    "forwardPorts": [8888, 6006, 5000],
    "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python",
                "ms-python.vscode-pylance",
                "ms-toolsai.jupyter",
                "mhutchie.git-graph"
            ],
            "settings": {
                "python.defaultInterpreterPath": "/usr/bin/python3",
                "python.linting.enabled": true,
                "terminal.integrated.defaultProfile.linux": "bash"
            }
        }
    },
    "postCreateCommand": "pip install -e .",
    "remoteUser": "root",
    "workspaceMount": "source=${localWorkspaceFolder},target=/workspace,type=bind",
    "workspaceFolder": "/workspace"
}

How it works with Remote SSH + Docker

The full chain:

  1. Your laptop runs VS Code
  2. VS Code connects to the workstation via SSH (Remote - SSH extension)
  3. On the workstation, VS Code starts a Docker container (Dev Containers extension)
  4. You edit code as if it were local, but it runs inside the container on the GPU

This gives you: local IDE comfort + remote GPU power + Docker isolation.

Opening a project in a dev container

  1. SSH into the workstation in VS Code (Remote-SSH: Connect to Host)
  2. Open your project folder
  3. VS Code detects .devcontainer/devcontainer.json and prompts you
  4. Click "Reopen in Container"
  5. VS Code builds the image if needed, starts the container, and connects

Or manually: Command Palette > "Dev Containers: Reopen in Container"

Jupyter Inside Docker

This section covers running JupyterLab inside a Docker container with GPU access.

Starting JupyterLab in a container

# Option 1: Run Jupyter directly
docker run \
    --gpus all \
    -it \
    --shm-size=8g \
    -v $(pwd):/workspace \
    -p 8888:8888 \
    myteam/ml-base:cuda12.1-py3.10 \
    jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root

# Option 2: Start Jupyter inside an existing container
docker exec -it dev-alice \
    jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root

Jupyter prints a URL with a token:

http://127.0.0.1:8888/lab?token=abc123def456...

Accessing from your laptop via SSH tunnel

If the workstation is remote:

# On your laptop:
ssh -L 8888:localhost:8888 user@workstation

# Then open in your browser:
# http://localhost:8888/lab?token=abc123def456...

Persisting notebooks

Because you mounted your project directory as a volume (-v $(pwd):/workspace), any notebooks you create or modify inside /workspace are saved on the host. They survive container restarts and removal.

Notebooks saved outside the mounted volume (for example, in /root or /tmp) are lost when the container is removed.
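
The rule above reduces to a path-prefix check. As a small illustration, assuming /workspace is the only bind mount (as in the earlier docker run commands), a hypothetical helper could look like this:

```python
from pathlib import PurePosixPath
from typing import Sequence


def is_persisted(path: str, mounts: Sequence[str] = ("/workspace",)) -> bool:
    """Return True if `path` is a mount point or lies below one.

    Purely lexical: it does not resolve symlinks or ".." components.
    """
    candidate = PurePosixPath(path)
    for mount in mounts:
        mount_path = PurePosixPath(mount)
        if candidate == mount_path or mount_path in candidate.parents:
            return True
    return False
```

For example, is_persisted("/workspace/notebooks/eda.ipynb") is True, while is_persisted("/root/scratch.ipynb") is False. Comparing whole path components rather than string prefixes keeps a directory such as /workspace2 from being mistaken for the mount.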

Jupyter with a password instead of a token

# Generate a password hash
docker exec -it dev-alice python3 -c \
    "from jupyter_server.auth import passwd; print(passwd('mypassword'))"

# Start Jupyter with the password
docker exec -it dev-alice jupyter lab \
    --ip 0.0.0.0 \
    --port 8888 \
    --no-browser \
    --allow-root \
    --ServerApp.password='argon2:...(the hash from above)...'

Installing Jupyter kernels

If you need multiple Python environments accessible from Jupyter:

# Inside the container, create a venv and register it as a kernel
python3 -m venv /opt/torch21
source /opt/torch21/bin/activate
pip install torch==2.1.0 ipykernel
python -m ipykernel install --name torch21 --display-name "PyTorch 2.1"
deactivate

# Now Jupyter shows "PyTorch 2.1" as a kernel option

Common Gotchas

These issues catch everyone at some point. Read them now so you recognize the symptoms immediately.

File permissions (container user vs host user)

By default, processes inside the container run as root. Files created inside the mounted volume are owned by root on the host.

# Problem: files created in container are owned by root
$ ls -la /workspace/output/
-rw-r--r-- 1 root root 1234 Mar 15 10:00 model.pt

# Solution 1: run container as your user
docker run --user $(id -u):$(id -g) ...

# Solution 2: fix ownership afterwards, from the host
# (inside the container you are root, so $(id -u) there expands to 0)
sudo chown -R $(id -u):$(id -g) ./output/

If you use --user, you may not be able to install packages with pip (no write access to system directories). Workarounds:

  • Install packages in the Dockerfile (preferred)
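  • Use pip install --user (installs to ~/.local; if your UID has no home directory in the image, set HOME to a writable directory, for example -e HOME=/tmp)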
  • Mount a pip cache volume
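
To find out whether a project directory already contains root-owned files from earlier container runs, a short scan on the host is enough. This is a sketch; the function name files_not_owned_by is hypothetical, and the path in the usage note is only an example.

```python
import os
from typing import List


def files_not_owned_by(root: str, uid: int) -> List[str]:
    """List files and directories under `root` whose owner differs from `uid`."""
    foreign: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            # lstat inspects the entry itself rather than following symlinks
            if os.lstat(full).st_uid != uid:
                foreign.append(full)
    return sorted(foreign)
```

On the host, files_not_owned_by("./output", os.getuid()) returns the paths that would need a chown. An empty list means the --user mapping is doing its job.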

Shared memory for PyTorch DataLoader

PyTorch DataLoader with num_workers > 0 uses shared memory (/dev/shm). Docker limits this to 64 MB by default. You will get cryptic errors if your data loading exceeds this.

RuntimeError: DataLoader worker (pid 1234) is killed by signal: Bus error.

Fix:

# Always set --shm-size when using PyTorch
docker run --shm-size=8g ...

# Or in docker-compose:
services:
  train:
    shm_size: "8g"
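
A training script can also check the shared memory size itself and fail early with a readable message instead of a bus error mid-epoch. The following is a minimal sketch; the 1 GiB default threshold is an arbitrary assumption, not a PyTorch requirement, and the function name is hypothetical.

```python
import shutil


def shm_size_ok(min_gib: float = 1.0, path: str = "/dev/shm") -> bool:
    """Return True if the filesystem at `path` has at least `min_gib` GiB in total."""
    total_bytes = shutil.disk_usage(path).total
    return total_bytes >= min_gib * 1024 ** 3
```

With Docker's 64 MB default, shm_size_ok() returns False. After --shm-size=8g it returns True. Calling it before you build the DataLoader lets you raise a clear error that names the missing flag.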

DNS resolution inside containers

Sometimes containers cannot resolve hostnames (DNS issues).

# Test DNS (getent ships with the base image; nslookup needs the dnsutils package)
docker exec dev-alice getent hosts google.com

# If it fails, try using host networking
docker run --network host ...

# Or set DNS explicitly
docker run --dns 8.8.8.8 --dns 8.8.4.4 ...
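
The same check can be made from Python inside the container, which is useful when a training script needs to download weights or datasets at startup. This sketch uses only the standard library; the function name can_resolve is hypothetical.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver can turn `hostname` into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised for unknown hosts and for resolver failures alike
        return False
```

If can_resolve("google.com") is False while the host itself resolves names fine, the container's resolver configuration is the likely culprit, and the --dns or --network host options above are the usual remedies.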

Container networking

Accessing host services from a container (e.g., MLflow running on host):

# On Linux, use the host gateway
docker run --add-host=host.docker.internal:host-gateway ...

# Then inside the container:
curl http://host.docker.internal:5000

# Or use host networking (simplest, no isolation)
docker run --network host ...

Containers talking to each other:

# Create a network
docker network create ml-net

# Start containers on the same network
docker run --network ml-net --name mlflow-server ...
docker run --network ml-net --name training ...

# Inside the training container, reach mlflow by name:
curl http://mlflow-server:5000

NVIDIA driver vs CUDA toolkit version mismatch

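The NVIDIA driver on the host determines the maximum CUDA version you can use in containers. If the driver is too old for the container's toolkit, you see errors like:

CUDA driver version is insufficient for CUDA runtime version

(A different error, "CUDA error: no kernel image is available for execution on the device", usually means the installed framework build has no kernels for your GPU's compute capability. That is fixed by changing the PyTorch build, not the driver.)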

Check compatibility:

# Host driver version
nvidia-smi   # Shows driver version and max CUDA version

# Container CUDA version
nvcc --version   # Shows toolkit version inside container

The rule: the container's CUDA toolkit version must be <= the host driver's supported CUDA version. Check the NVIDIA compatibility matrix: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
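
The rule is a plain version comparison, sketched below with hypothetical function names. It is a simplification: it compares major.minor only and ignores NVIDIA's minor-version and forward-compatibility mechanisms, so treat a False result as a prompt to check the compatibility matrix rather than as a final verdict.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Turn '12.1' or '12.1.1' into (12, 1); a bare '12' becomes (12, 0)."""
    parts = (text.strip().split(".") + ["0"])[:2]
    return int(parts[0]), int(parts[1])


def toolkit_supported(toolkit: str, driver_max: str) -> bool:
    """True if the container toolkit version is <= the driver's max CUDA version."""
    return parse_version(toolkit) <= parse_version(driver_max)
```

Here toolkit is the release reported by nvcc --version inside the container, and driver_max is the "CUDA Version" shown in the header of nvidia-smi on the host. For instance, toolkit_supported("12.1", "11.8") is False, which matches the failure case described above.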

Docker eating disk space

ML images are large. Docker can silently consume hundreds of gigabytes.

# Check Docker disk usage
docker system df

# Clean up stopped containers, unused networks, unused images, and build cache
# (volumes are kept unless you add --volumes)
docker system prune -a

# Remove only dangling images
docker image prune

# Remove stopped containers
docker container prune

# Remove unused volumes (careful: this deletes data;
# on Docker 23+ only anonymous volumes are removed unless you add -a)
docker volume prune

# Set up a weekly cron job for cleanup
# Add to crontab (crontab -e):
0 3 * * 0 docker image prune -f --filter "until=168h"

GPU not visible inside container

If nvidia-smi fails inside the container:

# Verify NVIDIA Container Toolkit is installed on host
nvidia-container-cli info

# Check Docker runtime configuration
docker info | grep -i runtime

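# Ensure the nvidia runtime is available
# If not, install nvidia-container-toolkit (requires NVIDIA's apt repository),
# register the runtime with Docker, and restart the daemon:
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker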

Environment variables not propagating

Environment variables set on the host are not automatically available inside containers. Pass them explicitly:

# Pass individual variables
docker run -e MY_VAR=value ...

# Pass from host environment
docker run -e MY_VAR ...    # Uses the host's value of MY_VAR

# Pass from a file
docker run --env-file .env ...

The --env-file approach is cleanest for many variables. Create a .env file (and add it to .gitignore):

MLFLOW_TRACKING_URI=http://mlflow:5000
WANDB_API_KEY=abc123
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=secret...
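
Before starting a container, it can be worth checking that the .env file defines every variable the training code expects. The sketch below parses text in the simple KEY=VALUE form shown above. It assumes values are taken literally (quotes are not stripped) and it skips bare KEY lines, which docker run would fill from the host environment. Both function names are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines; blank lines and lines starting with '#' are ignored."""
    variables: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # lines without "=" are skipped in this sketch
            variables[key.strip()] = value
    return variables


def missing_keys(text: str, required: Iterable[str]) -> List[str]:
    """Return the names in `required` that the env text does not define."""
    defined = parse_env_text(text)
    return [name for name in required if name not in defined]
```

A typical call is missing_keys(open(".env").read(), ["MLFLOW_TRACKING_URI", "WANDB_API_KEY"]). An empty list means all required names are present. The helper reports names only, so secret values never end up in logs.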

Quick Reference

Start a new experiment

docker run --gpus all -it --shm-size=8g \
    -v $(pwd):/workspace \
    --name exp-$(date +%Y%m%d-%H%M) \
    myteam/ml-base:cuda12.1-py3.10

Attach to a running container

docker exec -it container-name bash

List running containers

docker ps

List all containers (including stopped)

docker ps -a

Stop and remove a container

docker stop container-name && docker rm container-name

Check GPU usage inside a container

docker exec container-name nvidia-smi

Copy files between host and container

# Host to container
docker cp ./file.txt container-name:/workspace/file.txt

# Container to host
docker cp container-name:/workspace/model.pt ./model.pt

View container resource usage

docker stats

Summary

The daily workflow is:

  1. Build or reuse a base image (once per project/team)
  2. Run a container with GPU access, volume mounts, port mappings
  3. Work inside the container (train models, run notebooks)
  4. Exit when done -- the host is clean

For persistent work, use detached containers with docker exec. For multi-service setups, use docker-compose. For the best IDE experience, use VS Code Dev Containers.

The key insight: treat Docker containers as disposable development environments, not just deployment artifacts.