This guide covers using Docker as your daily development environment for machine learning and deep learning work. This is not about building images for production -- it is about using Docker containers as your workspace every single day.
If you follow this workflow, you will never corrupt a shared workstation, never fight dependency conflicts, and never hear "it works on my machine."
Most people think of Docker as a deployment tool: build an image, push it, run it in production. But Docker is equally powerful as a development tool, especially on shared GPU workstations.
Problem 1: Dependency hell. Your project needs PyTorch 2.1 with CUDA 12.1. Your teammate's project needs PyTorch 1.13 with CUDA 11.8. Both of you are on the same machine. Without Docker, one of you loses.
Problem 2: System corruption. Someone runs sudo pip install numpy and overwrites the system numpy. Now the system package manager is broken. With Docker, packages you install inside a container never touch the host's system Python.
Problem 3: "Works on my machine." You trained a model, got 94% accuracy, and handed off the code. Your teammate gets 87%. Why? Different package versions. Docker gives you an exact, reproducible environment.
Problem 4: Cleanup. You tried 15 different packages for an experiment. With Docker, you remove the container and everything is gone. Without Docker, those packages are scattered across your home directory or worse, the system.
Problem 5: Isolation on shared machines. Three people sharing a workstation. Each needs different CUDA toolkit versions, different Python versions, different library versions. Docker gives each person their own isolated environment with zero interference.
Virtual environments (venv, conda) mainly solve the Python package problem. Docker solves:
- Python packages (like venv)
- System libraries (apt packages, shared objects)
- CUDA toolkit version (independent of host CUDA)
- Environment variables
- Configuration files
- The entire filesystem layout
Virtual environments are fine for pure Python work. For ML/DL work involving CUDA, system libraries, and complex dependencies, Docker is strictly better.
A development container is different from a production container. It is designed for interactive use, fast iteration, and developer comfort.
Your production Dockerfile copies code into the image and runs a specific command. Your development Dockerfile sets up the environment and lets you work interactively.
# Dockerfile.dev -- development container
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
# Prevent interactive prompts during apt install
ENV DEBIAN_FRONTEND=noninteractive
# System packages
RUN apt-get update && apt-get install -y \
python3 \
python3-pip \
python3-venv \
git \
vim \
tmux \
curl \
wget \
htop \
tree \
&& rm -rf /var/lib/apt/lists/*
# Python ML packages (the common ones everyone needs)
RUN pip3 install --no-cache-dir \
torch==2.1.0 \
torchvision==0.16.0 \
numpy \
pandas \
scikit-learn \
matplotlib \
jupyterlab \
mlflow \
tensorboard \
tqdm \
pyyaml \
pillow \
opencv-python-headless
# Working directory
WORKDIR /workspace
# Default command: bash shell
CMD ["/bin/bash"]

Compare this with a production Dockerfile:
# Dockerfile -- production container
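FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
# The CUDA base image ships without Python, so install it before calling pip3
RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*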
COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r /app/requirements.txt
COPY . /app
WORKDIR /app
CMD ["python3", "train.py"]

Key differences:
- Dev uses the devel base (includes compilers), production uses runtime (smaller)
- Dev installs developer tools (vim, tmux, htop), production does not
- Dev mounts code as a volume, production copies code into the image
- Dev drops you into bash, production runs a specific command
- Dev includes JupyterLab, production does not
Every flag you need to know:
# GPU access
--gpus all # Pass through all GPUs
--gpus '"device=0"' # Pass through only GPU 0
--gpus '"device=0,1"' # Pass through GPUs 0 and 1
# Volume mounts (your code stays on the host, shared into the container)
-v $(pwd):/workspace # Mount current directory as /workspace
-v /data/datasets:/data # Mount shared dataset directory (read-only: add :ro)
-v /data/datasets:/data:ro # Read-only mount for datasets
# Port mapping (access services inside the container from outside)
-p 8888:8888 # Jupyter
-p 6006:6006 # TensorBoard
-p 5000:5000 # MLflow
# Interactive mode
-it # Interactive terminal (required for bash)
# Detached mode
-d # Run in background
# Container naming
--name alice-yolo-exp1 # Name your container (easier than container IDs)
# Shared memory (CRITICAL for PyTorch DataLoader)
--shm-size=8g # Increase shared memory to 8 GB
# User mapping (match your host user ID)
--user $(id -u):$(id -g) # Run as your host user inside the container
# Environment variables
-e MLFLOW_TRACKING_URI=http://mlflow-server:5000
-e WANDB_API_KEY=your_key_here
# Network
--network host # Use host network (simplest, no port mapping needed)
--network my-bridge # Use a custom bridge network
# Resource limits
--cpus 8 # Limit to 8 CPU cores
--memory 32g # Limit to 32 GB RAM
# Hostname
--hostname dev-alice # Set container hostname

This is the workflow you should follow every day.
Create this file once per project (or once per team and share it):
# Dockerfile.dev
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y \
python3 python3-pip python3-venv \
git vim tmux curl wget htop \
libgl1-mesa-glx libglib2.0-0 \
&& rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir \
torch==2.1.0 torchvision==0.16.0 \
numpy pandas scikit-learn matplotlib \
jupyterlab mlflow tensorboard \
tqdm pyyaml pillow opencv-python-headless
WORKDIR /workspace
CMD ["/bin/bash"]

docker build -t myteam/ml-base:cuda12.1-py3.10 -f Dockerfile.dev .

This takes 10-20 minutes the first time. After that, Docker caches each layer, so rebuilds are fast unless you change the Dockerfile.
docker run \
--gpus all \
-it \
--shm-size=8g \
-v $(pwd):/workspace \
-v /data/datasets:/data:ro \
-p 8888:8888 \
-p 6006:6006 \
--name alice-experiment-1 \
myteam/ml-base:cuda12.1-py3.10

You are now inside the container with a bash prompt. Everything in your current directory is visible at /workspace. All shared datasets are at /data.
# Verify GPU access
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"
# Install experiment-specific packages (only inside this container)
pip3 install ultralytics albumentations
# Run your training
python3 train.py --epochs 50 --batch-size 16
# Or start Jupyter
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root

Any packages you install here exist only in this container. They do not affect the host or any other container.
# Inside the container:
exit

The container stops. Your code changes are preserved (they are on the host via the volume mount). Packages you installed inside the container stay in the stopped container until you remove it; after removal they are gone. The host machine is exactly as clean as before.
To remove the stopped container:
docker rm alice-experiment-1

Sometimes you want a container that persists across sessions. You install extra packages, configure things, and want to come back to the same state tomorrow.
# Start in detached mode (-d instead of -it)
docker run \
--gpus all \
-d \
--shm-size=8g \
-v $(pwd):/workspace \
-v /data/datasets:/data:ro \
-p 8888:8888 \
-p 6006:6006 \
--name dev-alice \
--restart unless-stopped \
myteam/ml-base:cuda12.1-py3.10 \
sleep infinity

The sleep infinity command keeps the container running. The --restart unless-stopped flag restarts it automatically if the machine reboots.
# Open a shell in the running container
docker exec -it dev-alice bash
# You can open multiple shells at once
# Terminal 1:
docker exec -it dev-alice bash
# Terminal 2:
docker exec -it dev-alice bash

# Stop the container (preserves state)
docker stop dev-alice
# Start it again (state is preserved)
docker start dev-alice
# Check if it is running
docker ps | grep dev-alice
# View logs
docker logs dev-alice
# Remove it permanently (state is lost)
docker stop dev-alice && docker rm dev-alice

Use persistent containers when:
- You have a multi-day experiment
- You installed many packages and do not want to reinstall
- You are running a long training job in the background
- You want the same container to always be available
Use ephemeral containers when:
- You are trying something quick
- You want a clean environment every time
- You are testing whether your setup is reproducible
- You are done with an experiment and want to clean up
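The two modes differ in only a few docker run flags. As a minimal sketch, assuming the image name, mounts, and shared-memory size used in the examples above, the helper below builds the argument list for either mode. The function name docker_run_args and its parameters are hypothetical, and port mappings are left out for brevity.

```python
import os
import shlex


def docker_run_args(image, name, persistent=False, project_dir=".",
                    datasets="/data/datasets", shm_size="8g"):
    """Build a docker run argument list for an ephemeral or persistent container."""
    args = ["docker", "run", "--gpus", "all", f"--shm-size={shm_size}"]
    if persistent:
        # Detached, and restarted automatically unless stopped by hand
        args += ["-d", "--restart", "unless-stopped"]
    else:
        # Interactive terminal; the container stops when the shell exits
        args += ["-it"]
    args += [
        "-v", f"{os.path.abspath(project_dir)}:/workspace",  # code, read-write
        "-v", f"{datasets}:/data:ro",                         # datasets, read-only
        "--name", name,
        image,
    ]
    if persistent:
        # Keeps the container alive so docker exec can attach later
        args += ["sleep", "infinity"]
    return args


if __name__ == "__main__":
    image = "myteam/ml-base:cuda12.1-py3.10"
    print(shlex.join(docker_run_args(image, "alice-experiment-1")))
    print(shlex.join(docker_run_args(image, "dev-alice", persistent=True)))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; pass the list to subprocess.run once the output looks right.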
If you installed many packages in a persistent container and want to save that state as a new image:
docker commit dev-alice myteam/ml-alice-exp:v1

Now you can create new containers from that snapshot. But prefer putting everything in a Dockerfile instead -- it is more reproducible.
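One way to move from a hand-configured container toward a Dockerfile is to record exactly what is installed. The sketch below, which uses only the standard library, lists every installed distribution as a pinned requirement, roughly what pip freeze prints. The script name freeze_env.py and the output path in the comment are hypothetical.

```python
from importlib.metadata import distributions


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Run inside the container and redirect to the mounted volume, for example:
    #   python3 freeze_env.py > /workspace/requirements.lock.txt
    print("\n".join(pinned_requirements()))
```

The resulting file can be copied into the image and installed with pip3 install -r in Dockerfile.dev, so the next build reproduces the container's packages without docker commit.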
For complex setups with multiple services (training + MLflow + Jupyter + TensorBoard), docker-compose is the way to go.
# docker-compose.yml
version: "3.8"
services:
  # ----- Training / Development container -----
  train:
    build:
      context: .
      dockerfile: Dockerfile.dev
    container_name: ml-train
    command: sleep infinity
    volumes:
      - .:/workspace
      - /data/datasets:/data:ro
      - train-pip-cache:/root/.cache/pip
    ports:
      - "8888:8888" # Jupyter
      - "6006:6006" # TensorBoard
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - CUDA_VISIBLE_DEVICES=0
    shm_size: "8g"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    depends_on:
      - mlflow
  # ----- MLflow tracking server -----
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.9.2
    container_name: ml-mlflow
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri sqlite:///mlflow/mlflow.db
      --default-artifact-root /mlflow/artifacts
    volumes:
      - mlflow-data:/mlflow
    ports:
      - "5000:5000"
  # ----- TensorBoard (optional, if you prefer separate) -----
  tensorboard:
    image: tensorflow/tensorflow:latest
    container_name: ml-tensorboard
    command: tensorboard --logdir /logs --host 0.0.0.0 --port 6007
    volumes:
      - ./runs:/logs
    ports:
      - "6007:6007"
  # ----- MinIO (S3-compatible storage for artifacts) -----
  minio:
    image: minio/minio:latest
    container_name: ml-minio
    command: server /data --console-address ":9001"
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
volumes:
  mlflow-data:
  minio-data:
  train-pip-cache:

# Start everything
docker compose up -d
# Check status
docker compose ps
# Attach to the training container
docker exec -it ml-train bash
# View logs from all services
docker compose logs -f
# View logs from one service
docker compose logs -f mlflow
# Stop everything (preserves data)
docker compose stop
# Stop and remove everything (preserves named volumes)
docker compose down
# Stop, remove, and delete all data
docker compose down -v

Commit the docker-compose.yml and Dockerfile.dev to your project repo. Every team member clones the repo and runs docker compose up -d. Everyone gets the exact same environment.
VS Code can connect directly to a Docker container and give you a full IDE experience inside it. This is the most comfortable development workflow.
- Install VS Code extensions:
  - "Remote - SSH" (for connecting to the workstation)
  - "Dev Containers" (for connecting to Docker containers)
- Create a .devcontainer/devcontainer.json in your project:
{
"name": "ML Development",
"dockerFile": "../Dockerfile.dev",
"runArgs": [
"--gpus", "all",
"--shm-size=8g",
"-v", "/data/datasets:/data:ro"
],
"forwardPorts": [8888, 6006, 5000],
"customizations": {
"vscode": {
"extensions": [
"ms-python.python",
"ms-python.vscode-pylance",
"ms-toolsai.jupyter",
"mhutchie.git-graph"
],
"settings": {
"python.defaultInterpreterPath": "/usr/bin/python3",
"python.linting.enabled": true,
"terminal.integrated.defaultProfile.linux": "bash"
}
}
},
"postCreateCommand": "pip install -e .",
"remoteUser": "root",
"workspaceMount": "source=${localWorkspaceFolder},target=/workspace,type=bind",
"workspaceFolder": "/workspace"
}

The full chain:
- Your laptop runs VS Code
- VS Code connects to the workstation via SSH (Remote - SSH extension)
- On the workstation, VS Code starts a Docker container (Dev Containers extension)
- You edit code as if it were local, but it runs inside the container on the GPU
This gives you: local IDE comfort + remote GPU power + Docker isolation.
- SSH into the workstation in VS Code (Remote-SSH: Connect to Host)
- Open your project folder
- VS Code detects .devcontainer/devcontainer.json and prompts you
- Click "Reopen in Container"
- VS Code rebuilds/starts the container and connects
Or manually: Command Palette > "Dev Containers: Reopen in Container"
Running JupyterLab inside a Docker container with GPU access.
# Option 1: Run Jupyter directly
docker run \
--gpus all \
-it \
--shm-size=8g \
-v $(pwd):/workspace \
-p 8888:8888 \
myteam/ml-base:cuda12.1-py3.10 \
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root
# Option 2: Start Jupyter inside an existing container
docker exec -it dev-alice \
jupyter lab --ip 0.0.0.0 --port 8888 --no-browser --allow-root

Jupyter prints a URL with a token:
http://127.0.0.1:8888/lab?token=abc123def456...
If the workstation is remote:
# On your laptop:
ssh -L 8888:localhost:8888 user@workstation
# Then open in your browser:
# http://localhost:8888/lab?token=abc123def456...

Because you mounted your project directory as a volume (-v $(pwd):/workspace), any notebooks you create or modify inside /workspace are saved on the host. They survive container restarts and removal.
Notebooks saved outside the mounted volume (for example, in /root or /tmp)
are lost when the container is removed.
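The rule can be checked mechanically before you save anything important. This is a minimal sketch, assuming /workspace is the only writable bind mount, as in the docker run examples above; the function name survives_removal is hypothetical, and the check is purely lexical (it does not resolve symlinks or ".." segments).

```python
from pathlib import PurePosixPath


def survives_removal(path, mounts=("/workspace",)):
    """True if path lies inside one of the bind-mounted directories."""
    target = PurePosixPath(path)
    for mount in mounts:
        mount_path = PurePosixPath(mount)
        # Compare whole path components, so /workspace2 does not match /workspace
        if target == mount_path or mount_path in target.parents:
            return True
    return False


if __name__ == "__main__":
    for candidate in ("/workspace/notebooks/eda.ipynb", "/root/eda.ipynb", "/tmp/scratch.ipynb"):
        status = "kept on host" if survives_removal(candidate) else "lost with the container"
        print(f"{candidate}: {status}")
```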
# Generate a password hash
docker exec -it dev-alice python3 -c \
"from jupyter_server.auth import passwd; print(passwd('mypassword'))"
# Start Jupyter with the password
docker exec -it dev-alice jupyter lab \
--ip 0.0.0.0 \
--port 8888 \
--no-browser \
--allow-root \
--ServerApp.password='argon2:...(the hash from above)...'

If you need multiple Python environments accessible from Jupyter:
# Inside the container, create a venv and register it as a kernel
python3 -m venv /opt/torch21
source /opt/torch21/bin/activate
pip install torch==2.1.0 ipykernel
python -m ipykernel install --name torch21 --display-name "PyTorch 2.1"
deactivate
# Now Jupyter shows "PyTorch 2.1" as a kernel option

These issues catch everyone at some point. Read them now so you recognize the symptoms immediately.
By default, processes inside the container run as root. Files created inside the mounted volume are owned by root on the host.
# Problem: files created in container are owned by root
$ ls -la /workspace/output/
-rw-r--r-- 1 root root 1234 Mar 15 10:00 model.pt
# Solution 1: run container as your user
docker run --user $(id -u):$(id -g) ...
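# Solution 2: fix ownership from inside the container (as root)
# Use your host IDs here (run id -u and id -g on the host); 1000:1000 is only an example.
# Inside a root container, $(id -u) would expand to 0 and change nothing.
chown -R 1000:1000 /workspace/output/

If you use --user, you may not be able to install packages with pip (no write access to system directories). Workarounds: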
- Install packages in the Dockerfile (preferred)
- Use pip install --user (installs to ~/.local)
- Mount a pip cache volume
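To see how much of a project directory is affected, you can scan it for root-owned entries from the host. The sketch below uses only the standard library; the function name files_owned_by and the example path are hypothetical.

```python
import os


def files_owned_by(root_dir, uid=0):
    """Return paths under root_dir whose owner has the given numeric uid."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reports on a symlink itself rather than on its target
            if os.lstat(path).st_uid == uid:
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # uid 0 is root; "output" is an example directory
    for path in files_owned_by("output"):
        print(path)
```

An empty result means containers have not left root-owned files behind in that directory.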
PyTorch DataLoader with num_workers > 0 uses shared memory (/dev/shm).
Docker limits this to 64 MB by default. You will get cryptic errors if your
data loading exceeds this.
RuntimeError: DataLoader worker (pid 1234) is killed by signal: Bus error.
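Before changing anything, it can help to confirm how much shared memory the container actually has. This is a minimal sketch to run inside the container, using only the standard library; the function names are hypothetical, and the threshold is the 64 MB default mentioned above.

```python
import os
import shutil

DOCKER_DEFAULT_SHM = 64 * 1024 * 1024  # 64 MB, the default stated above


def shm_total_bytes(path="/dev/shm"):
    """Return the size of the filesystem at path in bytes, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


def shm_looks_default(path="/dev/shm"):
    """True when the mount is no larger than Docker's default size."""
    total = shm_total_bytes(path)
    return total is not None and total <= DOCKER_DEFAULT_SHM


if __name__ == "__main__":
    total = shm_total_bytes()
    if total is None:
        print("/dev/shm not found")
    else:
        print(f"/dev/shm: {total / 2**20:.0f} MiB")
        if shm_looks_default():
            print("Looks like the Docker default; consider --shm-size=8g")
```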
Fix:
# Always set --shm-size when using PyTorch
docker run --shm-size=8g ...
# Or in docker-compose:
services:
  train:
    shm_size: "8g"

Sometimes containers cannot resolve hostnames (DNS issues).
# Test DNS (getent ships with the base image; nslookup would need the dnsutils package)
docker exec dev-alice getent hosts google.com
# If it fails, try using host networking
docker run --network host ...
# Or set DNS explicitly
docker run --dns 8.8.8.8 --dns 8.8.4.4 ...

Accessing host services from a container (e.g., MLflow running on host):
# On Linux, use the host gateway
docker run --add-host=host.docker.internal:host-gateway ...
# Then inside the container:
curl http://host.docker.internal:5000
# Or use host networking (simplest, no isolation)
docker run --network host ...

Containers talking to each other:
# Create a network
docker network create ml-net
# Start containers on the same network
docker run --network ml-net --name mlflow-server ...
docker run --network ml-net --name training ...
# Inside the training container, reach mlflow by name:
curl http://mlflow-server:5000

The NVIDIA driver on the host determines the maximum CUDA version you can use in containers. If you see errors like:
CUDA driver version is insufficient for CUDA runtime version
Check compatibility:
# Host driver version
nvidia-smi # Shows driver version and max CUDA version
# Container CUDA version
nvcc --version # Shows toolkit version inside container

The rule: the container's CUDA toolkit version must be <= the host driver's supported CUDA version. Check the NVIDIA compatibility matrix: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
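The rule is a plain version comparison, which is easy to get wrong with string comparison ("12.10" sorts before "12.9" as text). The sketch below applies the rule exactly as stated, comparing major and minor numbers only. The function names are hypothetical, and you supply the two version strings yourself from the nvidia-smi and nvcc output.

```python
def parse_version(text):
    """Turn '12.1' or '12.1.1' into a tuple of ints for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def toolkit_supported(toolkit, driver_max):
    """Apply the rule: toolkit version <= highest CUDA version the driver supports."""
    # Only major.minor is compared, which is the granularity nvidia-smi reports
    return parse_version(toolkit)[:2] <= parse_version(driver_max)[:2]


if __name__ == "__main__":
    # Example values: container toolkit 12.1.1, host driver reporting CUDA 12.2
    print(toolkit_supported("12.1.1", "12.2"))
```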
ML images are large. Docker can silently consume hundreds of gigabytes.
# Check Docker disk usage
docker system df
# Clean up stopped containers, unused networks, all unused images, and build cache
# (volumes are kept unless you add --volumes)
docker system prune -a
# Remove only dangling images
docker image prune
# Remove stopped containers
docker container prune
# Remove unused volumes (careful: this deletes data)
docker volume prune
# Set up a weekly cron job for cleanup
# Add to crontab (crontab -e):
0 3 * * 0 docker image prune -f --filter "until=168h"

If nvidia-smi fails inside the container:
# Verify NVIDIA Container Toolkit is installed on host
nvidia-container-cli info
# Check Docker runtime configuration
docker info | grep -i runtime
# Ensure the nvidia runtime is available
# If not, install nvidia-container-toolkit:
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Environment variables set on the host are not automatically available inside containers. Pass them explicitly:
# Pass individual variables
docker run -e MY_VAR=value ...
# Pass from host environment
docker run -e MY_VAR ... # Uses the host's value of MY_VAR
# Pass from a file
docker run --env-file .env ...

The --env-file approach is cleanest for many variables. Create a .env file (and add it to .gitignore):
MLFLOW_TRACKING_URI=http://mlflow:5000
WANDB_API_KEY=abc123
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=secret...
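A quick sanity check of the file can catch surprises before the container starts. The sketch below is a simplified reader, not Docker's own parser. It assumes the usual docker run --env-file conventions: lines starting with # are comments, KEY=VALUE sets a value, a bare KEY takes its value from the host, and quotes are passed through literally rather than stripped. The function name read_env_file is hypothetical.

```python
def read_env_file(text):
    """Parse env-file text; return (variables, warnings)."""
    variables, warnings = {}, []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no variables
        key, sep, value = line.partition("=")
        if not sep:
            # Bare name: assumed to be taken from the host environment
            variables[key] = None
            continue
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            warnings.append(f"line {number}: {key} is quoted; the quotes become part of the value")
        variables[key] = value
    return variables, warnings


if __name__ == "__main__":
    with open(".env") as handle:
        found, problems = read_env_file(handle.read())
    print("variables:", ", ".join(sorted(found)))  # names only, never the secrets
    for problem in problems:
        print("warning:", problem)
```

Printing only the variable names keeps keys such as WANDB_API_KEY out of terminal scrollback and logs.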
docker run --gpus all -it --shm-size=8g \
-v $(pwd):/workspace \
--name exp-$(date +%Y%m%d-%H%M) \
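myteam/ml-base:cuda12.1-py3.10

# Attach to a running container
docker exec -it container-name bash

# List running containers
docker ps

# List all containers, including stopped ones
docker ps -a

# Stop and remove a container
docker stop container-name && docker rm container-name

# Check GPU visibility inside a container
docker exec container-name nvidia-smi

# Host to container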
docker cp ./file.txt container-name:/workspace/file.txt
# Container to host
docker cp container-name:/workspace/model.pt ./model.pt

# Live CPU, memory, and I/O usage per container
docker stats

The daily workflow is:
- Build or reuse a base image (once per project/team)
- Run a container with GPU access, volume mounts, port mappings
- Work inside the container (train models, run notebooks)
- Exit when done -- the host is clean
For persistent work, use detached containers with docker exec. For
multi-service setups, use docker-compose. For the best IDE experience,
use VS Code Dev Containers.
The key insight: treat Docker containers as disposable development environments, not just deployment artifacts.