This module covers setting up and using shared GPU workstations for ML development. It is specifically designed for teams where multiple people share one or more machines with NVIDIA GPUs.
A shared GPU workstation is often the most cost-effective way to do ML work. Cloud GPU instances typically cost $0.50-3.00/hour, so a desktop workstation with an RTX 4090 can pay for itself within weeks of heavy use. But sharing a machine requires discipline: one person's careless `sudo pip install` can break everyone's environment.
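The break-even claim can be checked with simple arithmetic. The sketch below is illustrative only: the $3,000 workstation price is an assumed figure, not one taken from this module, and the calculation ignores electricity, maintenance, and resale value.

```python
def break_even_hours(hardware_cost_usd: float, cloud_rate_usd_per_hour: float) -> float:
    """Return the GPU-hours after which buying costs less than renting."""
    if cloud_rate_usd_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return hardware_cost_usd / cloud_rate_usd_per_hour


# ASSUMPTION: a complete RTX 4090 workstation costs about $3,000.
# Substitute your own quote; this number is not from the module.
ASSUMED_WORKSTATION_COST_USD = 3000.0

for rate in (0.50, 3.00):  # the hourly cloud price range quoted above
    hours = break_even_hours(ASSUMED_WORKSTATION_COST_USD, rate)
    weeks = hours / (24 * 7)  # weeks of round-the-clock use
    print(f"${rate:.2f}/hour: {hours:,.0f} GPU-hours, about {weeks:.1f} weeks running 24/7")
```

Under these assumptions the payback period depends heavily on which cloud instance you would otherwise rent, so run the numbers with your own prices before deciding.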
This module teaches you how to set up a workstation correctly from day one, how to work remotely on it, how to use Docker as your daily development environment, how to share GPUs across a team, and how to run a complete MLOps stack locally. It is designed to be thorough enough to follow entirely on your own, without an instructor.
Estimated time: about 5.5 hours in total, the sum of the per-file times below (read the guides, then work through the exercises).
| File | What You Will Learn | Time |
|---|---|---|
| system-setup-guide.md | Platform-specific setup (Linux, macOS, Windows) | 1 hour |
| remote-development-setup.md | SSH, VS Code Remote, port forwarding, tmux | 45 min |
| workstation-best-practices.md | Golden rules for shared GPU workstations | 30 min |
| docker-development-workflow.md | Docker as your daily ML dev environment | 45 min |
| gpu-sharing-guide.md | Sharing GPUs across a team, monitoring, MIG | 30 min |
| team-image-management.md | Building, naming, storing, cleaning images | 20 min |
| local-mlops-stack.md | MLflow + MinIO + K3s locally, no cloud needed | 30 min |
| adding-new-workstations.md | Bringing a new GPU machine online end to end | 20 min |
| exercises.md | Hands-on practice with workstation workflow | 45 min |

Prerequisites:
- Completed Modules 0-1 (Python venvs, bash, Docker basics)
- Access to a Linux machine with an NVIDIA GPU (or plans to set one up)
- An SSH client on your laptop (built-in on macOS/Linux, available on Windows)

Take this module if:
- Your team shares a GPU workstation
- You are setting up a new GPU machine for ML work
- You work remotely and connect to a GPU machine via SSH

Skip this module if:
- You only use cloud GPU instances (AWS, GCP, Azure)
- You have a personal GPU workstation that no one else uses
- You already have an established workstation workflow
This repo is designed to run on AWS EC2 (a cloud GPU instance). But the same Docker images, Kubernetes manifests, and MLflow setup work on a local GPU workstation. Module 3 bridges the gap:
- The `Dockerfile` builds the same training image on your workstation
- K3s can be installed locally instead of on EC2
- MLflow can run locally instead of on a cloud server
- You can develop and test locally, then deploy to AWS for production runs
The workstation is your development environment. AWS is your production environment. The tools (Docker, K3s, MLflow) are the same in both.
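One way to keep training code identical in both environments is to read the MLflow server address from the environment instead of hard-coding it. The sketch below assumes a local MLflow server on `http://localhost:5000` (the default port for `mlflow server`); the helper name `resolve_tracking_uri` is hypothetical. `MLFLOW_TRACKING_URI` is the environment variable that MLflow itself reads.

```python
import os

# ASSUMPTION: the local MLflow server listens on its default port, 5000.
LOCAL_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(env=None):
    """Return the MLflow tracking URI for the current environment.

    Uses MLFLOW_TRACKING_URI when it is set (for example on AWS) and
    falls back to the local workstation server otherwise.
    """
    env = os.environ if env is None else env
    return env.get("MLFLOW_TRACKING_URI", LOCAL_TRACKING_URI)


print(resolve_tracking_uri())
```

On the workstation you would leave the variable unset; for a production run you would export it to point at the remote server, and the training script itself would not change.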

Completion checklist:
- You can SSH into a remote machine and run commands
- Docker is installed on the GPU workstation with NVIDIA support
- You understand why global pip installs are dangerous on shared machines
- You can forward ports over SSH (for MLflow, Jupyter, etc.)
- You have VS Code Remote SSH or equivalent set up
- You can run a training job inside Docker on the workstation
- You can build and run a development container with GPU access
- You know how to check GPU availability and claim a specific GPU
- You can run JupyterLab inside a Docker container and access it from your laptop
- You have a local MLflow server running (docker-compose or standalone)
- You understand image naming conventions and how to use a local registry
- You could set up a new GPU workstation from scratch using the guide
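
For the checklist item on checking GPU availability and claiming a specific GPU, here is a minimal sketch of one possible approach. It is not necessarily the procedure in gpu-sharing-guide.md. It parses the output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and sets `CUDA_VISIBLE_DEVICES`. The function names and the 500 MiB "idle" threshold are assumptions made for illustration.

```python
import os
import subprocess

# Ask nvidia-smi for one CSV line per GPU: index, memory used, memory total (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_report(report: str) -> list[dict]:
    """Turn nvidia-smi CSV output into a list of per-GPU dictionaries."""
    gpus = []
    for line in report.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        gpus.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
    return gpus


def pick_idle_gpu(gpus: list[dict], max_used_mib: int = 500):
    """Return the index of the least-used GPU under the threshold, or None.

    ASSUMPTION: a GPU with at most 500 MiB in use counts as idle.
    """
    idle = [gpu for gpu in gpus if gpu["used_mib"] <= max_used_mib]
    if not idle:
        return None
    return min(idle, key=lambda gpu: gpu["used_mib"])["index"]


def claim_gpu() -> int:
    """Pick an idle GPU and restrict this process to it."""
    report = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    index = pick_idle_gpu(parse_gpu_report(report))
    if index is None:
        raise RuntimeError("No idle GPU found; check with your teammates.")
    # Must be set before any CUDA library initializes in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return index
```

Call `claim_gpu()` at the top of a training script, before importing a CUDA framework. This is a convention, not a lock: two people running it at the same moment can still pick the same GPU, so teams usually pair it with an agreed schedule or channel announcement.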