< Back to Learning Path


Module 3: ML Workstation Setup

This module covers setting up and using shared GPU workstations for ML development. It is specifically designed for teams where multiple people share one or more machines with NVIDIA GPUs.

A shared GPU workstation is often the most cost-effective way to do ML work. Cloud GPU instances cost roughly $0.50-3.00 per hour, so a desktop workstation with an RTX 4090 can pay for itself within weeks of heavy use. But sharing a machine requires discipline: one person's careless sudo pip install can break everyone's environment.
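The break-even claim can be checked with simple arithmetic. The sketch below is illustrative only: the workstation price of $3,000 is a hypothetical figure (the module does not state one), and it ignores electricity, maintenance, and resale value. The hourly rates are the two ends of the range quoted above.

```python
def break_even_hours(workstation_cost_usd: float, cloud_rate_usd_per_hour: float) -> float:
    """Hours of GPU use after which buying costs less than renting."""
    if cloud_rate_usd_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return workstation_cost_usd / cloud_rate_usd_per_hour


def hours_to_weeks(hours: float, hours_per_week: float = 168.0) -> float:
    """Convert hours to weeks; 168 hours per week means round-the-clock use."""
    return hours / hours_per_week


# Hypothetical purchase price, chosen only for illustration.
ASSUMED_WORKSTATION_COST_USD = 3000.0

for rate in (0.50, 3.00):
    hours = break_even_hours(ASSUMED_WORKSTATION_COST_USD, rate)
    weeks = hours_to_weeks(hours)
    print(f"${rate:.2f}/hour: break-even after {hours:.0f} hours (~{weeks:.1f} weeks of 24/7 use)")
```

Under these assumptions, the break-even point is roughly 6 weeks of continuous use against a $3.00/hour instance and roughly 36 weeks against a $0.50/hour instance, so "weeks" holds mainly when the comparison is with the more expensive instances.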

This module teaches you how to set up a workstation correctly from day one, how to work remotely on it, how to use Docker as your daily development environment, how to share GPUs across a team, and how to run a complete MLOps stack locally. It is designed to be thorough enough to follow entirely on your own, without an instructor.

Estimated time: about 5.5 hours, the sum of the per-file times in the table below (read the guides, then work through the exercises).

Topics

| File | What You Will Learn | Time |
| --- | --- | --- |
| system-setup-guide.md | Platform-specific setup (Linux, macOS, Windows) | 1 hour |
| remote-development-setup.md | SSH, VS Code Remote, port forwarding, tmux | 45 min |
| workstation-best-practices.md | Golden rules for shared GPU workstations | 30 min |
| docker-development-workflow.md | Docker as your daily ML dev environment | 45 min |
| gpu-sharing-guide.md | Sharing GPUs across a team, monitoring, MIG | 30 min |
| team-image-management.md | Building, naming, storing, cleaning images | 20 min |
| local-mlops-stack.md | MLflow + MinIO + K3s locally, no cloud needed | 30 min |
| adding-new-workstations.md | Bringing a new GPU machine online end to end | 20 min |
| exercises.md | Hands-on practice with workstation workflow | 45 min |

Prerequisites

  • Completed Modules 0-1 (Python venvs, bash, Docker basics)
  • Access to a Linux machine with an NVIDIA GPU (or plans to set one up)
  • An SSH client on your laptop (built-in on macOS/Linux, available on Windows)

Who Needs This Module

Definitely read this if:

  • Your team shares a GPU workstation
  • You are setting up a new GPU machine for ML work
  • You work remotely and connect to a GPU machine via SSH

You can skip this if:

  • You only use cloud GPU instances (AWS, GCP, Azure)
  • You have a personal GPU workstation that no one else uses
  • You already have an established workstation workflow

How This Connects to the Pipeline

This repo is designed to run on AWS EC2 cloud GPU instances, but the same Docker images, Kubernetes manifests, and MLflow setup also work on a local GPU workstation. Module 3 bridges the gap:

  • The Dockerfile builds the same training image on your workstation
  • K3s can be installed locally instead of on EC2
  • MLflow can run locally instead of on a cloud server
  • You can develop and test locally, then deploy to AWS for production runs

The workstation is your development environment. AWS is your production environment. The tools (Docker, K3s, MLflow) are the same in both.
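One way to keep the same training code working in both environments is to read the MLflow server address from the environment instead of hard-coding it. The sketch below is a minimal illustration, not the repo's actual configuration: the function name resolve_tracking_uri is hypothetical, and the fallback address assumes a local MLflow server on port 5000 (the default port of the mlflow server command). MLFLOW_TRACKING_URI is the environment variable that MLflow itself reads.

```python
import os
from typing import Mapping, Optional

# Assumed address of a locally running MLflow server; adjust to your setup.
DEFAULT_LOCAL_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the MLflow tracking URI for the current environment.

    Uses MLFLOW_TRACKING_URI when it is set (for example on the AWS side)
    and falls back to the assumed local server otherwise.
    """
    env = os.environ if env is None else env
    return env.get("MLFLOW_TRACKING_URI", DEFAULT_LOCAL_TRACKING_URI)


print(resolve_tracking_uri({}))
print(resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://mlflow.example.internal:5000"}))
```

With this pattern, switching from the workstation to AWS only requires exporting a different MLFLOW_TRACKING_URI before the job starts; the training script itself stays unchanged. The host name in the second call is a placeholder.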

Checklist Before Moving to Module 4

  • You can SSH into a remote machine and run commands
  • Docker is installed on the GPU workstation with NVIDIA support
  • You understand why global pip installs are dangerous on shared machines
  • You can forward ports over SSH (for MLflow, Jupyter, etc.)
  • You have VS Code Remote SSH or equivalent set up
  • You can run a training job inside Docker on the workstation
  • You can build and run a development container with GPU access
  • You know how to check GPU availability and claim a specific GPU
  • You can run JupyterLab inside a Docker container and access it from your laptop
  • You have a local MLflow server running (docker-compose or standalone)
  • You understand image naming conventions and how to use a local registry
  • You could set up a new GPU workstation from scratch using the guide
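As a rough illustration of the checklist item on checking GPU availability and claiming a specific GPU, the sketch below parses the CSV output of nvidia-smi and selects the GPU with the least memory in use. It is not the procedure from gpu-sharing-guide.md: treating "least memory used" as "available" is an assumption, the helper names are hypothetical, and your team may use a different convention (for example a reservation sheet). The nvidia-smi query flags and the CUDA_VISIBLE_DEVICES variable are standard.

```python
import os
import subprocess
from typing import List, MutableMapping, Optional, Tuple

# Standard nvidia-smi query: one line per GPU, e.g. "0, 20000, 24564" (MiB).
QUERY_COMMAND = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def query_gpus() -> str:
    """Run nvidia-smi on the workstation and return its raw CSV output."""
    result = subprocess.run(QUERY_COMMAND, capture_output=True, text=True, check=True)
    return result.stdout


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int, int]]:
    """Parse lines of 'index, used, total' into (index, used_mib, total_mib).

    Assumes every field is numeric; some devices report '[N/A]' instead,
    which this sketch does not handle.
    """
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        gpus.append((int(index), int(used), int(total)))
    return gpus


def pick_least_used_gpu(gpus: List[Tuple[int, int, int]]) -> int:
    """Return the index of the GPU with the least memory currently in use."""
    if not gpus:
        raise RuntimeError("no GPUs reported")
    return min(gpus, key=lambda gpu: gpu[1])[0]


def claim_gpu(index: int, env: Optional[MutableMapping[str, str]] = None) -> None:
    """Restrict this process (and its children) to a single GPU.

    Must be called before the ML framework initializes CUDA.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = str(index)


# Sample output for a two-GPU machine, so the sketch runs without a GPU.
sample_output = "0, 20000, 24564\n1, 12, 24564\n"
chosen = pick_least_used_gpu(parse_gpu_memory(sample_output))
print(f"Would claim GPU {chosen}")
```

On a real workstation, replace sample_output with query_gpus() and call claim_gpu(chosen) before importing or initializing your training framework. Note that CUDA_VISIBLE_DEVICES only limits what your own process sees; it does not stop teammates from using the same GPU.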
