Proposal: KEP-936 · ROADMAP · SECURITY · CONTRIBUTING
The Kubeflow MCP Server exposes Kubeflow Training operations as Model Context Protocol tools, enabling AI agents (Claude, Cursor, Claude Code, or custom agents) to plan, submit, monitor, and manage training jobs through natural language, without requiring users to learn Kubernetes or the Kubeflow SDK directly.
- Agent-Native: Tools auto-discovered via MCP — no manual API wiring
- Guided Workflow: Phase ordering with next-step hints (Plan → Discover → Train → Monitor)
- Preview-Before-Submit: Every mutating operation requires explicit confirmation
- Security-First: Persona gating, namespace enforcement, input validation, bearer/JWT auth
- Multi-Platform: Auto-detects OpenShift, EKS, GKE with platform-specific guidance
- Token-Efficient: Progressive/semantic modes compress 23 tools into 2-3 meta-tools
- Extensible: Plugin architecture for additional Kubeflow clients (TODO: optimizer, hub)
```bash
git clone https://github.com/kubeflow/mcp-server.git
cd mcp-server
pip install .
kubeflow-mcp serve
```

Once published to PyPI, install with `pip install kubeflow-mcp`.
Once connected, your AI agent can run a complete training workflow through natural language:
```text
User: "Fine-tune gemma-2b on the alpaca dataset"

Agent calls: check_compatibility()        → ✅ K8s 1.29, Trainer CRD installed
Agent calls: get_cluster_resources()      → 4x A100 GPUs available
Agent calls: estimate_resources("google/gemma-2b") → needs ~16GB GPU, 1x A100
Agent calls: list_runtimes()              → torchtune-llama, torchtune-gemma, ...
Agent calls: fine_tune(                   → preview config (confirmed=False)
    model="hf://google/gemma-2b",
    dataset="hf://tatsu-lab/alpaca",
    runtime="torchtune-gemma-2b"
)
Agent calls: fine_tune(..., confirmed=True) → TrainJob "train-gemma-abc" created
Agent calls: get_training_logs("train-gemma-abc") → training progress...
```
Every mutating tool requires confirmed=True — agents always preview before submitting.
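The confirmation contract can be sketched as a plain Python decorator. This is illustrative only: the decorator name `require_confirmation` and the preview payload shape are assumptions, not the server's actual implementation.

```python
import functools

def require_confirmation(tool):
    """Sketch: mutating tools return a preview unless called with confirmed=True."""
    @functools.wraps(tool)
    def wrapper(*args, confirmed=False, **kwargs):
        if not confirmed:
            # First call: return the would-be action for the agent to show the user.
            return {
                "preview": True,
                "action": tool.__name__,
                "args": kwargs,
                "hint": "re-call with confirmed=True to submit",
            }
        return tool(*args, **kwargs)
    return wrapper

@require_confirmation
def fine_tune(model=None, dataset=None, runtime=None):
    # Stand-in for the real submission logic.
    return {"submitted": True, "job": "train-gemma-abc"}
```

The first call (no `confirmed`) yields a preview dict the agent can surface; only the second call, with `confirmed=True`, actually submits.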
Cursor
Add to .cursor/mcp.json (or use the .mcp.json at the repo root for local dev):
```json
{
  "mcpServers": {
    "kubeflow": {
      "command": "uv",
      "args": ["run", "kubeflow-mcp", "serve"]
    }
  }
}
```

Claude Code

```bash
claude mcp add kubeflow -- kubeflow-mcp serve
```

23 tools organized by workflow phase:
| Phase | Tools | Description |
|---|---|---|
| Planning | pre_flight, check_compatibility, get_cluster_resources, estimate_resources | Environment validation and resource estimation |
| Discovery | list_training_jobs, get_training_job, list_runtimes, get_runtime | Browse jobs and available runtimes |
| Training | fine_tune, run_custom_training, run_container_training | Submit LoRA/QLoRA fine-tuning, custom scripts, or container jobs |
| Monitoring | get_training_logs, get_training_events, wait_for_training | Track progress, debug failures |
| Lifecycle | delete_training_job, update_training_job | Manage existing jobs (ownership-guarded) |
| Platform | inspect_crd, inspect_controller, patch_runtime, create_runtime, delete_runtime | Cluster inspection and runtime management |
| Health | health_check, get_server_logs | Server diagnostics |
Compatibility:

| MCP Server | Kubeflow Trainer | Kubeflow SDK | Python | Kubernetes |
|---|---|---|---|---|
| 0.1.x | >= 2.2.0 | >= 0.4.0 | 3.10 - 3.12 | >= 1.27 |
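A pre-flight check like `check_compatibility` presumably compares detected versions against this matrix. A minimal sketch of that gate (the minimums come from the 0.1.x row above; the helper names are invented for illustration):

```python
def parse_version(v: str) -> tuple:
    """Turn '1.29.3' into (1, 29, 3) so versions compare as tuples."""
    return tuple(int(part) for part in v.split("."))

# Minimums from the 0.1.x row of the compatibility matrix.
MINIMUMS = {"trainer": "2.2.0", "sdk": "0.4.0", "kubernetes": "1.27"}

def incompatible_components(detected: dict) -> list:
    """Return the components older than the supported minimum."""
    return [name for name, minimum in MINIMUMS.items()
            if parse_version(detected[name]) < parse_version(minimum)]
```

An environment with Trainer 2.1.0 would be flagged, while one meeting every minimum passes cleanly.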
```bash
kubeflow-mcp serve \
  --clients trainer \        # modules: trainer, optimizer (stub), hub (stub)
  --persona ml-engineer \    # readonly | data-scientist | ml-engineer | platform-admin
  --mode full \              # full | progressive | semantic
  --instruction-tier full \  # full | compact | minimal
  --transport stdio \        # stdio | http | sse
  --auth-token SECRET \      # bearer token for HTTP auth (dev/staging)
  --log-level INFO \         # DEBUG | INFO | WARNING | ERROR
  --log-format console \     # console | json (auto-detected if omitted)
  --no-banner                # suppress startup banner
```

`--mode progressive` exposes 3 meta-tools (~85 tokens) for hierarchical discovery. `--mode semantic` exposes 2 meta-tools (~69 tokens) using embedding search. Both reduce token consumption significantly for agent workflows.
HTTP Authentication
When using --transport http, configure auth to secure the endpoint:
```bash
# Simple API key (dev/staging)
kubeflow-mcp serve --transport http --auth-token my-secret-token

# Or via env var
export KUBEFLOW_MCP_AUTH_TOKEN=my-secret-token
kubeflow-mcp serve --transport http

# JWT verification (production)
export KUBEFLOW_MCP_JWKS_URI=https://auth.example.com/.well-known/jwks.json
export KUBEFLOW_MCP_JWT_ISSUER=https://auth.example.com
export KUBEFLOW_MCP_JWT_AUDIENCE=kubeflow-mcp
kubeflow-mcp serve --transport http
```

Without auth configured, the server logs a warning that the HTTP endpoint is open.
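Whatever the server does internally, a bearer-token check should compare tokens in constant time to avoid timing side channels. A minimal stdlib sketch (the function name and header parsing are assumptions, not the server's code):

```python
import hmac

def check_bearer(authorization_header: str, expected_token: str) -> bool:
    """Accept 'Bearer <token>' only when the token matches, in constant time."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # hmac.compare_digest avoids leaking how many leading chars matched.
    return hmac.compare_digest(token, expected_token)
```

`hmac.compare_digest` is the standard-library primitive for this; a plain `==` can leak the match length through response timing.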
Agent Subcommand
```bash
kubeflow-mcp agent \
  --backend ollama \   # ollama (default; more backends planned)
  --model qwen3:8b \   # model name for the backend
  --mode full \        # full | progressive | semantic
  --thinking           # enable thinking output (supported models)
```

Development

```bash
make install-dev                # setup environment
make verify                     # lint + format check
make test-python                # run tests
make inspector                  # launch MCP Inspector (stdio)
make inspector TRANSPORT=http   # Inspector + Streamable HTTP (start server separately)
make inspector TRANSPORT=sse    # Inspector + SSE (start server separately)
```

- Slack: Join #kubeflow-ml-experience on CNCF Slack
- Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly call
- GitHub: Issues and contributions at kubeflow/mcp-server
- CONTRIBUTING: Development workflow and PR guidelines
- KEP-936: Design proposal
Apache License 2.0 — see LICENSE.