
Developing mlx-stack

A comprehensive guide for contributors working on the mlx-stack codebase.


Table of Contents

  1. Prerequisites
  2. Getting Started
  3. Project Architecture
  4. Key Concepts
  5. Adding a New Model
  6. Adding a New CLI Command
  7. Testing
  8. Code Quality
  9. Configuration System
  10. Process Management
  11. Commit Conventions
  12. PR Process

Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Python | 3.13+ | Apple Silicon native build recommended |
| uv | 0.10+ | Fast Python package manager — install guide |
| macOS | Apple Silicon (M1+) | Intel Macs and Linux are not supported |
| Git | 2.30+ | For version control |

mlx-stack is designed exclusively for Apple Silicon Macs. Hardware detection, model serving (vllm-mlx), and performance benchmarks all depend on the Metal GPU and unified memory architecture.
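Hardware detection lives in core/hardware.py and, as the mocking examples in the Testing section suggest, is built around a `_run_sysctl` helper. A rough, hypothetical sketch of that shape (the real signatures may differ):

```python
import subprocess


def _run_sysctl(key: str) -> str:
    """Return `sysctl -n <key>` output, e.g. 'Apple M4 Pro' for
    machdep.cpu.brand_string. Only works on macOS."""
    result = subprocess.run(
        ["sysctl", "-n", key], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def parse_chip(brand_string: str) -> str:
    """Strip the vendor prefix from the CPU brand string: 'Apple M4 Pro' -> 'M4 Pro'."""
    return brand_string.removeprefix("Apple ").strip()
```

Splitting the subprocess call from the pure parsing step is what makes the unit tests below possible: tests monkeypatch `_run_sysctl` and exercise the parsing logic directly.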


Getting Started

1. Clone the repository

git clone https://github.com/weklund/mlx-stack.git
cd mlx-stack

2. Install dependencies

uv sync

This creates a virtual environment in .venv/ and installs all runtime and dev dependencies (pytest, pyright, ruff, etc.) from pyproject.toml.

3. Verify the installation

uv run mlx-stack --help

You should see the Rich-formatted help output listing all commands grouped by category (Setup & Configuration, Model Management, Stack Lifecycle, Diagnostics).

4. Run the tests

uv run pytest

All unit tests should pass. Integration tests are excluded by default (see Testing for details).


Project Architecture

src/mlx_stack/
├── __init__.py              # Package root — exports __version__
├── py.typed                 # PEP 561 marker for type checking
├── cli/                     # CLI layer — thin Click wrappers
│   ├── __init__.py          # Exports the cli() entry point
│   ├── main.py              # Click group, --help/--version, typo suggestions
│   ├── profile.py           # mlx-stack profile
│   ├── config.py            # mlx-stack config (set/get/list/reset)
│   ├── recommend.py         # mlx-stack recommend
│   ├── init.py              # mlx-stack init
│   ├── models.py            # mlx-stack models
│   ├── pull.py              # mlx-stack pull
│   ├── up.py                # mlx-stack up
│   ├── down.py              # mlx-stack down
│   ├── status.py            # mlx-stack status
│   ├── bench.py             # mlx-stack bench
│   ├── logs.py              # mlx-stack logs
│   ├── watch.py             # mlx-stack watch
│   └── install.py           # mlx-stack install / uninstall
├── core/                    # Business logic — importable without CLI
│   ├── __init__.py
│   ├── paths.py             # Path management (MLX_STACK_HOME, data dirs)
│   ├── hardware.py          # Apple Silicon detection, bandwidth lookup
│   ├── catalog.py           # YAML catalog loading, validation, querying
│   ├── config.py            # ConfigKeyDef system, get/set/validate/persist
│   ├── deps.py              # Dependency management (vllm-mlx, litellm)
│   ├── scoring.py           # Recommendation engine, tier assignment
│   ├── stack_init.py        # Stack definition + LiteLLM config generation
│   ├── litellm_gen.py       # LiteLLM YAML config builder
│   ├── models.py            # Local model inventory management
│   ├── process.py           # PID files, lockfile, health checks, start/stop
│   ├── stack_up.py          # Orchestrates starting all services
│   ├── stack_down.py        # Orchestrates stopping all services
│   ├── stack_status.py      # 5-state health reporting
│   ├── pull.py              # Model download with HuggingFace Hub
│   ├── benchmark.py         # Benchmarking engine (prompt/gen TPS)
│   ├── log_rotation.py      # Copytruncate log rotation
│   ├── log_viewer.py        # Log viewing, following, archive reading
│   ├── watchdog.py          # Health monitor with auto-restart
│   └── launchd.py           # macOS LaunchAgent integration
├── data/                    # Static data shipped with the package
│   ├── __init__.py
│   └── catalog/             # 15 model YAML files
│       ├── qwen3.5-0.8b.yaml
│       ├── qwen3.5-3b.yaml
│       ├── qwen3.5-8b.yaml
│       ├── qwen3.5-14b.yaml
│       ├── qwen3.5-32b.yaml
│       ├── qwen3.5-72b.yaml
│       ├── gemma3-4b.yaml
│       ├── gemma3-12b.yaml
│       ├── gemma3-27b.yaml
│       ├── deepseek-r1-8b.yaml
│       ├── deepseek-r1-32b.yaml
│       ├── nemotron-8b.yaml
│       ├── nemotron-49b.yaml
│       ├── qwen3-8b.yaml
│       └── llama3.3-8b.yaml
└── utils/                   # Shared utility modules
    └── __init__.py

tests/
├── conftest.py              # Shared fixtures (mlx_stack_home, etc.)
├── unit/                    # Unit tests — mocked external calls
│   ├── test_hardware.py
│   ├── test_catalog.py
│   ├── test_config.py
│   ├── test_deps.py
│   ├── test_scoring.py
│   ├── test_process.py
│   ├── test_cli_profile.py
│   ├── test_cli_config.py
│   ├── test_cli_recommend.py
│   ├── test_cli_init.py
│   ├── test_cli_models.py
│   ├── test_cli_up.py
│   ├── test_cli_down.py
│   ├── test_cli_status.py
│   ├── test_cli_pull.py
│   ├── test_cli_bench.py
│   ├── test_cli_logs.py
│   ├── test_cli_watch.py
│   ├── test_cli_install.py
│   └── ...                  # Cross-area, robustness, lifecycle tests
├── integration/             # Real-system integration tests
│   ├── test_inference_e2e.py
│   └── test_launchd_e2e.py
└── fixtures/                # Shared test data

Design Principles

  • CLI modules are thin wrappers. Each file in cli/ defines a Click command that parses arguments, calls into core/, and formats output with Rich. No business logic lives in cli/.

  • core/ has all business logic. Every module in core/ is importable and testable independently of the CLI layer. This makes unit testing straightforward — you test core/ functions directly.

  • data/catalog/ holds curated model YAMLs. These are loaded at runtime via importlib.resources so they work correctly whether installed as a package or run from source.

  • All state lives in ~/.mlx-stack/. The data directory (overridable via MLX_STACK_HOME env var) contains profile.json, config.yaml, stacks/, models/, pids/, logs/, and benchmarks/.
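The MLX_STACK_HOME override in the last bullet resolves in a few lines; this is a sketch, and the actual helper in core/paths.py may be named differently:

```python
import os
from pathlib import Path


def mlx_stack_home() -> Path:
    """Data directory: $MLX_STACK_HOME when set, otherwise ~/.mlx-stack."""
    override = os.environ.get("MLX_STACK_HOME")
    return Path(override) if override else Path.home() / ".mlx-stack"
```

This env-var indirection is exactly what the test fixtures exploit: conftest.py sets MLX_STACK_HOME via monkeypatch so every test redirects all state into a temporary directory.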


Key Concepts

Tiers

mlx-stack assigns models to three tiers, each optimized for a different workload:

| Tier | Port | Purpose |
| --- | --- | --- |
| standard | 8000 | Highest-quality model within memory budget |
| fast | 8001 | Fastest model for latency-sensitive tasks |
| longctx | 8002 | Architecturally diverse model (e.g., Mamba2 hybrid) |

Tier assignment is performed by the scoring engine in core/scoring.py. The standard tier gets the model with the highest composite score weighted toward quality, fast gets the highest speed, and longctx prefers architecturally different models when available.
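The quality-versus-speed weighting can be pictured with a toy version of the scoring step. The weights and field names below are illustrative, not the ones in core/scoring.py:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model_id: str
    quality: float  # overall quality score, 0-100
    speed: float    # gen_tps normalized to [0, 1] for this hardware


def composite_score(c: Candidate, quality_weight: float) -> float:
    """Blend normalized quality and speed into one score."""
    return quality_weight * (c.quality / 100.0) + (1.0 - quality_weight) * c.speed


def pick_tier_model(candidates: list[Candidate], quality_weight: float) -> Candidate:
    """Pick the candidate with the best composite score for a tier."""
    return max(candidates, key=lambda c: composite_score(c, quality_weight))
```

A quality-heavy weight (say 0.8) selects the standard-tier model, while a speed-heavy weight (say 0.2) selects the fast-tier model; the real engine adds memory-budget filtering and the longctx architecture preference on top.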

Catalog Entries

Each model in data/catalog/ is a YAML file describing:

  • Identity: id, name, family, params_b, architecture
  • Sources: Per-quantization HuggingFace repos with disk_size_gb
  • Capabilities: tool_calling, thinking, vision, parser names
  • Quality scores: overall, coding, reasoning, instruction_following (0–100)
  • Benchmarks: Per-hardware prompt_tps, gen_tps, memory_gb
  • Tags: Searchable labels like balanced, agent-ready, thinking

Stack Definitions

A stack definition (~/.mlx-stack/stacks/default.yaml) is the output of mlx-stack init. It specifies:

  • schema_version: 1 — for forward compatibility
  • hardware_profile — the profile ID (e.g., m4-pro-48)
  • intent — the optimization strategy used (balanced or agent-fleet)
  • tiers — list of tier entries, each with name, model, quant, source, port, and vllm_flags

A companion ~/.mlx-stack/litellm.yaml is generated alongside it for the LiteLLM proxy.
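A minimal default.yaml might look like the following. The model ID is an illustrative pick from the catalog, the source repo path is hypothetical, and real vllm_flags contents vary by stack:

```yaml
schema_version: 1
hardware_profile: m4-pro-48
intent: balanced
tiers:
  - name: standard
    model: qwen3.5-14b                        # catalog id (illustrative choice)
    quant: int4
    source: mlx-community/Qwen3.5-14B-4bit    # hypothetical repo path
    port: 8000
    vllm_flags: []
```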

Process Management

mlx-stack manages vllm-mlx and LiteLLM as subprocesses:

  • PID files in ~/.mlx-stack/pids/ track each running service (e.g., fast.pid, litellm.pid). Each file contains a single integer PID.
  • Lockfile at ~/.mlx-stack/lock uses fcntl.flock to prevent concurrent up/down operations.
  • Health checks poll each service's HTTP endpoint with exponential backoff (0.5s → 10s, 120s total timeout).

The 5-State Model

Every service reported by mlx-stack status is in one of five states:

| State | Condition |
| --- | --- |
| healthy | PID alive, HTTP 200 within 2 seconds |
| degraded | PID alive, HTTP 200 but response > 2 seconds |
| down | PID alive, no HTTP response within 5 seconds |
| crashed | PID file exists, but the process is dead |
| stopped | No PID file exists |

Adding a New Model

To add a new model to the curated catalog:

Step 1: Create the YAML file

Create a new file in src/mlx_stack/data/catalog/ following the naming convention <family>-<size>.yaml:

id: my-model-8b
name: My Model 8B
family: My Model
params_b: 8.0
architecture: transformer
min_mlx_lm_version: "0.22.0"
sources:
  int4:
    hf_repo: mlx-community/My-Model-8B-4bit
    disk_size_gb: 4.5
  int8:
    hf_repo: mlx-community/My-Model-8B-8bit
    disk_size_gb: 8.5
  bf16:
    hf_repo: MyOrg/My-Model-8B
    disk_size_gb: 16.0
    convert_from: true
capabilities:
  tool_calling: true
  tool_call_parser: hermes
  thinking: false
  reasoning_parser: ""
  vision: false
quality:
  overall: 65
  coding: 60
  reasoning: 58
  instruction_following: 70
benchmarks:
  m4-pro-48:
    prompt_tps: 90.0
    gen_tps: 50.0
    memory_gb: 5.0
  m4-max-128:
    prompt_tps: 130.0
    gen_tps: 70.0
    memory_gb: 5.0
tags:
  - balanced
  - agent-ready

Field Reference

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier (used in CLI commands) |
| name | string | Human-readable display name |
| family | string | Model family for grouping |
| params_b | float | Parameter count in billions |
| architecture | string | Architecture type (transformer, mamba2-hybrid, etc.) |
| min_mlx_lm_version | string | Minimum mlx_lm version required |
| sources | dict | Per-quant sources (int4, int8, bf16) |
| sources.&lt;quant&gt;.hf_repo | string | HuggingFace repository path |
| sources.&lt;quant&gt;.disk_size_gb | float | On-disk size in GB |
| sources.&lt;quant&gt;.convert_from | bool | If true, requires local conversion via mlx_lm |
| capabilities.tool_calling | bool | Supports function/tool calling |
| capabilities.tool_call_parser | string | Parser name (e.g., hermes) or empty string |
| capabilities.thinking | bool | Supports thinking/reasoning mode |
| capabilities.reasoning_parser | string | Parser name (e.g., qwen3) or empty string |
| capabilities.vision | bool | Supports vision/image input |
| quality.overall | int | Overall quality score (0–100) |
| quality.coding | int | Coding quality score (0–100) |
| quality.reasoning | int | Reasoning quality score (0–100) |
| quality.instruction_following | int | Instruction following score (0–100) |
| benchmarks | dict | Per-hardware-profile benchmark data |
| benchmarks.&lt;profile&gt;.prompt_tps | float | Prompt tokens per second |
| benchmarks.&lt;profile&gt;.gen_tps | float | Generation tokens per second |
| benchmarks.&lt;profile&gt;.memory_gb | float | Runtime memory usage in GB |
| tags | list[str] | Searchable labels |

Step 2: Validate the catalog loads

uv run python -c "from mlx_stack.core.catalog import load_catalog; c = load_catalog(); print(f'{len(c)} models loaded')"

Step 3: Add tests

Add test cases in tests/unit/test_catalog.py to verify your new entry loads correctly and is queryable by family, tags, and capabilities.

Step 4: Run the full suite

uv run pytest

Adding a New CLI Command

To add a new command (e.g., mlx-stack export):

Step 1: Create the core module

Create src/mlx_stack/core/export.py with the business logic:

"""Export module for mlx-stack."""

from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExportResult:
    """Result of an export operation."""

    path: str
    format: str


def export_stack(stack_path: str, output_format: str = "json") -> ExportResult:
    """Export a stack definition to the given format.

    Args:
        stack_path: Path to the stack YAML file.
        output_format: Output format ('json' or 'toml').

    Returns:
        ExportResult with the output path and format.
    """
    output_path = stack_path.rsplit(".", 1)[0] + f".{output_format}"
    # Read the stack YAML and convert to the requested format
    return ExportResult(path=output_path, format=output_format)

Step 2: Create the CLI command

Create src/mlx_stack/cli/export.py:

"""CLI command for mlx-stack export."""

from __future__ import annotations

import click
from rich.console import Console

from mlx_stack.core.export import export_stack

console = Console(stderr=True)


@click.command()
@click.argument("output_path", required=False)
@click.option("--format", "output_format", default="json", help="Output format.")
def export(output_path: str | None, output_format: str) -> None:
    """Export the active stack definition."""
    try:
        result = export_stack(output_path or "stack.json", output_format)
        console.print(f"[green]Exported to {result.path}[/green]")
    except Exception as exc:
        console.print(f"[red]Error: {exc}[/red]")
        raise SystemExit(1) from None

Step 3: Register in cli/main.py

Add the import and registration in src/mlx_stack/cli/main.py:

from mlx_stack.cli.export import export as export_command

# In the command registration section:
cli.add_command(export_command, "export")

Also add the command to the command_categories dict inside RichGroup.format_help() so it appears in the right help category.

Step 4: Add tests

Create both core and CLI tests:

  • tests/unit/test_export.py — test the core logic directly
  • tests/unit/test_cli_export.py — test the Click command via CliRunner

Example CLI test:

"""Tests for the export CLI command."""

from click.testing import CliRunner

from mlx_stack.cli.main import cli


def test_export_help():
    """Export command has help text."""
    runner = CliRunner()
    result = runner.invoke(cli, ["export", "--help"])
    assert result.exit_code == 0
    assert "Export" in result.output

Step 5: Run tests and checks

uv run pytest
uv run python -m pyright src/mlx_stack/cli/export.py src/mlx_stack/core/export.py
uv run ruff check src/ tests/

Testing

Strategy

mlx-stack uses a layered testing strategy:

  • Unit tests (primary layer, 80%+ coverage on core/) — All external calls (sysctl, system_profiler, subprocess.Popen, network requests) are mocked. Tests run fast and are deterministic.
  • CLI tests — Use Click's CliRunner to invoke commands in-process, capturing output and exit codes without spawning subprocesses.
  • Integration tests — Real system interaction (hardware detection, model downloads, launchctl). Excluded by default, run explicitly with -m integration.

Test isolation

Every test that touches the filesystem uses tmp_path fixtures to avoid modifying the real ~/.mlx-stack/ directory:

# tests/conftest.py provides these fixtures:

@pytest.fixture()
def mlx_stack_home(tmp_path, monkeypatch):
    """Isolated MLX_STACK_HOME that already exists."""
    home = tmp_path / ".mlx-stack"
    home.mkdir(parents=True, exist_ok=True)
    monkeypatch.setenv("MLX_STACK_HOME", str(home))
    return home

@pytest.fixture()
def clean_mlx_stack_home(tmp_path, monkeypatch):
    """MLX_STACK_HOME that does NOT exist yet (for auto-creation tests)."""
    home = tmp_path / ".mlx-stack"
    monkeypatch.setenv("MLX_STACK_HOME", str(home))
    return home

Use mlx_stack_home for most tests. Use clean_mlx_stack_home when testing directory auto-creation.

Running tests

# Run all unit tests (default — integration tests are excluded)
uv run pytest

# Run a specific test module
uv run pytest tests/unit/test_catalog.py -v

# Run tests matching a pattern
uv run pytest -k "test_scoring" -v

# Run with coverage report
uv run pytest --cov=src/mlx_stack --cov-report=term-missing

# Run integration tests only (requires Apple Silicon hardware)
uv run pytest -m integration -v

# Run everything including integration tests
uv run pytest -m "" -v

Mocking patterns

External system calls are always mocked in unit tests. Common patterns:

# Mocking subprocess calls (e.g., for process management)
def test_start_service(monkeypatch, mlx_stack_home):
    mock_popen = MagicMock()
    mock_popen.pid = 12345
    monkeypatch.setattr("subprocess.Popen", lambda *a, **kw: mock_popen)

# Mocking hardware detection (sysctl, system_profiler)
def test_detect_hardware(monkeypatch):
    monkeypatch.setattr(
        "mlx_stack.core.hardware._run_sysctl",
        lambda key: "Apple M4 Pro"
    )

# CLI tests with CliRunner
def test_profile_command(mlx_stack_home):
    runner = CliRunner()
    result = runner.invoke(cli, ["profile"])
    assert result.exit_code == 0

Code Quality

Linting with Ruff

Ruff handles both linting and formatting:

# Check for lint issues
uv run ruff check src/ tests/

# Auto-fix lint issues
uv run ruff check --fix src/ tests/

# Format code
uv run ruff format src/ tests/

# Check formatting without modifying files
uv run ruff format --check src/ tests/

Configuration in pyproject.toml:

[tool.ruff]
target-version = "py313"
line-length = 100
src = ["src", "tests"]

[tool.ruff.lint]
select = ["E", "F", "I", "W"]   # Errors, pyflakes, isort, warnings

Type checking with Pyright

Pyright is used for static type analysis:

uv run python -m pyright

Configuration in pyproject.toml:

[tool.pyright]
pythonVersion = "3.13"
pythonPlatform = "Darwin"
venvPath = "."
venv = ".venv"
typeCheckingMode = "basic"

The project uses pyright in basic mode, which enforces type annotations on function signatures and catches common type errors without requiring full strict-mode annotations everywhere. Contributors should aim for zero pyright errors — the CI pipeline enforces this. All public functions in core/ should have complete type annotations.

Pre-commit checklist

Before pushing, run all four checks:

uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run python -m pyright
uv run pytest

Configuration System

The configuration system is built around the ConfigKeyDef dataclass in core/config.py.

How ConfigKeyDef works

Each config key is defined as a frozen dataclass with validation metadata:

@dataclass(frozen=True)
class ConfigKeyDef:
    name: str           # Key name (e.g., "default-quant")
    description: str    # Human-readable description
    default: Any        # Default value
    value_type: str     # "string", "int", "bool", or "path"
    validator: str | None = None  # Named validator function

All keys are registered in the CONFIG_KEYS dict. The validate_key() function checks that a key exists, parse_value() handles type coercion and runs the appropriate validator, and mask_value() masks sensitive values (like openrouter-key) in display output.
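A sketch of the type-coercion half of parse_value (the accepted boolean spellings here are assumptions, and the real function also dispatches to the key's named validator):

```python
def parse_value(raw: str, value_type: str):
    """Coerce a raw CLI string into the declared value type.

    Sketch only: validator dispatch and path expansion are omitted.
    """
    if value_type == "int":
        return int(raw)
    if value_type == "bool":
        lowered = raw.lower()
        if lowered in ("true", "1", "yes"):
            return True
        if lowered in ("false", "0", "no"):
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    return raw  # "string" and "path" pass through unchanged
```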

Current config keys

| Key | Type | Default | Validator |
| --- | --- | --- | --- |
| openrouter-key | string | "" | |
| default-quant | string | int4 | Must be int4, int8, or bf16 |
| memory-budget-pct | int | 40 | Must be 1–100 |
| litellm-port | int | 4000 | Must be 1–65535 |
| model-dir | path | ~/.mlx-stack/models | |
| auto-health-check | bool | true | |
| log-max-size-mb | int | 50 | Must be ≥ 1 |
| log-max-files | int | 5 | Must be ≥ 1 |

Adding a new config key

  1. Define the key in the CONFIG_KEYS dict in src/mlx_stack/core/config.py:
CONFIG_KEYS: dict[str, ConfigKeyDef] = {
    # ... existing keys ...
    "my-new-key": ConfigKeyDef(
        name="my-new-key",
        description="Description of what this controls",
        default=42,
        value_type="int",
        validator="positive_int",  # or None, "quant", "memory_pct", "port"
    ),
}
  2. Add a custom validator if needed by creating a _validate_<name>() function and adding a branch in parse_value().

  3. Add tests in tests/unit/test_config.py covering:

    • Default value is returned when unset
    • Valid values are accepted and round-tripped
    • Invalid values raise ConfigValidationError
    • The key appears in config list output
  4. Add CLI tests in tests/unit/test_cli_config.py verifying the key works through config set/get/list.


Process Management

The process lifecycle is managed by core/process.py, orchestrated by core/stack_up.py and core/stack_down.py.

Starting a service

start_service() in core/process.py:

  1. Ensures the pids/ and logs/ directories exist
  2. Opens a log file in append mode ("a") at ~/.mlx-stack/logs/<name>.log
  3. Launches the process via subprocess.Popen with stdout/stderr redirected to the log file
  4. Writes the PID to ~/.mlx-stack/pids/<name>.pid
  5. If the PID file write fails, the process is killed to prevent leaking unmanaged processes
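Those steps condense to roughly the following. This is a simplified sketch under assumed names; the real start_service handles more edge cases:

```python
import subprocess
from pathlib import Path


def start_service(name: str, cmd: list[str], home: Path) -> int:
    """Launch a service, redirect output to its log file, and record the PID."""
    (home / "pids").mkdir(parents=True, exist_ok=True)
    (home / "logs").mkdir(parents=True, exist_ok=True)
    log = open(home / "logs" / f"{name}.log", "a")
    try:
        # stdout and stderr both go to the append-mode log file
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
    finally:
        log.close()  # the child process holds its own duplicate of the fd
    try:
        (home / "pids" / f"{name}.pid").write_text(str(proc.pid))
    except OSError:
        proc.kill()  # don't leak an unmanaged process if bookkeeping fails
        raise
    return proc.pid
```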

Waiting for healthy

wait_for_healthy() polls the service's HTTP endpoint:

  • Initial delay: 0.5 seconds
  • Backoff factor: 2× per retry (0.5s → 1s → 2s → 4s → 8s → 10s cap)
  • Maximum per-retry delay: 10 seconds
  • Total timeout: 120 seconds
  • Checks GET /v1/models and expects HTTP 200
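That schedule can be reproduced with a small helper (parameter names are illustrative):

```python
def backoff_delays(
    initial: float = 0.5,
    factor: float = 2.0,
    cap: float = 10.0,
    total: float = 120.0,
) -> list[float]:
    """Delays between polls: 0.5, 1, 2, 4, 8, then 10s repeats, within 120s total."""
    delays: list[float] = []
    elapsed, delay = 0.0, initial
    while elapsed + delay <= total:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap)
    return delays
```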

Stopping a service

stop_service() implements graceful shutdown:

  1. Sends SIGTERM to the process
  2. Waits up to 10 seconds for the process to exit
  3. If still alive, sends SIGKILL (forced termination)
  4. Verifies the process is actually dead after SIGKILL
  5. Removes the PID file only after confirmed termination
  6. Returns a ShutdownResult indicating whether shutdown was graceful or forced
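The SIGTERM-then-SIGKILL sequence looks roughly like this (a sketch; the real stop_service also handles PID files and returns a richer ShutdownResult):

```python
import os
import signal
import time


def stop_process(pid: int, grace_secs: float = 10.0) -> str:
    """SIGTERM, wait up to grace_secs, then SIGKILL. Returns how it ended."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_secs
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 probes existence without signalling
        except ProcessLookupError:
            return "graceful"
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)
    return "forced"
```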

PID files

Each running service has a PID file at ~/.mlx-stack/pids/<service>.pid:

  • Contains exactly one integer (the process ID)
  • Created by start_service(), removed by stop_service()
  • status reads PID files to determine service state without modifying them
  • Stale PIDs (file exists but process is dead) are detected and cleaned up by up and down
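Stale-PID detection typically uses the signal-0 probe trick. A sketch under assumed names:

```python
import os
from pathlib import Path


def is_stale(pid_file: Path) -> bool:
    """True if the PID file exists but no such process is alive."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no file (stopped) or unreadable content
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering anything
    except ProcessLookupError:
        return True  # file present, process dead: the crashed/stale case
    except PermissionError:
        return False  # process exists but belongs to another user
    return False
```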

Lockfile

The lockfile at ~/.mlx-stack/lock prevents concurrent up/down operations:

from mlx_stack.core.process import acquire_lock

with acquire_lock():
    # Only one process can hold the lock at a time
    start_services()

Uses fcntl.flock with LOCK_EX | LOCK_NB for non-blocking exclusive locking. The lock is automatically released when the file descriptor is closed (including on crash).

The status command and the watchdog's polling loop do not acquire the lock — they are read-only. The watchdog acquires the lock only during restart operations.
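A minimal implementation of that pattern might look like the following. Note the real acquire_lock takes no arguments and defaults to ~/.mlx-stack/lock; the explicit path parameter here is for testability:

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def acquire_lock(lock_path: Path):
    """Hold a non-blocking exclusive flock for the duration of the block."""
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        raise RuntimeError("another mlx-stack up/down is already running")
    try:
        yield
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```

Because flock is tied to the open file description, the kernel drops the lock when the descriptor closes, which is what makes the crash-safety claim above hold.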


Commit Conventions

This project uses Conventional Commits:

<type>: <short description>

Types

| Type | When to use |
| --- | --- |
| feat: | New feature or command |
| fix: | Bug fix |
| test: | Adding or updating tests |
| docs: | Documentation changes |
| chore: | Tooling, CI, dependency updates |
| refactor: | Code restructuring without behavior change |

Examples

feat: implement mlx-stack pull command with HuggingFace download
fix: clamp normalized speed scores to [0,1] range
test: add regression tests for high-bandwidth hardware scoring
docs: add 24/7 ops section to README
chore: add GitHub Actions CI workflow for macOS

Keep the subject line under 72 characters. Use the imperative mood ("add", "fix", "implement" — not "added", "fixed").


PR Process

1. Fork and branch

# Fork the repo on GitHub, then:
git clone https://github.com/<your-username>/mlx-stack.git
cd mlx-stack
git remote add upstream https://github.com/weklund/mlx-stack.git
git checkout -b feat/my-feature

2. Make your changes

Follow the patterns in this guide — thin CLI wrappers, business logic in core/, tests for both layers.

3. Write tests

  • Unit tests for all new core/ functions
  • CLI tests via CliRunner for new commands
  • Use mlx_stack_home fixture for filesystem isolation

4. Run all checks

# Tests
uv run pytest

# Type checking
uv run python -m pyright

# Linting
uv run ruff check src/ tests/

# Formatting
uv run ruff format --check src/ tests/

All four must pass before submitting.

5. Commit with conventional format

git add .
git commit -m "feat: add export command for stack definitions"

6. Push and open a PR

git push origin feat/my-feature

Open a pull request against the main branch on GitHub.

PR checklist

  • All tests pass (uv run pytest)
  • Type checking is clean (uv run python -m pyright)
  • Linting is clean (uv run ruff check src/ tests/)
  • Code is formatted (uv run ruff format --check src/ tests/)
  • Commit messages follow conventional format
  • New code has test coverage
  • No Python tracebacks reach the user for expected error scenarios