A comprehensive guide for contributors working on the mlx-stack codebase.
- Prerequisites
- Getting Started
- Project Architecture
- Key Concepts
- Adding a New Model
- Adding a New CLI Command
- Testing
- Code Quality
- Configuration System
- Process Management
- Commit Conventions
- PR Process
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.13+ | Apple Silicon native build recommended |
| uv | 0.10+ | Fast Python package manager — install guide |
| macOS | Apple Silicon (M1+) | Intel Macs and Linux are not supported |
| Git | 2.30+ | For version control |
mlx-stack is designed exclusively for Apple Silicon Macs. Hardware detection, model serving (vllm-mlx), and performance benchmarks all depend on the Metal GPU and unified memory architecture.
git clone https://github.com/weklund/mlx-stack.git
cd mlx-stack
uv sync
This creates a virtual environment in .venv/ and installs all runtime and dev dependencies (pytest, pyright, ruff, etc.) from pyproject.toml.
uv run mlx-stack --help
You should see the Rich-formatted help output listing all commands grouped by category (Setup & Configuration, Model Management, Stack Lifecycle, Diagnostics).
uv run pytest
All unit tests should pass. Integration tests are excluded by default (see Testing for details).
src/mlx_stack/
├── __init__.py # Package root — exports __version__
├── py.typed # PEP 561 marker for type checking
├── cli/ # CLI layer — thin Click wrappers
│ ├── __init__.py # Exports the cli() entry point
│ ├── main.py # Click group, --help/--version, typo suggestions
│ ├── profile.py # mlx-stack profile
│ ├── config.py # mlx-stack config (set/get/list/reset)
│ ├── recommend.py # mlx-stack recommend
│ ├── init.py # mlx-stack init
│ ├── models.py # mlx-stack models
│ ├── pull.py # mlx-stack pull
│ ├── up.py # mlx-stack up
│ ├── down.py # mlx-stack down
│ ├── status.py # mlx-stack status
│ ├── bench.py # mlx-stack bench
│ ├── logs.py # mlx-stack logs
│ ├── watch.py # mlx-stack watch
│ └── install.py # mlx-stack install / uninstall
├── core/ # Business logic — importable without CLI
│ ├── __init__.py
│ ├── paths.py # Path management (MLX_STACK_HOME, data dirs)
│ ├── hardware.py # Apple Silicon detection, bandwidth lookup
│ ├── catalog.py # YAML catalog loading, validation, querying
│ ├── config.py # ConfigKeyDef system, get/set/validate/persist
│ ├── deps.py # Dependency management (vllm-mlx, litellm)
│ ├── scoring.py # Recommendation engine, tier assignment
│ ├── stack_init.py # Stack definition + LiteLLM config generation
│ ├── litellm_gen.py # LiteLLM YAML config builder
│ ├── models.py # Local model inventory management
│ ├── process.py # PID files, lockfile, health checks, start/stop
│ ├── stack_up.py # Orchestrates starting all services
│ ├── stack_down.py # Orchestrates stopping all services
│ ├── stack_status.py # 5-state health reporting
│ ├── pull.py # Model download with HuggingFace Hub
│ ├── benchmark.py # Benchmarking engine (prompt/gen TPS)
│ ├── log_rotation.py # Copytruncate log rotation
│ ├── log_viewer.py # Log viewing, following, archive reading
│ ├── watchdog.py # Health monitor with auto-restart
│ └── launchd.py # macOS LaunchAgent integration
├── data/ # Static data shipped with the package
│ ├── __init__.py
│ └── catalog/ # 15 model YAML files
│ ├── qwen3.5-0.8b.yaml
│ ├── qwen3.5-3b.yaml
│ ├── qwen3.5-8b.yaml
│ ├── qwen3.5-14b.yaml
│ ├── qwen3.5-32b.yaml
│ ├── qwen3.5-72b.yaml
│ ├── gemma3-4b.yaml
│ ├── gemma3-12b.yaml
│ ├── gemma3-27b.yaml
│ ├── deepseek-r1-8b.yaml
│ ├── deepseek-r1-32b.yaml
│ ├── nemotron-8b.yaml
│ ├── nemotron-49b.yaml
│ ├── qwen3-8b.yaml
│ └── llama3.3-8b.yaml
└── utils/ # Shared utility modules
└── __init__.py
tests/
├── conftest.py # Shared fixtures (mlx_stack_home, etc.)
├── unit/ # Unit tests — mocked external calls
│ ├── test_hardware.py
│ ├── test_catalog.py
│ ├── test_config.py
│ ├── test_deps.py
│ ├── test_scoring.py
│ ├── test_process.py
│ ├── test_cli_profile.py
│ ├── test_cli_config.py
│ ├── test_cli_recommend.py
│ ├── test_cli_init.py
│ ├── test_cli_models.py
│ ├── test_cli_up.py
│ ├── test_cli_down.py
│ ├── test_cli_status.py
│ ├── test_cli_pull.py
│ ├── test_cli_bench.py
│ ├── test_cli_logs.py
│ ├── test_cli_watch.py
│ ├── test_cli_install.py
│ └── ... # Cross-area, robustness, lifecycle tests
├── integration/ # Real-system integration tests
│ ├── test_inference_e2e.py
│ └── test_launchd_e2e.py
└── fixtures/ # Shared test data
- CLI modules are thin wrappers. Each file in cli/ defines a Click command that parses arguments, calls into core/, and formats output with Rich. No business logic lives in cli/.
- core/ has all business logic. Every module in core/ is importable and testable independently of the CLI layer. This makes unit testing straightforward: you test core/ functions directly.
- data/catalog/ holds curated model YAMLs. These are loaded at runtime via importlib.resources so they work correctly whether installed as a package or run from source.
- All state lives in ~/.mlx-stack/. The data directory (overridable via the MLX_STACK_HOME env var) contains profile.json, config.yaml, stacks/, models/, pids/, logs/, and benchmarks/.
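The override rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in core/paths.py: the function names resolve_home and pid_file_path are hypothetical, and only the MLX_STACK_HOME variable and the ~/.mlx-stack/ default come from this guide.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path


def resolve_home(env: Mapping[str, str] | None = None) -> Path:
    """Return the data directory, honoring MLX_STACK_HOME when it is set.

    Hypothetical helper: the real logic lives in core/paths.py and may differ.
    """
    env = os.environ if env is None else env
    override = env.get("MLX_STACK_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".mlx-stack"


def pid_file_path(service: str, env: Mapping[str, str] | None = None) -> Path:
    """Build the PID file path for a service under the resolved data directory."""
    return resolve_home(env) / "pids" / f"{service}.pid"
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment, which mirrors what the mlx_stack_home fixture achieves with monkeypatch.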
mlx-stack assigns models to three tiers, each optimized for a different workload:
| Tier | Port | Purpose |
|---|---|---|
| standard | 8000 | Highest-quality model within memory budget |
| fast | 8001 | Fastest model for latency-sensitive tasks |
| longctx | 8002 | Architecturally diverse model (e.g., Mamba2 hybrid) |
Tier assignment is performed by the scoring engine in core/scoring.py. The standard tier gets the model with the highest composite score weighted toward quality, fast gets the highest speed, and longctx prefers architecturally different models when available.
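A rough sketch of that selection logic follows. It is an illustration only: the Candidate class, the assign_tiers function, and the 0.7/0.3 weighting are assumptions made for this example and are not taken from core/scoring.py.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical, simplified view of a catalog entry that fits in memory."""

    id: str
    quality: float  # overall quality score, 0-100
    gen_tps: float  # generation tokens per second on this hardware
    architecture: str


def assign_tiers(candidates: list[Candidate]) -> dict[str, str]:
    """Pick one model id per tier (assumed weights, not the real scoring engine)."""
    if not candidates:
        return {}
    max_tps = max(c.gen_tps for c in candidates) or 1.0

    def composite(c: Candidate) -> float:
        # Quality-weighted composite: 70% quality, 30% normalized speed (assumed).
        return 0.7 * (c.quality / 100) + 0.3 * (c.gen_tps / max_tps)

    standard = max(candidates, key=composite)
    fast = max(candidates, key=lambda c: c.gen_tps)
    # Prefer a different architecture for longctx; fall back to all candidates.
    diverse = [c for c in candidates if c.architecture != standard.architecture]
    longctx = max(diverse or candidates, key=composite)
    return {"standard": standard.id, "fast": fast.id, "longctx": longctx.id}
```

The fallback on the longctx line reflects the "when available" wording above: if every candidate shares one architecture, the tier still gets a model.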
Each model in data/catalog/ is a YAML file describing:
- Identity: id, name, family, params_b, architecture
- Sources: Per-quantization HuggingFace repos with disk_size_gb
- Capabilities: tool_calling, thinking, vision, parser names
- Quality scores: overall, coding, reasoning, instruction_following (0–100)
- Benchmarks: Per-hardware prompt_tps, gen_tps, memory_gb
- Tags: Searchable labels like balanced, agent-ready, thinking
A stack definition (~/.mlx-stack/stacks/default.yaml) is the output of mlx-stack init. It specifies:
- schema_version: 1, for forward compatibility
- hardware_profile: the profile ID (e.g., m4-pro-48)
- intent: the optimization strategy used (balanced or agent-fleet)
- tiers: list of tier entries, each with name, model, quant, source, port, and vllm_flags
A companion ~/.mlx-stack/litellm.yaml is generated alongside it for the LiteLLM proxy.
mlx-stack manages vllm-mlx and LiteLLM as subprocesses:
- PID files in ~/.mlx-stack/pids/ track each running service (e.g., fast.pid, litellm.pid). Each file contains a single integer PID.
- Lockfile at ~/.mlx-stack/lock uses fcntl.flock to prevent concurrent up/down operations.
- Health checks poll each service's HTTP endpoint with exponential backoff (0.5s → 10s, 120s total timeout).
Every service reported by mlx-stack status is in one of five states:
| State | Condition |
|---|---|
| healthy | PID alive, HTTP 200 within 2 seconds |
| degraded | PID alive, HTTP 200 but response > 2 seconds |
| down | PID alive, no HTTP response within 5 seconds |
| crashed | PID file exists, but the process is dead |
| stopped | No PID file exists |
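The table can be read as a small decision function. The sketch below is an assumption-laden illustration: classify_service and its parameters are hypothetical names, and treating a non-200 response the same as no response is a choice made here, since the table does not cover that case.

```python
from __future__ import annotations


def classify_service(
    pid_file_exists: bool,
    pid_alive: bool,
    http_ok: bool,
    response_seconds: float | None,
) -> str:
    """Map observed facts about a service onto one of the five states.

    http_ok means an HTTP 200 arrived before the 5-second probe timeout;
    response_seconds is how long that response took (None if it never arrived).
    """
    if not pid_file_exists:
        return "stopped"
    if not pid_alive:
        return "crashed"
    if not http_ok or response_seconds is None:
        return "down"
    # A 200 within 2 seconds is healthy; a slower 200 is degraded.
    return "healthy" if response_seconds <= 2.0 else "degraded"
```

The checks run from cheapest to most expensive: the PID file and process checks need no network call, so a stopped or crashed service is reported without probing HTTP at all.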
To add a new model to the curated catalog:
Create a new file in src/mlx_stack/data/catalog/ following the naming convention <family>-<size>.yaml:
id: my-model-8b
name: My Model 8B
family: My Model
params_b: 8.0
architecture: transformer
min_mlx_lm_version: "0.22.0"
sources:
  int4:
    hf_repo: mlx-community/My-Model-8B-4bit
    disk_size_gb: 4.5
  int8:
    hf_repo: mlx-community/My-Model-8B-8bit
    disk_size_gb: 8.5
  bf16:
    hf_repo: MyOrg/My-Model-8B
    disk_size_gb: 16.0
    convert_from: true
capabilities:
  tool_calling: true
  tool_call_parser: hermes
  thinking: false
  reasoning_parser: ""
  vision: false
quality:
  overall: 65
  coding: 60
  reasoning: 58
  instruction_following: 70
benchmarks:
  m4-pro-48:
    prompt_tps: 90.0
    gen_tps: 50.0
    memory_gb: 5.0
  m4-max-128:
    prompt_tps: 130.0
    gen_tps: 70.0
    memory_gb: 5.0
tags:
- balanced
- agent-ready

| Field | Type | Required | Description |
|---|---|---|---|
| id | string | ✓ | Unique identifier (used in CLI commands) |
| name | string | ✓ | Human-readable display name |
| family | string | ✓ | Model family for grouping |
| params_b | float | ✓ | Parameter count in billions |
| architecture | string | ✓ | Architecture type (transformer, mamba2-hybrid, etc.) |
| min_mlx_lm_version | string | ✓ | Minimum mlx_lm version required |
| sources | dict | ✓ | Per-quant sources (int4, int8, bf16) |
| sources.<quant>.hf_repo | string | ✓ | HuggingFace repository path |
| sources.<quant>.disk_size_gb | float | ✓ | On-disk size in GB |
| sources.<quant>.convert_from | bool | optional | If true, requires local conversion via mlx_lm |
| capabilities.tool_calling | bool | ✓ | Supports function/tool calling |
| capabilities.tool_call_parser | string | ✓ | Parser name (e.g., hermes) or empty string |
| capabilities.thinking | bool | ✓ | Supports thinking/reasoning mode |
| capabilities.reasoning_parser | string | ✓ | Parser name (e.g., qwen3) or empty string |
| capabilities.vision | bool | ✓ | Supports vision/image input |
| quality.overall | int | ✓ | Overall quality score (0–100) |
| quality.coding | int | ✓ | Coding quality score (0–100) |
| quality.reasoning | int | ✓ | Reasoning quality score (0–100) |
| quality.instruction_following | int | ✓ | Instruction following score (0–100) |
| benchmarks | dict | ✓ | Per-hardware-profile benchmark data |
| benchmarks.<profile>.prompt_tps | float | ✓ | Prompt tokens per second |
| benchmarks.<profile>.gen_tps | float | ✓ | Generation tokens per second |
| benchmarks.<profile>.memory_gb | float | ✓ | Runtime memory usage in GB |
| tags | list[str] | ✓ | Searchable labels |
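Before running the real loader, it can help to see what the field rules amount to. The following is a hand-rolled sketch over an already-parsed dict; find_problems and the two constant tuples are hypothetical names, and the actual validation in core/catalog.py may be stricter or structured differently.

```python
from __future__ import annotations

# Top-level keys implied by the field reference (assumed to all be mandatory).
REQUIRED_TOP_LEVEL = (
    "id", "name", "family", "params_b", "architecture", "min_mlx_lm_version",
    "sources", "capabilities", "quality", "benchmarks", "tags",
)
QUALITY_KEYS = ("overall", "coding", "reasoning", "instruction_following")


def find_problems(entry: dict) -> list[str]:
    """Return human-readable problems for one parsed catalog entry."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in entry]

    # Quality scores must be integers in the 0-100 range.
    for key in QUALITY_KEYS:
        score = entry.get("quality", {}).get(key)
        if not isinstance(score, int) or not 0 <= score <= 100:
            problems.append(f"quality.{key} must be an int in 0-100")

    # Every quantization source needs a repo and an on-disk size.
    for quant, source in entry.get("sources", {}).items():
        for field in ("hf_repo", "disk_size_gb"):
            if field not in source:
                problems.append(f"sources.{quant}.{field} is required")
    return problems
```

Returning a list of problems rather than raising on the first one lets a contributor fix a new YAML file in a single pass.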
uv run python -c "from mlx_stack.core.catalog import load_catalog; c = load_catalog(); print(f'{len(c)} models loaded')"
Add test cases in tests/unit/test_catalog.py to verify your new entry loads correctly and is queryable by family, tags, and capabilities.
uv run pytest
To add a new command (e.g., mlx-stack export):
Create src/mlx_stack/core/export.py with the business logic:
"""Export module for mlx-stack."""
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExportResult:
    """Result of an export operation."""

    path: str
    format: str


def export_stack(stack_path: str, output_format: str = "json") -> ExportResult:
    """Export a stack definition to the given format.

    Args:
        stack_path: Path to the stack YAML file.
        output_format: Output format ('json' or 'toml').

    Returns:
        ExportResult with the output path and format.
    """
    output_path = stack_path.rsplit(".", 1)[0] + f".{output_format}"
    # Read the stack YAML and convert to the requested format
    return ExportResult(path=output_path, format=output_format)

Create src/mlx_stack/cli/export.py:
"""CLI command for mlx-stack export."""
from __future__ import annotations

import click
from rich.console import Console

from mlx_stack.core.export import export_stack

console = Console(stderr=True)


@click.command()
@click.argument("output_path", required=False)
@click.option("--format", "output_format", default="json", help="Output format.")
def export(output_path: str | None, output_format: str) -> None:
    """Export the active stack definition."""
    try:
        result = export_stack(output_path or "stack.json", output_format)
        console.print(f"[green]Exported to {result.path}[/green]")
    except Exception as exc:
        console.print(f"[red]Error: {exc}[/red]")
        raise SystemExit(1) from None

Add the import and registration in src/mlx_stack/cli/main.py:
from mlx_stack.cli.export import export as export_command
# In the command registration section:
cli.add_command(export_command, "export")
Also add the command to the command_categories dict inside RichGroup.format_help() so it appears in the right help category.
Create both core and CLI tests:
- tests/unit/test_export.py: test the core logic directly
- tests/unit/test_cli_export.py: test the Click command via CliRunner
Example CLI test:
"""Tests for the export CLI command."""
from click.testing import CliRunner

from mlx_stack.cli.main import cli


def test_export_help():
    """Export command has help text."""
    runner = CliRunner()
    result = runner.invoke(cli, ["export", "--help"])
    assert result.exit_code == 0
    assert "Export" in result.output

uv run pytest
uv run python -m pyright src/mlx_stack/cli/export.py src/mlx_stack/core/export.py
uv run ruff check src/ tests/
mlx-stack uses a layered testing strategy:
- Unit tests (primary layer, 80%+ coverage on core/): All external calls (sysctl, system_profiler, subprocess.Popen, network requests) are mocked. Tests run fast and are deterministic.
- CLI tests: Use Click's CliRunner to invoke commands in-process, capturing output and exit codes without spawning subprocesses.
- Integration tests: Real system interaction (hardware detection, model downloads, launchctl). Excluded by default; run explicitly with -m integration.
Every test that touches the filesystem uses tmp_path fixtures to avoid modifying the real ~/.mlx-stack/ directory:
# tests/conftest.py provides these fixtures:
import pytest


@pytest.fixture()
def mlx_stack_home(tmp_path, monkeypatch):
    """Isolated MLX_STACK_HOME that already exists."""
    home = tmp_path / ".mlx-stack"
    home.mkdir(parents=True, exist_ok=True)
    monkeypatch.setenv("MLX_STACK_HOME", str(home))
    return home


@pytest.fixture()
def clean_mlx_stack_home(tmp_path, monkeypatch):
    """MLX_STACK_HOME that does NOT exist yet (for auto-creation tests)."""
    home = tmp_path / ".mlx-stack"
    monkeypatch.setenv("MLX_STACK_HOME", str(home))
    return home

Use mlx_stack_home for most tests. Use clean_mlx_stack_home when testing directory auto-creation.
# Run all unit tests (default — integration tests are excluded)
uv run pytest
# Run a specific test module
uv run pytest tests/unit/test_catalog.py -v
# Run tests matching a pattern
uv run pytest -k "test_scoring" -v
# Run with coverage report
uv run pytest --cov=src/mlx_stack --cov-report=term-missing
# Run integration tests only (requires Apple Silicon hardware)
uv run pytest -m integration -v
# Run everything including integration tests
uv run pytest -m "" -v
External system calls are always mocked in unit tests. Common patterns:
from unittest.mock import MagicMock

from click.testing import CliRunner

from mlx_stack.cli.main import cli


# Mocking subprocess calls (e.g., for process management)
def test_start_service(monkeypatch, mlx_stack_home):
    mock_popen = MagicMock()
    mock_popen.pid = 12345
    monkeypatch.setattr("subprocess.Popen", lambda *a, **kw: mock_popen)


# Mocking hardware detection (sysctl, system_profiler)
def test_detect_hardware(monkeypatch):
    monkeypatch.setattr(
        "mlx_stack.core.hardware._run_sysctl",
        lambda key: "Apple M4 Pro",
    )


# CLI tests with CliRunner
def test_profile_command(mlx_stack_home):
    runner = CliRunner()
    result = runner.invoke(cli, ["profile"])
    assert result.exit_code == 0

Ruff handles both linting and formatting:
# Check for lint issues
uv run ruff check src/ tests/
# Auto-fix lint issues
uv run ruff check --fix src/ tests/
# Format code
uv run ruff format src/ tests/
# Check formatting without modifying files
uv run ruff format --check src/ tests/
Configuration in pyproject.toml:
[tool.ruff]
target-version = "py313"
line-length = 100
src = ["src", "tests"]
[tool.ruff.lint]
select = ["E", "F", "I", "W"] # Errors, pyflakes, isort, warnings
Pyright is used for static type analysis:
uv run python -m pyright
Configuration in pyproject.toml:
[tool.pyright]
pythonVersion = "3.13"
pythonPlatform = "Darwin"
venvPath = "."
venv = ".venv"
typeCheckingMode = "basic"
The project uses pyright in basic mode, which catches common type errors without requiring full strict-mode annotations everywhere. Contributors should aim for zero pyright errors, since the CI pipeline enforces this. All public functions in core/ should have complete type annotations.
Before pushing, run all four checks:
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
uv run python -m pyright
uv run pytest
The configuration system is built around the ConfigKeyDef dataclass in core/config.py.
Each config key is defined as a frozen dataclass with validation metadata:
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ConfigKeyDef:
    name: str  # Key name (e.g., "default-quant")
    description: str  # Human-readable description
    default: Any  # Default value
    value_type: str  # "string", "int", "bool", or "path"
    validator: str | None = None  # Named validator function

All keys are registered in the CONFIG_KEYS dict. The validate_key() function checks that a key exists, parse_value() handles type coercion and runs the appropriate validator, and mask_value() masks sensitive values (like openrouter-key) in display output.
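To make the coercion and masking steps concrete, here is a small standalone sketch. The names coerce, check_port, and mask_secret are hypothetical stand-ins chosen to avoid implying they match the real parse_value() and mask_value() signatures; the accepted boolean spellings and the "show the last four characters" rule are likewise assumptions.

```python
from __future__ import annotations

from pathlib import Path


def coerce(value_type: str, raw: str) -> object:
    """Convert a raw CLI string into the Python type named by value_type."""
    if value_type == "int":
        return int(raw)  # raises ValueError on non-numeric input
    if value_type == "bool":
        lowered = raw.strip().lower()
        if lowered in ("true", "1", "yes"):
            return True
        if lowered in ("false", "0", "no"):
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    if value_type == "path":
        return str(Path(raw).expanduser())
    return raw  # "string" values pass through unchanged


def check_port(value: int) -> int:
    """Example validator: ports must fall in 1-65535."""
    if not 1 <= value <= 65535:
        raise ValueError(f"port out of range: {value}")
    return value


def mask_secret(value: str, visible: int = 4) -> str:
    """Hide all but the last few characters of a sensitive value."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

For example, check_port(coerce("int", "4000")) returns 4000, while check_port(coerce("int", "70000")) raises ValueError.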
| Key | Type | Default | Validator |
|---|---|---|---|
| openrouter-key | string | "" | None |
| default-quant | string | int4 | Must be int4, int8, or bf16 |
| memory-budget-pct | int | 40 | Must be 1–100 |
| litellm-port | int | 4000 | Must be 1–65535 |
| model-dir | path | ~/.mlx-stack/models | None |
| auto-health-check | bool | true | None |
| log-max-size-mb | int | 50 | Must be ≥ 1 |
| log-max-files | int | 5 | Must be ≥ 1 |
- Define the key in the CONFIG_KEYS dict in src/mlx_stack/core/config.py:
CONFIG_KEYS: dict[str, ConfigKeyDef] = {
    # ... existing keys ...
    "my-new-key": ConfigKeyDef(
        name="my-new-key",
        description="Description of what this controls",
        default=42,
        value_type="int",
        validator="positive_int",  # or None, "quant", "memory_pct", "port"
    ),
}
- Add a custom validator if needed by creating a _validate_<name>() function and adding a branch in parse_value().
- Add tests in tests/unit/test_config.py covering:
  - Default value is returned when unset
  - Valid values are accepted and round-tripped
  - Invalid values raise ConfigValidationError
  - The key appears in config list output
- Add CLI tests in tests/unit/test_cli_config.py verifying the key works through config set/get/list.
The process lifecycle is managed by core/process.py, orchestrated by core/stack_up.py and core/stack_down.py.
start_service() in core/process.py:
- Ensures the pids/ and logs/ directories exist
- Opens a log file in append mode ("a") at ~/.mlx-stack/logs/<name>.log
- Launches the process via subprocess.Popen with stdout/stderr redirected to the log file
- Writes the PID to ~/.mlx-stack/pids/<name>.pid
- If the PID file write fails, the process is killed to prevent leaking unmanaged processes
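The steps above can be sketched roughly as follows. This is not the body of start_service(): the function name start_service_sketch, its parameters, and the choice to return the Popen object are assumptions made for illustration.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def start_service_sketch(name: str, command: list[str], home: Path) -> subprocess.Popen:
    """Launch a service, log its output, and record its PID (illustrative only)."""
    pids_dir = home / "pids"
    logs_dir = home / "logs"
    pids_dir.mkdir(parents=True, exist_ok=True)
    logs_dir.mkdir(parents=True, exist_ok=True)

    # Append mode keeps earlier log output across restarts.
    with open(logs_dir / f"{name}.log", "a") as log_file:
        proc = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
    # The child keeps its own copy of the log descriptor after we close ours.

    try:
        (pids_dir / f"{name}.pid").write_text(f"{proc.pid}\n")
    except OSError:
        # Without a PID file nothing could stop this process later, so kill it now.
        proc.kill()
        proc.wait()
        raise
    return proc
```

Killing the child when the PID file cannot be written is what keeps the pids/ directory an accurate inventory: every process the tool started is either tracked or gone.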
wait_for_healthy() polls the service's HTTP endpoint:
- Initial delay: 0.5 seconds
- Backoff factor: 2× per retry (0.5s → 1s → 2s → 4s → 8s → 10s cap)
- Maximum per-retry delay: 10 seconds
- Total timeout: 120 seconds
- Checks GET /v1/models and expects HTTP 200
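The schedule described by these parameters can be reproduced with a short generator. The names backoff_delays and wait_until are hypothetical, and the probe is passed in as a callable so that the sketch stays independent of any HTTP client; the real wait_for_healthy() may account for time differently (for example, by also counting time spent inside each request).

```python
from __future__ import annotations

import time
from collections.abc import Callable, Iterator


def backoff_delays(
    initial: float = 0.5,
    factor: float = 2.0,
    cap: float = 10.0,
    total_timeout: float = 120.0,
) -> Iterator[float]:
    """Yield sleep durations that double up to a cap and sum to the total timeout."""
    elapsed = 0.0
    delay = initial
    while elapsed < total_timeout:
        # Never sleep past the overall deadline.
        step = min(delay, cap, total_timeout - elapsed)
        yield step
        elapsed += step
        delay = min(delay * factor, cap)


def wait_until(
    probe: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() between backoff sleeps until it succeeds or time runs out."""
    for delay in backoff_delays():
        if probe():
            return True
        sleep(delay)
    return probe()  # one last attempt at the deadline
```

With the defaults, the first delays are 0.5, 1, 2, 4, 8, and 10 seconds, after which every delay stays at the 10-second cap until the 120-second budget is used up.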
stop_service() implements graceful shutdown:
- Sends SIGTERM to the process
- Waits up to 10 seconds for the process to exit
- If still alive, sends SIGKILL (forced termination)
- Verifies the process is actually dead after SIGKILL
- Removes the PID file only after confirmed termination
- Returns a ShutdownResult indicating whether shutdown was graceful or forced
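A compact sketch of this SIGTERM-then-SIGKILL sequence is shown below. ShutdownOutcome, stop_pid, and pid_alive are hypothetical names (the real function returns a ShutdownResult whose fields are not documented here), and the injectable is_alive parameter is an addition made so the sketch can be exercised in tests.

```python
from __future__ import annotations

import os
import signal
import time
from collections.abc import Callable
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ShutdownOutcome:
    stopped: bool  # True once the process is confirmed dead
    forced: bool   # True if SIGKILL was needed


def pid_alive(pid: int) -> bool:
    """Signal 0 checks for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def stop_pid(
    pid: int,
    pid_file: Path,
    grace_seconds: float = 10.0,
    is_alive: Callable[[int], bool] = pid_alive,
) -> ShutdownOutcome:
    def wait_for_exit(timeout: float) -> bool:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if not is_alive(pid):
                return True
            time.sleep(0.1)
        return not is_alive(pid)

    def send(sig: int) -> None:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # already gone

    send(signal.SIGTERM)
    forced = False
    if not wait_for_exit(grace_seconds):
        forced = True
        send(signal.SIGKILL)
        if not wait_for_exit(grace_seconds):
            return ShutdownOutcome(stopped=False, forced=True)  # keep the PID file
    pid_file.unlink(missing_ok=True)  # only after confirmed termination
    return ShutdownOutcome(stopped=True, forced=forced)
```

Leaving the PID file in place when termination cannot be confirmed means a later down or status run can still find the process.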
Each running service has a PID file at ~/.mlx-stack/pids/<service>.pid:
- Contains exactly one integer (the process ID)
- Created by start_service(), removed by stop_service()
- status reads PID files to determine service state without modifying them
- Stale PIDs (file exists but process is dead) are detected and cleaned up by up and down
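Stale-PID detection can be illustrated with two small helpers. read_pid, process_exists, and is_stale are hypothetical names, and treating an unparsable file as "no usable PID" is an assumption; the real code in core/process.py may handle corrupt files differently.

```python
from __future__ import annotations

import os
from pathlib import Path


def read_pid(pid_file: Path) -> int | None:
    """Return the PID stored in the file, or None if missing or malformed."""
    try:
        text = pid_file.read_text().strip()
    except FileNotFoundError:
        return None
    if not text.isdigit() or int(text) <= 0:
        return None
    return int(text)


def process_exists(pid: int) -> bool:
    """Probe with signal 0, which checks existence without affecting the process."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale(pid_file: Path) -> bool:
    """A PID file is stale when it names a process that no longer exists."""
    pid = read_pid(pid_file)
    return pid is not None and not process_exists(pid)
```

Note that a PID check alone cannot tell whether the operating system has since reused the number for an unrelated process, which is one reason status also relies on the HTTP health probe.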
The lockfile at ~/.mlx-stack/lock prevents concurrent up/down operations:
from mlx_stack.core.process import acquire_lock

with acquire_lock():
    # Only one process can hold the lock at a time
    start_services()

Uses fcntl.flock with LOCK_EX | LOCK_NB for non-blocking exclusive locking. The lock is automatically released when the file descriptor is closed (including on crash).
The status command and the watchdog's polling loop do not acquire the lock — they are read-only. The watchdog acquires the lock only during restart operations.
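For readers unfamiliar with flock, a context manager along these lines shows the mechanism. It is a sketch, not the implementation of acquire_lock(): exclusive_lock and LockHeldError are hypothetical names, and the real function takes no path argument.

```python
from __future__ import annotations

import fcntl
import os
from collections.abc import Iterator
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when another holder already owns the lock."""


@contextmanager
def exclusive_lock(lock_path: str) -> Iterator[None]:
    """Hold a non-blocking exclusive flock on lock_path for the with-block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise LockHeldError(f"{lock_path} is locked by another operation") from exc
        yield
    finally:
        # Closing the descriptor releases the lock, even on error paths.
        os.close(fd)
```

Because the kernel drops the lock when the descriptor closes, a crashed up or down run cannot leave the lock permanently held, unlike a scheme based on the mere existence of a lock file.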
This project uses Conventional Commits:
<type>: <short description>
| Type | When to use |
|---|---|
| feat: | New feature or command |
| fix: | Bug fix |
| test: | Adding or updating tests |
| docs: | Documentation changes |
| chore: | Tooling, CI, dependency updates |
| refactor: | Code restructuring without behavior change |
feat: implement mlx-stack pull command with HuggingFace download
fix: clamp normalized speed scores to [0,1] range
test: add regression tests for high-bandwidth hardware scoring
docs: add 24/7 ops section to README
chore: add GitHub Actions CI workflow for macOS
Keep the subject line under 72 characters. Use the imperative mood ("add", "fix", "implement" — not "added", "fixed").
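These rules are simple enough to check mechanically, for instance in a local commit-msg hook. The sketch below is hypothetical and not part of the repository; it checks only the type prefix and the length limit, since imperative mood cannot be verified reliably with a pattern, and it assumes scoped types such as feat(cli): are not used because the format above does not show them.

```python
from __future__ import annotations

import re

# The six types listed in the table above.
SUBJECT_PATTERN = re.compile(r"^(feat|fix|test|docs|chore|refactor): \S.*$")
MAX_SUBJECT_LENGTH = 72  # the subject must be shorter than this


def check_subject(subject: str) -> list[str]:
    """Return a list of problems with a commit subject line (empty if it is fine)."""
    problems = []
    if not SUBJECT_PATTERN.match(subject):
        problems.append("subject must look like '<type>: <short description>'")
    if len(subject) >= MAX_SUBJECT_LENGTH:
        problems.append(f"subject must be under {MAX_SUBJECT_LENGTH} characters")
    return problems
```

For example, check_subject("fix: clamp normalized speed scores to [0,1] range") returns an empty list, while check_subject("Added stuff") reports the missing type prefix.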
# Fork the repo on GitHub, then:
git clone https://github.com/<your-username>/mlx-stack.git
cd mlx-stack
git remote add upstream https://github.com/weklund/mlx-stack.git
git checkout -b feat/my-feature
Follow the patterns in this guide: thin CLI wrappers, business logic in core/, tests for both layers.
- Unit tests for all new core/ functions
- CLI tests via CliRunner for new commands
- Use the mlx_stack_home fixture for filesystem isolation
# Tests
uv run pytest
# Type checking
uv run python -m pyright
# Linting
uv run ruff check src/ tests/
# Formatting
uv run ruff format --check src/ tests/
All four must pass before submitting.
git add .
git commit -m "feat: add export command for stack definitions"
git push origin feat/my-feature
Open a pull request against the main branch on GitHub.
- All tests pass (uv run pytest)
- Type checking is clean (uv run python -m pyright)
- Linting is clean (uv run ruff check src/ tests/)
- Code is formatted (uv run ruff format --check src/ tests/)
- Commit messages follow conventional format
- New code has test coverage
- No Python tracebacks reach the user for expected error scenarios