where Python meets Metal, and layouts become algebra
📖 Documentation · 📦 PyPI · ⭐ GitHub
In 1945, an Enigma machine sank to the floor of the Baltic Sea. For decades it sat there, its rotors locked, its wiring intact, waiting. When divers finally pulled it from the silt, the mechanism still worked. The genius was never in the shell. It was in the rotors, the wiring, the algebra of permutations hidden inside.
Enigma DSL is built on the same principle. Inspired by NVIDIA's CuTe DSL, which brought layout algebra and tiling calculus to CUDA, Enigma brings the same mathematical framework to Apple Metal. Where CuTe targets tensor cores and warps on NVIDIA GPUs, Enigma targets simdgroups and threadgroups on Apple Silicon. The layout algebra is the same. The target is different. You write a Python function. Underneath, the algebra computes how threads map to memory, how tiles partition a tensor, how values flow through a simdgroup. The Python traces into an IR. The IR emits Metal C++. The Metal compiles to GPU machine code. Your function runs on Apple Silicon at hardware bandwidth limits. The surface is clean. The machinery is exact.
Enigma is a Python DSL for writing Apple Metal GPU compute kernels. You write
a Python function decorated with @enigma.kernel; Enigma traces it into MLIR,
emits Metal Shading Language, compiles through xcrun metal, and dispatches
on the GPU through a Swift runtime. The DSL surface is small. The generated
kernels run at hand-written-Metal bandwidth.
Inspired by NVIDIA's CuTe DSL — Enigma brings the same layout-algebra and tiling calculus to Apple Silicon. CuTe targets warps and tensor cores; Enigma targets simdgroups and threadgroups.
Python @enigma.kernel → traced IR → MLIR (enigma dialect) → MSL
↓
xcrun metal → AIR
↓
xcrun metallib → .metallib
↓
Swift runtime (ctypes) → GPU
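The two xcrun stages in the diagram can be reproduced by hand when you want to inspect an intermediate artifact. The sketch below is a minimal illustration and not Enigma's internal code: `metal_build_commands` and `build` are hypothetical helper names, and the flags shown are the standard offline Metal toolchain invocation, assumed here to match what the pipeline does.

```python
from pathlib import Path
import subprocess


def metal_build_commands(msl_path: str) -> list[list[str]]:
    """Return the argv lists for MSL -> AIR -> .metallib (hypothetical helper)."""
    src = Path(msl_path)
    air = src.with_suffix(".air")
    lib = src.with_suffix(".metallib")
    return [
        # Stage 1: compile Metal Shading Language to AIR.
        ["xcrun", "-sdk", "macosx", "metal", "-c", str(src), "-o", str(air)],
        # Stage 2: package the AIR file into a loadable .metallib.
        ["xcrun", "-sdk", "macosx", "metallib", str(air), "-o", str(lib)],
    ]


def build(msl_path: str) -> None:
    """Run both stages; requires the Xcode command-line tools on macOS."""
    for argv in metal_build_commands(msl_path):
        subprocess.run(argv, check=True)
```

Calling `build("add.metal")` on a Mac with the Metal toolchain installed should leave `add.air` and `add.metallib` next to the source file.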
For tiled kernels, @enigma.jit runs CuTe-style layout algebra host-side
before launching: composition, complement, coalesce, zipped divide, and the
make_layout_tv constructor that maps thread × value indices to tile
coordinates with correct coalescing order. The entire vectorisation strategy
is decided at trace time; only the resulting offsets and memory transactions
reach the GPU.
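To make the layout vocabulary concrete, here is a minimal pure-Python sketch of two of the ideas named above: evaluating a (shape, stride) layout and coalescing it. It does not use Enigma's API; `layout_offset` and `coalesce` are hypothetical names, and the sketch assumes flat (non-nested) layouts with CuTe's leftmost-mode-fastest index convention.

```python
def layout_offset(shape, stride, index):
    """Map a linear index to a memory offset (leftmost mode varies fastest)."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step  # coordinate in this mode times its stride
        index //= extent
    return offset


def coalesce(shape, stride):
    """Merge adjacent modes that together form one contiguous run."""
    out_shape, out_stride = [], []
    for extent, step in zip(shape, stride):
        if extent == 1:
            continue  # size-1 modes never contribute to the offset
        if out_shape and step == out_shape[-1] * out_stride[-1]:
            out_shape[-1] *= extent  # this mode continues the previous one
        else:
            out_shape.append(extent)
            out_stride.append(step)
    if not out_shape:
        return (1,), (0,)
    return tuple(out_shape), tuple(out_stride)


# A 4x8 column-major tile is one contiguous run of 32 elements.
print(coalesce((4, 8), (1, 4)))
# A row-major 4x8 tile cannot be merged under this convention.
print(coalesce((4, 8), (8, 1)))
# Thread x value layout: 8 threads, each owning 4 consecutive values.
# Thread 2, value 3 has linear index 2 + 8 * 3 and lands at offset 2 * 4 + 3.
print(layout_offset((8, 4), (4, 1), 2 + 8 * 3))
```

The last line is the thread × value idea in miniature: because each thread's four values are adjacent in memory, a code generator can turn them into one vector load instead of four scalar loads.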
Requirements: Apple Silicon Mac (M1 through M5), macOS 14 / 15 (or any newer release), Python 3.11 / 3.12 / 3.13.
pip install enigma-dsl

That single command pulls a self-contained wheel that bundles the Python
DSL and the native MLIR dialect (libLLVM, libMLIRPythonCAPI, the
Enigma dialect .so) — no separate steps, no LLVM toolchain on your
machine. pip picks the right wheel for your Python version and macOS
version automatically.
The release ships six wheels: 3 Python versions × 2 macOS deployment targets (14.0 and 15.0). macOS 14-tagged wheels run on macOS 14, 15, 26 and every future version; macOS 15-tagged wheels are picked first on macOS 15+ hosts because pip prefers the most specific match.
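The wheel matrix and the selection rule can be sketched in a few lines. This is a simplified model of pip's platform-tag matching, not pip's actual algorithm; `pick_wheel` is a hypothetical helper, and the filename pattern follows the one listed in the release notes.

```python
from itertools import product

VERSION = "0.1.1"
PYTHON_TAGS = ["cp311", "cp312", "cp313"]
MACOS_MAJORS = [14, 15]  # deployment targets 14.0 and 15.0


def wheel_name(py_tag: str, macos_major: int) -> str:
    return f"enigma_dsl-{VERSION}-{py_tag}-{py_tag}-macosx_{macos_major}_0_arm64.whl"


# 3 Python versions x 2 deployment targets = 6 wheels.
wheels = [wheel_name(py, mac) for py, mac in product(PYTHON_TAGS, MACOS_MAJORS)]


def pick_wheel(host_macos_major: int, py_tag: str) -> str:
    """Pick the newest deployment target that is not newer than the host."""
    usable = [m for m in MACOS_MAJORS if m <= host_macos_major]
    if not usable:
        raise RuntimeError("no wheel supports this macOS version")
    return wheel_name(py_tag, max(usable))
```

Under this model a macOS 14 host gets the 14.0 wheel, while macOS 15 and macOS 26 hosts both get the 15.0 wheel.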
Quick verification:
python -c "import enigma; print(enigma.__version__)"
# 0.1.1

Save as add.py and run with python add.py; no setup is needed beyond
pip install enigma-dsl numpy. This is the shortest end-to-end
trace → MSL codegen → GPU dispatch you can write.
import numpy as np
import enigma

@enigma.kernel
def add(A: enigma.f32, B: enigma.f32, C: enigma.f32):
    i = enigma.thread_position_in_grid
    C[i] = A[i] + B[i]

compiled = enigma.compile(add)

N = 1024
a = np.random.randn(N).astype(np.float32)
b = np.random.randn(N).astype(np.float32)

raw = enigma.MetalRuntime().execute(
    compiled, inputs=[a, b], output_size=N * 4,
    grid=(N, 1, 1), threads=(256, 1, 1),
)
c = np.frombuffer(raw, dtype=np.float32)
print("max |error| =", float(np.max(np.abs(c - (a + b)))))  # 0.0

What ran:
- @enigma.kernel traced the Python function into MLIR (the enigma dialect).
- enigma.compile lowered it to Metal Shading Language, then ran it through xcrun metal → AIR → xcrun metallib → .metallib.
- MetalRuntime().execute(...) mmap'd the .metallib, allocated GPU buffers, dispatched the compute pass, and returned the result bytes.
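The launch arguments in the example are worth spelling out: output_size is a byte count (1024 float32 values × 4 bytes), and grid=(N, 1, 1) with threads=(256, 1, 1) appears to mean N threads in total, grouped 256 per threadgroup. The sketch below derives those numbers for any N; `launch_params` is a hypothetical helper, not part of Enigma, and it assumes grid is the total thread count, as the example suggests.

```python
import numpy as np


def launch_params(n: int, dtype=np.float32, threads_per_group: int = 256) -> dict:
    """Derive launch arguments for a 1D elementwise kernel (hypothetical helper)."""
    return {
        # Size of the output buffer in bytes, not in elements.
        "output_size": n * np.dtype(dtype).itemsize,
        # One thread per element.
        "grid": (n, 1, 1),
        "threads": (threads_per_group, 1, 1),
        # Ceiling division: the number of threadgroups needed to cover n.
        "threadgroups": -(-n // threads_per_group),
    }


params = launch_params(1024)
print(params["output_size"], params["threadgroups"])
```

When N is not a multiple of the threadgroup size, whether the kernel needs an explicit bounds check depends on how the runtime dispatches; that is not specified here, so treat it as something to verify.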
For richer examples — RMSNorm, FlashAttention, 1D Laplacian — see the Showcase kernels section below.
You only need this path if you are hacking on the dialect itself, or porting to a future LLVM. The pipeline is two stages — dialect (native, C++/MLIR) first, then the Python DSL on top of it.
git clone https://github.com/Klyne-org/Enigma-DSL.git
cd Enigma-DSL
git submodule update --init --recursive # pulls Enigma-Dialect
# Stage 1: build LLVM 22.x + MLIR (one-time, ~30-90 min).
# Produces ~/.local/enigma-llvm/ — isolated from any Homebrew LLVM.
bash scripts/build_llvm.sh
# Stage 2: build the merged wheel (DSL + dialect) for the Python
# version of your choice. Repeat the --python flag for each version.
bash scripts/build_all.sh --python 3.12
# The wheel lands in wheelhouse/. The script also creates a venv
# (.venv for 3.12, .venv-py<X.Y> otherwise), installs into it, and
# runs pytest by default.

What scripts/build_all.sh actually does, in order:
1. Sources ~/.local/enigma-llvm/activate.sh to put the local MLIR on MLIR_DIR/LLVM_DIR (skipped if MLIR_DIR is already set, or if --skip-dialect is passed).
2. Builds enigma-dsl as a pure-Python wheel (py3-none-any).
3. Builds enigma-dialect as a native wheel (cpXY-cpXY-macosx_*_arm64), one per Python version. This invokes scikit-build-core, which in turn drives CMake against the local LLVM build.
4. Fixes Mach-O rpaths and re-codesigns the bundled dylibs so they load from @loader_path.
5. Merges the two wheels into a single enigma_dsl-*-cpXY-cpXY-*.whl containing both enigma/ and mlir/ packages, which is what users eventually pip install from PyPI.
6. Creates a per-Python venv, installs the merged wheel into it, and runs pytest tests/ (steps 5–6 are skipped by the corresponding --no-merge / --no-install / --no-test flags).

The macOS deployment target defaults to the host's OS major (Darwin 24
→ macOS 15.0, Darwin 23 → macOS 14.0); override it with --macos.
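The Darwin-to-macOS mapping can be sketched as follows. This is an illustration of the rule, not the script's actual code: `deployment_target` is a hypothetical function, and the handling of Darwin 25 reflects the macOS 26 renumbering rather than anything stated above.

```python
def deployment_target(darwin_release: str) -> str:
    """Map a Darwin kernel release string such as "24.1.0" to a macOS target."""
    darwin_major = int(darwin_release.split(".")[0])
    if 20 <= darwin_major <= 24:
        macos_major = darwin_major - 9  # Darwin 23 -> macOS 14, Darwin 24 -> macOS 15
    elif darwin_major == 25:
        macos_major = 26  # macOS jumped from 15 to 26, so the offset changes here
    else:
        raise ValueError(f"unsupported Darwin release: {darwin_release}")
    return f"{macos_major}.0"


# On a Mac, platform.release() returns the Darwin release string to pass in.
print(deployment_target("24.1.0"))
print(deployment_target("23.6.0"))
```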
Common variations:
# Multi-version build (publish-ready):
bash scripts/build_all.sh --python 3.11 --python 3.12 --python 3.13 \
--macos 14.0 --no-test --no-install --clean
# Build the dialect against an existing LLVM in a non-default location:
MLIR_DIR=/path/to/lib/cmake/mlir \
bash scripts/build_all.sh --python 3.12
# Build only the dialect (skip Python DSL):
bash scripts/build_all.sh --python 3.12 --skip-dsl
# Build only the DSL (reuse a previously built dialect wheel):
bash scripts/build_all.sh --python 3.12 --skip-dialect
# Keep the two wheels separate instead of merging them:
bash scripts/build_all.sh --python 3.12 --no-merge
# Write wheels to a custom directory instead of ./wheelhouse:
bash scripts/build_all.sh --python 3.12 --out /tmp/wheels

The LLVM step is the expensive one. After the first build_llvm.sh
finishes, every subsequent build_all.sh reuses it — incremental
dialect builds are ~3-5 min per Python. Pass --clean to wipe the
build caches and rebuild from scratch.
import numpy as np
import enigma

@enigma.kernel
def vector_add(A: enigma.f32, B: enigma.f32, C: enigma.f32):
    tid = enigma.thread_position_in_grid
    C[tid] = A[tid] + B[tid]

compiled = enigma.compile(vector_add)
print(compiled.metal_source)  # generated MSL: readable, debuggable

N = 1024
a, b = np.random.randn(N).astype(np.float32), np.random.randn(N).astype(np.float32)

runtime = enigma.MetalRuntime()
raw = runtime.execute(compiled, inputs=[a, b], output_size=N * 4,
                      grid=(N, 1, 1), threads=(256, 1, 1))
c = np.frombuffer(raw, dtype=np.float32)
assert np.allclose(c, a + b)

The examples/ directory has four end-to-end showcase kernels. Each ships
in three forms: the Enigma DSL version, the equivalent handwritten Metal
shader for comparison, and (where applicable) a benchmark harness that
times both.
| Kernel | DSL | Handwritten | What it shows |
|---|---|---|---|
| Vector add | vector_add.py, vector_add_tv.py | vector_add_naive.metal, vector_add_float4.metal, add_kernel_tv.metal | The hello-world. The TV variant uses @enigma.jit + layout algebra to choose vectorisation. |
| RMSNorm | benchmark_rmsnorm.py | rmsnorm_handwritten.metal | Reduction across a row, threadgroup shared memory, simd_sum. The benchmark times Enigma vs handwritten side-by-side. |
| SDPA / FlashAttention | benchmark_sdpa.py, showcase_attention.py | sdpa_handwritten.metal | Fused attention forward: online softmax, multi-simdgroup tiling, threadgroup reductions. |
| 1D Laplacian | conv1d_laplacian.py | conv1d_laplacian_handwritten.metal | Finite-difference stencil for PDE solvers (heat eq., diffusion). Boundary handling via enigma.if_. |

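For orientation, here is what NumPy references for two of these kernels might look like. This is a sketch under stated assumptions, not the code in examples/: the function names are hypothetical, the RMSNorm epsilon of 1e-6 is assumed, and the Laplacian is left unscaled (no 1/h² factor) with boundary outputs set to zero.

```python
import numpy as np


def rmsnorm_ref(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Row-wise RMSNorm: divide each row by its root-mean-square, then scale."""
    x32 = x.astype(np.float32)
    rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return (x32 / rms) * weight


def laplacian_1d_ref(u: np.ndarray) -> np.ndarray:
    """Second-difference stencil u[i-1] - 2*u[i] + u[i+1] on interior points."""
    out = np.zeros_like(u)
    out[1:-1] = u[:-2] - 2.0 * u[1:-1] + u[2:]
    return out  # the two boundary entries stay zero (assumed boundary rule)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
y = rmsnorm_ref(x, np.ones(64, dtype=np.float32))
print(np.sqrt(np.mean(y * y, axis=-1)))  # each row's RMS should be close to 1

u = np.arange(8, dtype=np.float32) ** 2
print(laplacian_1d_ref(u))  # the second difference of i**2 is 2 on the interior
```

A GPU result can then be compared against these with np.allclose, choosing tolerances to suit the kernel's precision.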
Run any of them directly:
python examples/conv1d_laplacian.py # prints generated MSL + numpy diff
python examples/benchmark_rmsnorm.py # benchmarks Enigma vs handwritten
python examples/showcase_attention.py # FlashAttention forward

Measured on a MacBook Air M4 (8-core GPU, 16 GB unified memory, 120 GB/s memory bandwidth, ~3.6 TFLOPS FP32 theoretical peak). All kernels pass correctness against a NumPy reference.
Single-dispatch megakernel: RMSNorm → QKV proj → head-norm + RoPE → SDPA → O-proj → SwiGLU → down-proj, all in one threadgroup.
| Metric | Value |
|---|---|
| Throughput | 92.6 tok/s |
| Latency | 10.79 ms/tok |
| Compile time | 0.08 s (14 KB of generated MSL) |
| Correctness | `max |
Reproduce:
python examples/benchmark_sdpa.py
python examples/qwen_megakernel.py

- Kernel surface: arithmetic, unary/binary/ternary float math, integer intrinsics (popcount, clz, mulhi, …), vector ops (make_float4, dot, cross, length, …), pack/unpack ops, comparisons, select/where, if_, for_range, casts, atomics with explicit memory order.
- GPU-specific: thread/threadgroup/grid queries, simdgroup reductions and shuffles, quad-group ops, threadgroup shared memory and barriers, simdgroup matrix ops on Apple's hardware matrix units.
- Layout algebra (CuTe-style): Layout, Shape, Stride with composition, complement, coalesce, recast, zipped divide. The make_layout_tv constructor builds Thread × Value layouts.
- Two compilation paths: @enigma.kernel (raw, you set the grid) and @enigma.jit (layout-driven, the engine sets the grid).
- Compatibility: tested on Apple M-series GPUs through xcrun metal / xcrun metallib. Float32, float16, bfloat16, integer widths 8–64.

- docs/api-reference.md: exhaustive op-by-op reference (30 sections, every primitive).
- Enigma-Dialect/: submodule with the C++/MLIR dialect definition, MSL emitter, and dialect-level lit tests.

python -m pytest tests/ # full Python suite
bash Enigma-Dialect/test/run_tests.sh # MLIR/lit suite
bash Enigma-Dialect/test/run_tests.sh --gpu # plus GPU dispatch

MIT. See LICENSE.

v0.1.1 — first PyPI release. Six merged wheels published
(enigma_dsl-0.1.1-cp{311,312,313}-cp{311,312,313}-macosx_{14_0,15_0}_arm64).
Installable via pip install enigma-dsl on any Apple Silicon Mac
running macOS 14+. Project page: https://pypi.org/project/enigma-dsl/.
v0.1.0 — initial release. Layout algebra engine (composition, complement, coalesce, zipped divide, recast, TV layout construction). Tracing IR with SSA values, constant folding, thread index decomposition. Metal emitter supporting scalar, float4 vector pointer, and TV-layout vectorised codegen. Swift runtime with device management, buffer allocation, synchronous dispatch, GPU timestamp measurement. Dialect TableGen definitions covering thread indexing, synchronisation, math, atomics, simdgroup, quad, geometry, pack/unpack, and matrix ops.
