268 changes: 268 additions & 0 deletions .claude/skills/adding-cutile-kernel/SKILL.md
@@ -0,0 +1,268 @@
---
name: adding-cutile-kernel
description: Add a new cuTile GPU kernel operator to TileGym. Covers dispatch registration in ops.py, cuTile backend implementation, __init__.py exports, test creation, and benchmark in tests/benchmark. Use when adding, creating, or implementing a new cuTile operator/kernel in TileGym, or when asking how to register a new cuTile op.
license: MIT. Complete terms in LICENSE.
---

# Adding a cuTile Kernel to TileGym

End-to-end workflow for adding a new operator (e.g., `my_op`) with cuTile backend.

## Execution Rules

**MUST follow these rules strictly:**
1. Use TodoWrite to create the checklist below BEFORE writing any code
2. Execute steps **in order** — do NOT skip ahead or combine steps
3. Mark each todo as `completed` after finishing, `in_progress` when starting
4. If a step is not applicable (e.g., no cuTile impl), mark it `completed` with a note, do NOT silently skip
5. Each step MUST result in a file write or explicit skip decision — no silent omissions

## Workflow

MUST copy this checklist to TodoWrite at the start:

```
- [ ] Step 1: Register dispatch interface in ops.py
- [ ] Step 2: Implement cuTile backend
- [ ] Step 3: Register in __init__.py (cutile)
- [ ] Step 4: Add tests
- [ ] Step 5: Add benchmark to tests/benchmark
- [ ] Step 6: Verify (run pytest + lint)
```

## Step 1: Register dispatch interface

**File**: `src/tilegym/ops/ops.py`

Add a `@dispatch` function — this is the **single entry point** for all backends.

```python
@dispatch(
    "my_op",
)
def my_op(
    input: torch.Tensor,
    out: Optional[torch.Tensor] = None,
    **kwargs: Any,
):
    """
    Description of my_op.

    Args:
        input: Input tensor
        out: Optional preallocated output tensor
        **kwargs: Additional arguments for backend-specific configurations

    Returns:
        torch.Tensor
    """
    raise NotImplementedError(f"my_op is not implemented for {get_current_backend()}")
```

**Key rules:**
- Function body only raises `NotImplementedError`
- Include `**kwargs` for backend-specific parameters

**Reference**: See existing ops in `src/tilegym/ops/ops.py` (e.g., `silu_and_mul`, `softmax`)

## Step 2: Implement cuTile backend

**File**: `src/tilegym/ops/cutile/my_op.py`

The file structure follows this template:

```python
import torch
import cuda.tile as ct

from tilegym.backend import register_impl

BLOCK_SIZE = 1024


@ct.kernel
def my_op_kernel_ct(x, output, n_elements: ct.Constant[int], BLOCK_SIZE: ct.Constant[int]):
    bid = ct.bid(0)
    indices = bid * BLOCK_SIZE + ct.arange(0, BLOCK_SIZE)
    x_val = ct.gather(x, indices)
    # ... compute result from x_val ...
    ct.scatter(output, indices, result)


@register_impl("my_op", backend="cutile")
def my_op(input: torch.Tensor, out: torch.Tensor = None, **kwargs) -> torch.Tensor:
    n = input.numel()
    if out is None:
        out = torch.empty_like(input)
    grid = ((n + BLOCK_SIZE - 1) // BLOCK_SIZE,)
    stream = torch.cuda.current_stream()
    # See the reference file for the exact launch signature.
    ct.launch(stream, grid, my_op_kernel_ct, (input, out, n, BLOCK_SIZE))
    return out
```

**Reference**: `src/tilegym/ops/cutile/silu_and_mul.py`
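The grid computation in the template is plain ceiling division: enough blocks to cover `n` elements at `BLOCK_SIZE` elements per block. A quick standalone check of that arithmetic:

```python
BLOCK_SIZE = 1024


def grid_size(n: int, block: int = BLOCK_SIZE) -> int:
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + block - 1) // block


assert grid_size(1024) == 1  # exact fit: one block
assert grid_size(1025) == 2  # one spill element needs a second block
print(grid_size(5000))       # → 5
```

Because the last block may run past `n`, kernels that use this pattern typically mask out-of-range indices (the `n_elements` parameter exists for exactly that).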

## Step 3: Register in `__init__.py` (CRITICAL)

Missing this step means the cuTile backend implementation never gets loaded.

**File**: `src/tilegym/ops/cutile/__init__.py`

Add inside `if is_backend_available("cutile"):` block (alphabetically):

```python
from . import my_op
```

And in the function import section:

```python
from .my_op import my_op
```

And add `"my_op"` to `__all__`.
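Taken together, the three edits leave `src/tilegym/ops/cutile/__init__.py` looking roughly like this sketch (the `silu_and_mul` entries are illustrative neighbors; exact surrounding contents are assumed):

```python
if is_backend_available("cutile"):
    from . import my_op          # module import: runs @register_impl
    from . import silu_and_mul

    from .my_op import my_op     # function re-export
    from .silu_and_mul import silu_and_mul

__all__ = [
    "my_op",
    "silu_and_mul",
]
```

The bare `from . import my_op` is what triggers registration; the `from .my_op import my_op` re-export is only for convenience imports.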

## Step 4: Add tests

**File**: `tests/ops/test_my_op.py`

**CRITICAL**: Always import from `tilegym.ops`, NEVER from `tilegym.ops.cutile.my_op`.

```python
import pytest
import torch

from tilegym.backend import is_backend_available, set_backend

from .. import common

_backends = ["cutile"]


class Test_MY_OP(common.PyTestCase):
    @staticmethod
    def reference(input):
        """Reference implementation using PyTorch."""
        return torch.some_reference(input)

    @pytest.mark.parametrize(
        "shape, dtype",
        [
            ((1024,), torch.float16),
            ((1024, 512), torch.float32),
            ((64, 64, 64), torch.bfloat16),
        ],
    )
    @pytest.mark.parametrize("backend", _backends)
    def test_op(self, shape, dtype, backend, arch):
        if backend == "cutile" and not is_backend_available("cutile"):
            pytest.skip("Cutile backend not available")
        try:
            set_backend(backend)
        except Exception as e:
            pytest.skip(f"Backend is not supported: {e}")

        self.setUp()

        from tilegym.ops import my_op

        A = torch.randn(*shape, dtype=dtype, device="cuda")
        self.assertCorrectness(
            my_op,
            self.reference,
            {"input": A},
            atol=1e-3,
            rtol=1e-3,
        )
```

**Key patterns:**
- `_backends = ["cutile"]`
- `test_op`: use `set_backend(backend)` with try-except, call `self.setUp()`

**Reference**: `tests/ops/test_silu_and_mul.py`
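The `atol`/`rtol` arguments combine into a per-element tolerance the way `torch.testing.assert_close`-style checks do. A pure-Python sketch of that semantics (assumed simplification, no torch required):

```python
def allclose(actual, expected, atol=1e-3, rtol=1e-3):
    """Elementwise |a - e| <= atol + rtol * |e|, as in torch/numpy closeness checks."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Allowed error for expected value 2.0 is 1e-3 + 1e-3 * 2.0 = 0.003
assert allclose([1.0000, 2.0005], [1.0, 2.0])  # within combined tolerance
assert not allclose([1.0, 2.1], [1.0, 2.0])    # 0.1 error is far too large
```

This is why the looser `1e-3` tolerances are appropriate for float16/bfloat16 kernels: the relative term scales the budget with the magnitude of the reference output.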

Common errors:

1. Missing `_backends` list (inside class)
2. `test_op` / `test_op_xxx`: missing `@pytest.mark.parametrize("backend", _backends)`, the `backend` parameter, or the `is_backend_available` / `set_backend` pattern

## Step 5: Add benchmark to tests/benchmark

**File**: `tests/benchmark/bench_my_op.py`

**Key rules from benchmark_rules.md:**
- Call the op via `tilegym.ops.my_op(a, b, ..., backend=backend)` — do **not** use `set_backend`.
- Define `ALL_BACKENDS` (include at least `cutile` and `torch`), filter with `get_supported_backends()`.
- Implement `reference_my_op(...)` and register it: `register_impl("my_op", "torch")(reference_my_op)`.
- Use `create_benchmark_config()` to build `triton.testing.Benchmark` configs (e.g. by shape/dtype).
- Use `@triton.testing.perf_report([...])` on `bench_my_op(...)`; inside the bench function: correctness check with `torch.testing.assert_close(fn(), ref(), ...)`, then `ms = triton.testing.do_bench(fn)` (or `do_bench_cudagraph`), compute GB/s or TFLOPS, and return the metric.
- Entry point: `if __name__ == "__main__": bench_my_op.run(print_data=True)`.

Template structure:

```python
import torch
import triton
import triton.testing

import tilegym
from tilegym.backend import is_backend_available, register_impl

ALL_BACKENDS = [
    ("cutile", "cuTile", ("orange", "-")) if is_backend_available("cutile") else None,
    ("torch", "PyTorch", ("green", "-")),
]


def get_supported_backends():
    return [p for p in ALL_BACKENDS if p is not None]


def reference_my_op(input: torch.Tensor, out: torch.Tensor = None, **kwargs):
    """Reference implementation using PyTorch."""
    ...


register_impl("my_op", "torch")(reference_my_op)


def create_benchmark_config(datatype, ...):
    available_backends = get_supported_backends()
    if not available_backends:
        return None
    backends, names, styles = zip(*available_backends)
    return triton.testing.Benchmark(
        x_names=["M"],  # or other dimension names
        x_vals=[...],
        line_arg="backend",
        line_vals=list(backends),
        line_names=list(names),
        styles=list(styles),
        ylabel="GB/s",  # or TFLOPS
        plot_name="my-op-...",
        args={"datatype": datatype, ...},
    )


@triton.testing.perf_report([
    create_benchmark_config(datatype, ...)
    for datatype in [torch.float16, torch.float32]
    for ... in [...]
])
def bench_my_op(M, backend, datatype, ..., device="cuda"):
    x = torch.randn(..., dtype=datatype, device=device)

    fn = lambda: tilegym.ops.my_op(x, backend=backend)
    ref = lambda: reference_my_op(x)
    torch.testing.assert_close(fn(), ref(), rtol=1e-2, atol=1e-2)

    ms = triton.testing.do_bench(fn)  # or do_bench_cudagraph(fn)
    # Compute metric (e.g. GB/s or TFLOPS) from ms and problem size
    return metric


if __name__ == "__main__":
    bench_my_op.run(print_data=True)
```

**Benchmark Plot Names**: Must include `-TFLOPS` or `-GBps` suffix
- Example: `plot_name=f"persistent-layer-norm-M{num_rows}-{dtype_name}-GBps"`
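The GB/s metric itself is just total bytes moved divided by elapsed time. A standalone sketch of that arithmetic (the tensor shape and timing are illustrative):

```python
def gbps(total_bytes: int, ms: float) -> float:
    """Convert bytes moved and elapsed milliseconds to GB/s."""
    return total_bytes / (ms * 1e-3) / 1e9


# e.g. read + write of a 1024 x 512 float16 tensor (2 bytes/elem) in 0.01 ms
n_bytes = 2 * (1024 * 512) * 2
print(round(gbps(n_bytes, 0.01), 1))  # → 209.7
```

For compute-bound ops, substitute FLOP count for bytes and report TFLOPS instead (`flops / (ms * 1e-3) / 1e12`), matching the `-TFLOPS` plot-name suffix.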

## Step 6: Verify

```bash
# Run tests
pytest tests/ops/test_my_op.py -v

# Run benchmark (optional)
python tests/benchmark/bench_my_op.py

# Lint
pre-commit run -a
```
85 changes: 84 additions & 1 deletion .github/scripts/check_spdx_headers.py
@@ -16,6 +16,7 @@
import sys
from pathlib import Path
from typing import Dict
from typing import Iterator
from typing import List
from typing import Optional
from typing import Tuple
@@ -77,7 +78,17 @@ def should_skip_file(file_path: Path) -> bool:
        return True

    # Skip directories
    dir_patterns = ["__pycache__", ".pytest_cache", "node_modules", "venv", "env", ".egg-info", "dist", "build"]
    dir_patterns = [
        "__pycache__",
        ".pytest_cache",
        "node_modules",
        "venv",
        "env",
        ".egg-info",
        "dist",
        "build",
        ".claude",
    ]
    for pattern in dir_patterns:
        if pattern in file_path.parts:
            return True
@@ -218,6 +229,62 @@ def find_files(root_dir: Path) -> List[Path]:
    return files


# License field to insert into SKILL.md frontmatter.
SKILL_LICENSE_LINE = "license: MIT. Complete terms in LICENSE."


def iter_skill_files(root_dir: Path) -> Iterator[Path]:
    """Yield SKILL.md files under .claude/skills/."""
    skills_dir = root_dir / ".claude" / "skills"
    if not skills_dir.is_dir():
        return
    for skill_dir in sorted(skills_dir.iterdir()):
        skill_md = skill_dir / "SKILL.md"
        if skill_md.is_file():
            yield skill_md


def has_skill_license(content: str) -> bool:
    """Check if a SKILL.md file has a 'license:' field in its YAML frontmatter."""
    lines = content.split("\n")
    if not lines or lines[0].strip() != "---":
        return False
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            return "license:" in frontmatter
    return False


def add_skill_license(file_path: Path) -> bool:
    """Add license field to the YAML frontmatter of a SKILL.md file."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()

        if has_skill_license(content):
            return False

        lines = content.split("\n")
        if not lines or lines[0].strip() != "---":
            return False

        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                lines.insert(i, SKILL_LICENSE_LINE)
                break
        else:
            return False

        with open(file_path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines))
        return True

    except Exception as e:
        print(f"Error processing {file_path}: {e}", file=sys.stderr)
        return False


def action_write(root_dir: Path) -> int:
    """Add SPDX headers to files that are missing them."""
    files = find_files(root_dir)
@@ -232,6 +299,12 @@ def action_write(root_dir: Path) -> int:
            print(f"Added header to: {file_path.relative_to(root_dir)}")
            modified_count += 1

    # Also handle SKILL.md files under .claude/skills/
    for skill_md in iter_skill_files(root_dir):
        if add_skill_license(skill_md):
            print(f"Added license to frontmatter: {skill_md.relative_to(root_dir)}")
            modified_count += 1

    print(f"\nModified {modified_count} file(s)")
    return 0

@@ -245,6 +318,16 @@ def action_check(root_dir: Path) -> int:
        if not check_file(file_path):
            missing_headers.append(file_path)

    # Also check SKILL.md files under .claude/skills/
    for skill_md in iter_skill_files(root_dir):
        try:
            with open(skill_md, "r", encoding="utf-8") as f:
                content = f.read()
            if not has_skill_license(content):
                missing_headers.append(skill_md)
        except Exception as e:
            print(f"Error reading {skill_md}: {e}", file=sys.stderr)

    if missing_headers:
        print("❌ The following files are missing SPDX headers:\n")
        for file_path in missing_headers:
2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -62,6 +62,8 @@ If you are adding a **new kernel** (new `@ct.kernel` / new op implementation) th

New cuTile kernel contributions should first be placed in the `experimental/` directories. Once the TileGym team has fully verified functional correctness and performance, kernels will be promoted from `experimental/` into the main source tree.

We provide an `adding-cutile-kernel` skill for AI agents to add new kernels in this repo.

##### Directory structure

```