268 changes: 268 additions & 0 deletions .claude/skills/adding-cutile-kernel/SKILL.md
@@ -0,0 +1,268 @@
---
name: adding-cutile-kernel
description: Add a new cuTile GPU kernel operator to TileGym. Covers dispatch registration in ops.py, cuTile backend implementation, __init__.py exports, test creation, and benchmark in tests/benchmark. Use when adding, creating, or implementing a new cuTile operator/kernel in TileGym, or when asking how to register a new cuTile op.
license: MIT. Complete terms in LICENSE.
---

# Adding a cuTile Kernel to TileGym

End-to-end workflow for adding a new operator (e.g., `my_op`) with cuTile backend.

## Execution Rules

**MUST follow these rules strictly:**
1. Use TodoWrite to create the checklist below BEFORE writing any code
2. Execute steps **in order** — do NOT skip ahead or combine steps
3. Mark each todo as `completed` after finishing, `in_progress` when starting
4. If a step is not applicable (e.g., no cuTile impl), mark it `completed` with a note, do NOT silently skip
5. Each step MUST result in a file write or explicit skip decision — no silent omissions

## Workflow

MUST copy this checklist to TodoWrite at the start:

```
- [ ] Step 1: Register dispatch interface in ops.py
- [ ] Step 2: Implement cuTile backend
- [ ] Step 3: Register in __init__.py (cutile)
- [ ] Step 4: Add tests
- [ ] Step 5: Add benchmark to tests/benchmark
- [ ] Step 6: Verify (run pytest + lint)
```

## Step 1: Register dispatch interface

**File**: `src/tilegym/ops/ops.py`

Add a `@dispatch` function — this is the **single entry point** for all backends.

```python
@dispatch(
    "my_op",
)
def my_op(
    input: torch.Tensor,
    out: Optional[torch.Tensor] = None,
    **kwargs: Any,
):
    """
    Description of my_op.

    Args:
        input: Input tensor
        out: Optional preallocated output tensor
        **kwargs: Additional arguments for backend-specific configurations

    Returns:
        torch.Tensor
    """
    raise NotImplementedError(f"my_op is not implemented for {get_current_backend()}")
```

**Key rules:**
- Function body only raises `NotImplementedError`
- Include `**kwargs` for backend-specific parameters

**Reference**: See existing ops in `src/tilegym/ops/ops.py` (e.g., `silu_and_mul`, `softmax`)

## Step 2: Implement cuTile backend

**File**: `src/tilegym/ops/cutile/my_op.py`

The file structure follows this template:

```python
import torch
import cuda.tile as ct

from tilegym.backend import register_impl

BLOCK_SIZE = 1024


@ct.kernel
def my_op_kernel_ct(x, output, n_elements: ct.Constant[int], BLOCK_SIZE: ct.Constant[int]):
    bid = ct.bid(0)
    indices = bid * BLOCK_SIZE + ct.arange(0, BLOCK_SIZE)
    x_val = ct.gather(x, indices)
    # ... compute result from x_val ...
    ct.scatter(output, indices, result)


@register_impl("my_op", backend="cutile")
def my_op(input: torch.Tensor, out: torch.Tensor = None, **kwargs) -> torch.Tensor:
    n = input.numel()
    if out is None:
        out = torch.empty_like(input)
    grid = ((n + BLOCK_SIZE - 1) // BLOCK_SIZE,)
    stream = torch.cuda.current_stream()
    # See the reference file for the exact launch signature.
    ct.launch(stream, grid, my_op_kernel_ct, (input, out, n, BLOCK_SIZE))
    return out
```

**Reference**: `src/tilegym/ops/cutile/silu_and_mul.py`
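The grid computation in the template is plain ceiling division: enough blocks to cover `n` elements at `BLOCK_SIZE` elements per block. A quick standalone check of that arithmetic:

```python
BLOCK_SIZE = 1024


def grid_size(n: int, block: int = BLOCK_SIZE) -> int:
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + block - 1) // block


assert grid_size(1024) == 1  # exact fit: one block
assert grid_size(1025) == 2  # one spill element needs a second block
print(grid_size(5000))       # → 5
```

Because the last block may run past `n`, kernels that use this pattern typically mask out-of-range indices (the `n_elements` parameter exists for exactly that).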

## Step 3: Register in `__init__.py` (CRITICAL)

Missing this step means the cuTile backend implementation never gets loaded.

**File**: `src/tilegym/ops/cutile/__init__.py`

Add inside `if is_backend_available("cutile"):` block (alphabetically):

```python
from . import my_op
```

And in the function import section:

```python
from .my_op import my_op
```

And add `"my_op"` to `__all__`.
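Taken together, the three edits leave `src/tilegym/ops/cutile/__init__.py` looking roughly like this sketch (the `silu_and_mul` entries are illustrative neighbors; exact surrounding contents are assumed):

```python
if is_backend_available("cutile"):
    from . import my_op          # module import: runs @register_impl
    from . import silu_and_mul

    from .my_op import my_op     # function re-export
    from .silu_and_mul import silu_and_mul

__all__ = [
    "my_op",
    "silu_and_mul",
]
```

The bare `from . import my_op` is what triggers registration; the `from .my_op import my_op` re-export is only for convenience imports.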

## Step 4: Add tests

**File**: `tests/ops/test_my_op.py`

**CRITICAL**: Always import from `tilegym.ops`, NEVER from `tilegym.ops.cutile.my_op`.

```python
import pytest
import torch

from tilegym.backend import is_backend_available, set_backend

from .. import common

_backends = ["cutile"]


class Test_MY_OP(common.PyTestCase):
    @staticmethod
    def reference(input):
        """Reference implementation using PyTorch."""
        return torch.some_reference(input)

    @pytest.mark.parametrize(
        "shape, dtype",
        [
            ((1024,), torch.float16),
            ((1024, 512), torch.float32),
            ((64, 64, 64), torch.bfloat16),
        ],
    )
    @pytest.mark.parametrize("backend", _backends)
    def test_op(self, shape, dtype, backend, arch):
        if backend == "cutile" and not is_backend_available("cutile"):
            pytest.skip("Cutile backend not available")
        try:
            set_backend(backend)
        except Exception as e:
            pytest.skip(f"Backend is not supported: {e}")

        self.setUp()

        from tilegym.ops import my_op

        A = torch.randn(*shape, dtype=dtype, device="cuda")
        self.assertCorrectness(
            my_op,
            self.reference,
            {"input": A},
            atol=1e-3,
            rtol=1e-3,
        )
```

**Key patterns:**
- `_backends = ["cutile"]`
- `test_op`: use `set_backend(backend)` with try-except, call `self.setUp()`

**Reference**: `tests/ops/test_silu_and_mul.py`
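The `atol`/`rtol` arguments combine into a per-element tolerance the way `torch.testing.assert_close`-style checks do. A pure-Python sketch of that semantics (assumed simplification, no torch required):

```python
def allclose(actual, expected, atol=1e-3, rtol=1e-3):
    """Elementwise |a - e| <= atol + rtol * |e|, as in torch/numpy closeness checks."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Allowed error for expected value 2.0 is 1e-3 + 1e-3 * 2.0 = 0.003
assert allclose([1.0000, 2.0005], [1.0, 2.0])  # within combined tolerance
assert not allclose([1.0, 2.1], [1.0, 2.0])    # 0.1 error is far too large
```

This is why the looser `1e-3` tolerances are appropriate for float16/bfloat16 kernels: the relative term scales the budget with the magnitude of the reference output.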

Common errors:

1. Missing `_backends` list (inside class)
2. `test_op` / `test_op_xxx`: missing `@pytest.mark.parametrize("backend", _backends)`, the `backend` parameter, or the `is_backend_available` / `set_backend` pattern

## Step 5: Add benchmark to tests/benchmark

**File**: `tests/benchmark/bench_my_op.py`

**Key rules from benchmark_rules.md:**
- Call the op via `tilegym.ops.my_op(a, b, ..., backend=backend)` — do **not** use `set_backend`.
- Define `ALL_BACKENDS` (include at least `cutile` and `torch`), filter with `get_supported_backends()`.
- Implement `reference_my_op(...)` and register it: `register_impl("my_op", "torch")(reference_my_op)`.
- Use `create_benchmark_config()` to build `triton.testing.Benchmark` configs (e.g. by shape/dtype).
- Use `@triton.testing.perf_report([...])` on `bench_my_op(...)`; inside the bench function: correctness check with `torch.testing.assert_close(fn(), ref(), ...)`, then `ms = triton.testing.do_bench(fn)` (or `do_bench_cudagraph`), compute GB/s or TFLOPS, and return the metric.
- Entry point: `if __name__ == "__main__": bench_my_op.run(print_data=True)`.

Template structure:

```python
import torch
import triton
import triton.testing

import tilegym
from tilegym.backend import is_backend_available, register_impl

ALL_BACKENDS = [
    ("cutile", "cuTile", ("orange", "-")) if is_backend_available("cutile") else None,
    ("torch", "PyTorch", ("green", "-")),
]


def get_supported_backends():
    return [p for p in ALL_BACKENDS if p is not None]


def reference_my_op(input: torch.Tensor, out: torch.Tensor = None, **kwargs):
    """Reference implementation using PyTorch."""
    ...


register_impl("my_op", "torch")(reference_my_op)


def create_benchmark_config(datatype, ...):
    available_backends = get_supported_backends()
    if not available_backends:
        return None
    backends, names, styles = zip(*available_backends)
    return triton.testing.Benchmark(
        x_names=["M"],  # or other dimension names
        x_vals=[...],
        line_arg="backend",
        line_vals=list(backends),
        line_names=list(names),
        styles=list(styles),
        ylabel="GB/s",  # or TFLOPS
        plot_name="my-op-...",
        args={"datatype": datatype, ...},
    )


@triton.testing.perf_report([
    create_benchmark_config(datatype, ...)
    for datatype in [torch.float16, torch.float32]
    for ... in [...]
])
def bench_my_op(M, backend, datatype, ..., device="cuda"):
    x = torch.randn(..., dtype=datatype, device=device)

    fn = lambda: tilegym.ops.my_op(x, backend=backend)
    ref = lambda: reference_my_op(x)
    torch.testing.assert_close(fn(), ref(), rtol=1e-2, atol=1e-2)

    ms = triton.testing.do_bench(fn)  # or do_bench_cudagraph(fn)
    # Compute metric (e.g. GB/s or TFLOPS) from ms and problem size
    return metric


if __name__ == "__main__":
    bench_my_op.run(print_data=True)
```

**Benchmark Plot Names**: Must include `-TFLOPS` or `-GBps` suffix
- Example: `plot_name=f"persistent-layer-norm-M{num_rows}-{dtype_name}-GBps"`
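The GB/s metric itself is just total bytes moved divided by elapsed time. A standalone sketch of that arithmetic (the tensor shape and timing are illustrative):

```python
def gbps(total_bytes: int, ms: float) -> float:
    """Convert bytes moved and elapsed milliseconds to GB/s."""
    return total_bytes / (ms * 1e-3) / 1e9


# e.g. read + write of a 1024 x 512 float16 tensor (2 bytes/elem) in 0.01 ms
n_bytes = 2 * (1024 * 512) * 2
print(round(gbps(n_bytes, 0.01), 1))  # → 209.7
```

For compute-bound ops, substitute FLOP count for bytes and report TFLOPS instead (`flops / (ms * 1e-3) / 1e12`), matching the `-TFLOPS` plot-name suffix.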

## Step 6: Verify

```bash
# Run tests
pytest tests/ops/test_my_op.py -v

# Run benchmark (optional)
python tests/benchmark/bench_my_op.py

# Lint
pre-commit run -a
```
85 changes: 84 additions & 1 deletion .github/scripts/check_spdx_headers.py
@@ -16,6 +16,7 @@
import sys
from pathlib import Path
from typing import Dict
from typing import Iterator
from typing import List
from typing import Optional
from typing import Tuple
@@ -77,7 +78,17 @@ def should_skip_file(file_path: Path) -> bool:
        return True

    # Skip directories
    dir_patterns = ["__pycache__", ".pytest_cache", "node_modules", "venv", "env", ".egg-info", "dist", "build"]
    dir_patterns = [
        "__pycache__",
        ".pytest_cache",
        "node_modules",
        "venv",
        "env",
        ".egg-info",
        "dist",
        "build",
        ".claude",
    ]
    for pattern in dir_patterns:
        if pattern in file_path.parts:
            return True
@@ -218,6 +229,62 @@ def find_files(root_dir: Path) -> List[Path]:
    return files


# License field to insert into SKILL.md frontmatter.
SKILL_LICENSE_LINE = "license: MIT. Complete terms in LICENSE."


def iter_skill_files(root_dir: Path) -> Iterator[Path]:
    """Yield SKILL.md files under .claude/skills/."""
    skills_dir = root_dir / ".claude" / "skills"
    if not skills_dir.is_dir():
        return
    for skill_dir in sorted(skills_dir.iterdir()):
        skill_md = skill_dir / "SKILL.md"
        if skill_md.is_file():
            yield skill_md


def has_skill_license(content: str) -> bool:
    """Check if a SKILL.md file has a 'license:' field in its YAML frontmatter."""
    lines = content.split("\n")
    if not lines or lines[0].strip() != "---":
        return False
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            return "license:" in frontmatter
    return False


def add_skill_license(file_path: Path) -> bool:
    """Add license field to the YAML frontmatter of a SKILL.md file."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()

        if has_skill_license(content):
            return False

        lines = content.split("\n")
        if not lines or lines[0].strip() != "---":
            return False

        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                lines.insert(i, SKILL_LICENSE_LINE)
                break
        else:
            return False

        with open(file_path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines))
        return True

    except Exception as e:
        print(f"Error processing {file_path}: {e}", file=sys.stderr)
        return False


def action_write(root_dir: Path) -> int:
    """Add SPDX headers to files that are missing them."""
    files = find_files(root_dir)
@@ -232,6 +299,12 @@ def action_write(root_dir: Path) -> int:
            print(f"Added header to: {file_path.relative_to(root_dir)}")
            modified_count += 1

    # Also handle SKILL.md files under .claude/skills/
    for skill_md in iter_skill_files(root_dir):
        if add_skill_license(skill_md):
            print(f"Added license to frontmatter: {skill_md.relative_to(root_dir)}")
            modified_count += 1

    print(f"\nModified {modified_count} file(s)")
    return 0

@@ -245,6 +318,16 @@ def action_check(root_dir: Path) -> int:
        if not check_file(file_path):
            missing_headers.append(file_path)

    # Also check SKILL.md files under .claude/skills/
    for skill_md in iter_skill_files(root_dir):
        try:
            with open(skill_md, "r", encoding="utf-8") as f:
                content = f.read()
            if not has_skill_license(content):
                missing_headers.append(skill_md)
        except Exception as e:
            print(f"Error reading {skill_md}: {e}", file=sys.stderr)

    if missing_headers:
        print("❌ The following files are missing SPDX headers:\n")
        for file_path in missing_headers:
2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -62,6 +62,8 @@ If you are adding a **new kernel** (new `@ct.kernel` / new op implementation) th

New cuTile kernel contributions should first be placed in the `experimental/` directories. Once the TileGym team has fully verified functional correctness and performance, kernels will be promoted from `experimental/` into the main source tree.

We provide an `adding-cutile-kernel` skill for AI agents to add new kernels in this repo.

##### Directory structure

```