
Commit f589076

mzhong4claude committed
Iter 117b-2: verify Triton entmax kernel against deep-spin/entmax reference
Cross-checked the Triton forward + backward formulation against the official
deep-spin/entmax Entmax15Function source (commit-current as of 2026-04-30):

- Forward: equivalent to the reference modulo X/2 vs 0.5*(z-tau) factoring
  (algebra: reference's tau' = our tau / 2; output Y = max(X'-tau', 0)²
  = max(z/2 - tau/2, 0)² = max(0.5(z-tau), 0)² = our w). Matches our
  existing pure-PyTorch entmax_1p5 in train_gpt.py.
- Backward: matches deep-spin/entmax line-for-line.
  Reference:  gppr = sqrt(Y); dX = dY*gppr;
              q = dX.sum(dim)/gppr.sum(dim); dX -= q*gppr
  Our kernel: s = sqrt(w); c = sum(s*grad_w)/sum(s); grad_z = s*(grad_w - c)
  Identical (dX.sum = sum(grad_w * sqrt(w)) = sum(s*grad_w)).
- Numerical stability: our discr.clamp_min(1e-6) is STRICTER than the
  reference's clamp(delta, 0); the reference has a latent sqrt(0) backward
  NaN bug (the gradient of sqrt at 0 is Inf, so 0*Inf = NaN under the chain
  rule even when the downstream coefficient is zero), which we already fixed
  in iter 117 v3 (commit a9ec303339adfc).

Sources:
- https://github.com/deep-spin/entmax/blob/master/entmax/activations.py
- https://arxiv.org/pdf/1905.05702 (Peters/Niculae/Martins 2019,
  §3 Algorithm 2 + Proposition 2)

Updated the experiments/test_entmax_triton.py header to document the
verification chain. The kernel is correctness-verified by reference review;
empirical numerical-equivalence tests are still gated on iter 117b-1
finishing (GPUs currently saturated by iter 117b-1 training).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
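The algebraic identities claimed above can be checked with a minimal pure-Python sketch (not from the repo; function names and the test vectors are illustrative): the reference backward's `dX - q*gppr` and the kernel's `s*(grad_w - c)` are the same expression, and the two forward factorings square the same quantity.

```python
import math

def reference_backward(Y, dY):
    """deep-spin/entmax Entmax15Function.backward, as quoted in the commit."""
    gppr = [math.sqrt(y) for y in Y]                # gppr = sqrt(Y)
    dX = [dy * g for dy, g in zip(dY, gppr)]        # dX = dY * gppr
    q = sum(dX) / sum(gppr)                         # q = dX.sum / gppr.sum
    return [dx - q * g for dx, g in zip(dX, gppr)]  # dX -= q * gppr

def kernel_backward(w, grad_w):
    """The kernel's factoring: s = sqrt(w); c = sum(s*grad_w)/sum(s)."""
    s = [math.sqrt(x) for x in w]
    c = sum(si * gi for si, gi in zip(s, grad_w)) / sum(s)
    return [si * (gi - c) for si, gi in zip(s, grad_w)]

# Any nonnegative output vector (entmax outputs can be exactly 0) and cotangent.
w = [0.5, 0.3, 0.2, 0.0]
g = [1.7, -0.4, 0.9, 2.2]
ref, ours = reference_backward(w, g), kernel_backward(w, g)
assert all(abs(a - b) < 1e-12 for a, b in zip(ref, ours))

# Forward factoring: max(X' - tau', 0)^2 with X' = z/2 and tau' = tau/2
# is the same number as max(0.5*(z - tau), 0)^2.
z, tau = [1.3, -0.2, 0.7], 0.4
halved = [max(zi / 2 - tau / 2, 0.0) ** 2 for zi in z]
unhalved = [max(0.5 * (zi - tau), 0.0) ** 2 for zi in z]
assert halved == unhalved
```

The backward identity is exact, not approximate: substituting `dX_i = g_i*s_i` into the reference gives `q = sum(g*s)/sum(s) = c` and `dX_i - q*s_i = s_i*(g_i - c)` term by term.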
1 parent 554ad79 commit f589076

1 file changed

Lines changed: 21 additions & 0 deletions

experiments/test_entmax_triton.py
@@ -3,6 +3,27 @@
 Validates a Triton-fused entmax-1.5 forward + backward kernel against the
 pure-PyTorch closed-form implementation in train_gpt.py::entmax_1p5.
 
+ALGORITHM VERIFICATION (2026-04-30, deep-spin/entmax cross-check):
+- Forward formulation matches train_gpt.py::entmax_1p5, which is
+  mathematically equivalent to deep-spin/entmax Entmax15Function.forward
+  (the reference uses X' = X/2 and threshold tau'; we use the un-halved
+  form with `0.5*(z - tau)` inside the square; algebraically tau = 2*tau').
+- Backward formula MATCHES deep-spin/entmax Entmax15Function.backward
+  line-for-line:
+    Reference:   gppr = sqrt(Y); dX = dY*gppr;
+                 q = dX.sum(dim)/gppr.sum(dim); dX -= q * gppr
+    This kernel: s = sqrt(w); c = sum(s*grad_w)/sum(s);
+                 grad_z = s * (grad_w - c)
+  Identical (note dX.sum = sum(dY*gppr) = sum(grad_w*sqrt(w)) = sum(s*grad_w)).
+- Reference: https://github.com/deep-spin/entmax/blob/master/entmax/activations.py
+- Paper: Peters, Niculae, Martins (2019) "Sparse Sequence-to-Sequence Models"
+  https://arxiv.org/pdf/1905.05702 (Algorithm 2 + Proposition 2 backward).
+- Numerical stability: our `discr.clamp_min(1e-6)` is STRICTER than the
+  reference's `clamp(delta, 0)`; this is the iter 117 v3 NaN fix
+  (sqrt(0) backward = Inf → 0×Inf = NaN propagation; ε=1e-6 caps the
+  sqrt-gradient at 500, fixing a latent NaN bug present in the reference).
+
+
 The kernel is designed for the small-E regime (E=16 routed experts) where
 all E values fit in registers and a single program block handles one row.
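The NaN mechanism and the ε cap described in the docstring can be reproduced with plain floating-point arithmetic (a sketch, not the kernel code; variable names are illustrative):

```python
import math

# clamp(delta, 0) lets delta reach exactly 0; the derivative of sqrt at 0 is
# 1/(2*sqrt(0)) = inf, and under the chain rule even a zero downstream
# coefficient then yields 0 * inf = nan instead of the true product 0:
grad_sqrt_at_zero = math.inf           # d/d(delta) sqrt(delta) at delta == 0
poisoned = 0.0 * grad_sqrt_at_zero     # IEEE 754: 0 * inf is nan
assert math.isnan(poisoned)

# clamp_min(1e-6) keeps delta >= eps before the sqrt, so the sqrt-gradient
# is bounded by 1/(2*sqrt(eps)) = 500, and 0 * 500 = 0 as intended:
eps = 1e-6
capped_grad = 1.0 / (2.0 * math.sqrt(eps))
assert abs(capped_grad - 500.0) < 1e-6
assert 0.0 * capped_grad == 0.0
```

This is why the stricter clamp is a correctness fix rather than just a tolerance tweak: the NaN appears only when an exactly-zero discriminant meets the backward pass, which is rare enough to slip past casual testing.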

0 commit comments

Comments
 (0)