[WIP] Optimized Muon/Architecture research by @NOPIMPOSSSIBLEWHY #4
Closed
NOPIMPOSSSIBLEWHY wants to merge 1 commit into openai:main from
Conversation
keshav55 added a commit to keshav55/parameter-golf that referenced this pull request on Mar 20, 2026
Novel techniques from the top 2 leaderboard entries:
1. BigramHash (BIGRAM_BUCKETS=4096, BIGRAM_DIM=128):
   - Hash consecutive token pairs → embedding lookup → project to model_dim
   - XOR with coprime multipliers for the hash function
   - Captures local bigram context (~524K params for 4096 buckets)
   - Used by openai#1 (thwu1, 1.1428 BPB) and openai#2 (Raahil Shah, 1.1458 BPB)
2. SmearGate (SMEAR_GATE=1):
   - Learned per-dim gate blending the current token with the previous token
   - Applied after embedding normalization
   - Only ~512 params
   - Used by openai#2 and openai#4
Both are env-var controlled (0 = disabled by default). run_v7_full.sh enables everything for the full stack.
Also fixed: BigramHash/SmearGate params added to the optimizer groups.
1438 lines (62 under the 1500-line limit).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
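Editor's note: a minimal sketch of the BigramHash idea described in the commit above, assuming a nanogpt-style PyTorch setup; the bucket count, dims, and hash multipliers here are illustrative, not the submitter's actual values.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash (prev, current) token pairs into buckets, embed, project to model_dim."""
    def __init__(self, n_buckets=4096, bigram_dim=128, model_dim=512):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, bigram_dim)
        self.proj = nn.Linear(bigram_dim, model_dim, bias=False)

    def forward(self, idx):
        # idx: (B, T) token ids; pair each token with its predecessor
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # XOR of the pair under coprime multipliers (hypothetical constants)
        h = (idx * 2654435761) ^ (prev * 40503)
        return self.proj(self.emb(h % self.n_buckets))  # (B, T, model_dim)
```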
gb250e referenced this pull request in gb250e/parameter-golf on Mar 21, 2026
dhruvjatkar referenced this pull request in dhruvjatkar/parameter-golf on Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders the Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2, SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA openai#6, depth recurrence openai#7 with an int6 risk warning, AdEMAMix openai#8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds a fresh Round 1 section with 5 submitted experiments
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
deborahnelson8788726 pushed a commit to deborahnelson8788726/parameter-golf that referenced this pull request on Apr 2, 2026
- Fix openai#1: ternary roundtrip eval on ALL ranks with dist.broadcast (was: only rank 0 loaded weights → invalid eval results)
- Fix openai#2: pass pre-computed scales to export (avoids double quantization)
- Fix openai#3: keep scales as float32 (was: lossy float16 cast)
- Fix openai#4: import returns float32 (was: lossy bfloat16 cast)
- Fix openai#5: lower z_loss from 1e-4 to 1e-5 (prevents loss explosion)
- Fix openai#6: add dist.broadcast after the int8 roundtrip load too
- Fix openai#7: add weights_only=False to suppress a FutureWarning
Ternary roundtrip is now LOSSLESS (max error = 0.0). The previous val_bpb=0.9650 was an artifact of bug openai#1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
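Editor's note: a minimal sketch (ours, not the PR's code) of the Fix openai#1 pattern — after rank 0 loads the roundtripped weights, broadcast them so every DDP rank evaluates the same model. Process-group setup and `model` are assumed.

```python
import torch.distributed as dist

def sync_weights_from_rank0(model):
    # in-place copy from rank 0 to all other ranks
    for p in model.parameters():
        dist.broadcast(p.data, src=0)
    for b in model.buffers():
        dist.broadcast(b, src=0)
```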
deborahnelson8788726 pushed a commit to deborahnelson8788726/parameter-golf that referenced this pull request on Apr 2, 2026
Built on SOTA openai#1 (PR openai#1019) + Trinity ternary for MLP layers. Key change: MLP at 5x width (ternary weights are cheap) vs SOTA's 3x.
8xH100 SXM results:
- 4837 steps in 10 min (123ms/step)
- val_bpb: 1.2361 (step 2000) → 1.1611 (step 4000) → 1.1357 (step 4837)
- Beats baseline (1.2244) and the ternary submission (1.1570)
- Close to SOTA openai#4 (1.1307)
Known issue: the hybrid export pipeline (ternary MLP + int6 GPTQ attn) produces val_bpb=3.97 on roundtrip — needs debugging. The training result is valid; export/quantization needs fixing.
Trinity contributions:
- Ternary absmean quantization for the MLP (from ternary_pipeline.zig)
- Base-3 packing (5 trits/byte, from ternary_packing.zig)
- Wider MLP (5x vs 3x) enabled by ternary compression savings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
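Editor's note: a minimal sketch (assumed, not the commit's Zig pipeline) of BitNet-style absmean ternary quantization as referenced above — weights snap to {-1, 0, +1} times a per-tensor scale. The base-3 packing works because 3^5 = 243 ≤ 256, so 5 trits fit in one byte.

```python
import torch

def ternary_absmean(w: torch.Tensor):
    scale = w.abs().mean().clamp(min=1e-8)   # absmean scale
    q = (w / scale).round().clamp(-1, 1)     # trits in {-1, 0, +1}
    return q, scale                          # dequantize with q * scale
```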
HateBunnyPlzzz added a commit to Itssshikhar/parameter-golf that referenced this pull request on Apr 2, 2026
Add low-rank factored MLP layers (LowRankLinear, LowRankMLP) that decompose weight matrices as A @ B where A: (in_dim, rank) and B: (rank, out_dim). This trades per-layer MLP capacity for the ability to run more layers within the same parameter budget (e.g., 15 layers with rank-128 MLPs instead of 9 layers with full-rank MLPs).
Changes:
- Add MLP_RANK env var (default 0 = full-rank, >0 = low-rank factored)
- Add a LowRankLinear module with orthogonal init, fp32 storage, bf16 compute
- Add a LowRankMLP module using relu^2 activation with low-rank layers
- Block dispatches to LowRankMLP when MLP_RANK > 0
- GPT.forward_logits() returns logits without loss (for sliding-window eval)
- eval_val_sliding() for overlapping-window BPB evaluation
- LowRankLinear params are 2D matrices, fully Muon-compatible
- Quantization handles A/B factors automatically (per-row int8 on 2D tensors)
- Zero-init on the projection layer's B factor for residual-friendly init
- Backward compatible: MLP_RANK=0 preserves the original full-rank behavior
Suggested test: NUM_LAYERS=15 MLP_RANK=128
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
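Editor's note: a minimal sketch (assumed, not the commit's code) of the low-rank factorization it describes. The parameter count drops from in_dim·out_dim to rank·(in_dim + out_dim).

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """W ≈ A @ B with A: (in_dim, rank), B: (rank, out_dim)."""
    def __init__(self, in_dim, out_dim, rank, zero_init_b=False):
        super().__init__()
        self.A = nn.Parameter(torch.empty(in_dim, rank))   # fp32 storage
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        nn.init.orthogonal_(self.A)
        if not zero_init_b:                                # zero B = residual-friendly
            nn.init.orthogonal_(self.B)

    def forward(self, x):
        # bf16 compute, per the commit message
        return (x.to(torch.bfloat16) @ self.A.to(torch.bfloat16)) @ self.B.to(torch.bfloat16)
```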
deborahnelson8788726 pushed a commit to deborahnelson8788726/parameter-golf that referenced this pull request on Apr 4, 2026
MLP 3.25x on 8xH100 SXM, 10 min:
- 5408 steps at 111ms/step
- Training val_bpb: 1.1455
- Int6 GPTQ roundtrip: 1.1485 (standard), 1.1251 (sliding s64)
- Artifact: 15.90MB (under the 16MB limit!)
- Pruning: only 1 value (0.0%) — nearly fits without pruning
Leaderboard position: between openai#3 (1.1228) and openai#4 (1.1248).
Trinity innovation: a wider MLP (3.25x vs SOTA's 3x) from ternary parameter budget analysis. All weights int6 GPTQ.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 7, 2026
…er DEFERRED
Subagent deep-dive of arXiv:2410.05258 (Microsoft DIFF Transformer): zero comp-PR coverage, but the smallest tested model is 830M (38x ours) and the learnable lambda has known NaN failure modes that violate the "degrades gracefully" constraint. Logged alongside alternative architectures to investigate next fire (GLA, FusionNet, YOCO); explicitly chose NOT to push junk, per user instruction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 7, 2026
… confirmation), MR2 promising, PR openai#1430 MERGED at 0.39642 BPB
Subagent reports PR openai#1430 (Per-Sample SLOT + Causal Backoff N-gram Mixer + TTT) has been MERGED at a claimed 0.39642 BPB — 65% below public SOTA. If real, this fundamentally changes the competitive landscape. Audit fires openai#1-3 all flagged this PR as likely illegal under issue openai#677. Now MERGED.
NEXT RESEARCH FIRE PRIORITY: deep-dive PR openai#1430 to verify legality and extract the implementation. If real, port it. If leak-based, document it.
Patches 17 (Mousse) and 18 (MuonEq-R) confirmed as known PORTS, not novel-to-comp. They were always documented as ports in research fires openai#9 and openai#10. Patches 15/16/21 remain uncontested across 120+ open and 10 closed PRs (4 audits in a row).
Pod healthy, ~$2.30/$36 spend. MR2_seed42 = 3.3004 (better than MS2 = 3.3358), suggesting MuonEq-R may slightly beat Mousse in the L5 stack. Falsification of Patches 17 and 18 is proceeding rapidly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 7, 2026
… merged, 0.39642 confirmed
Critical correction: the previous audit fire openai#4 incorrectly reported PR openai#1430 as merged. State = open, merged_at = null, 0 LGTMs, 0 comp-owner reviews. The 0.39642 BPB score IS confirmed in the PR README (3-seed mean), but the submission is unverified.
A subagent deep code read confirms all three techniques (Per-Sample SLOT, Causal Backoff N-gram Mixer order-22, post-quant TTT) pass the strict letter of issue openai#677's four conditions (causal, score-before-update, single-pass, full-normalized). But the SPIRIT of openai#677 is borderline — 196K per-sequence params trained on the val set is essentially val-set overfitting "legally".
DO NOT PORT this fire, because:
1. PR openai#1430 has zero LGTMs and may get reverted
2. All 3 techniques are eval-time (can't validate on our cheap-GPU loop)
3. Better H100 escalation candidates are already deferred (EMA, Tilt, INT6 GPTQ)
Watch PR openai#1430 every 2 hours; if merged with comp-owner approval, port it at the next research fire. If reverted or outlawed, mark it dead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 8, 2026
The user has rejected H100 launches twice in this campaign. Removing all "runpodctl create pod H100" paths and replacing the S2 confirmation gate with cheap-pod runs at SKIP_FINAL_EVAL=0 + MAX_WALLCLOCK_SECONDS=600. Concrete changes:
- S2 confirmation gate: now runs on the SAME cheap pod that did S1
- Pod assignment table: removed the H100_spot row
- Cron schedule: C720 (H100 confirm every 12h) → C360 (cheap-pod confirm every 6h)
- C360 prompt: appends S2_<id> rows to experiments.json with SKIP_FINAL_EVAL=0 instead of spinning up a pod
- Spend ceiling: removed the "Mac+H100 confirms only" tier — Mac research only
- Risk openai#4: replaced "H100 spot price" with "cheap-pod val_bpb calibration"
- Verification: final S2 metrics are now measured on the cheap pod (G1 floor 12.5M tok/min on a 3080 Ti, scales linearly to the 8xH100 fleet)
- Day-1 checklist: removed the C720 reference, added C360
Factual mentions of "8xH100" as the OpenAI eval target are kept (that's the comp config, and we never need to reproduce it ourselves).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 8, 2026
…rld-novel candidates
C5 openai#19 results (6 new seed runs landed simultaneously):
- L04_gated_attention_seed13 = 2.2206 ⭐ NEW SESSION BEST (n=4 mean 2.230125)
- L08_normuon_seed7 = 2.2475 (n=3 mean 2.3323)
- L09_entropy_adaptive_seed13 = 2.2543 (n=3 mean 2.3441)
- L02_coprime_stride_seed7 = 2.2406 (n=4 mean 2.4191)
- L06_ln_scale_seed7 = 2.2386 (n=4 mean 2.2857)
- L07_byte_weight_seed7 = 2.2418 (n=3 mean 2.3236)
L04 is still champion. All 6 layers have converging means in the 2.23-2.42 range. 6/6 pods at 90-100% util, no alarms.
C30 openai#4 — mined 6 NEW world-novel candidates (3 L01 tokenizer, 3 L10 compression):
L01 candidates (all world-novel):
- TOK_entropy_patch_boundary_dynamic (Meta BLT entropy + sentencepiece fork, ~250 LOC)
- TOK_morphology_aware_segmentation_fine_grain (Slovak SKMT, ~180 LOC)
- TOK_adaptive_vocab_gradient_aware_training (joint train w/ Hessian, ~220 LOC)
L10 candidates (all world-novel):
- CMP_vq_learned_codebook_multilayer (RVQ + per-layer codebook + rANS, ~180 LOC)
- CMP_asymmetric_numeric_systems_neural_prior (rANS + tiny neural prior, ~150 LOC)
- CMP_tensor_train_int4_cores_mixed_precision (TT/MPO + int4 cores, ~220 LOC)
All 6 passed the 5-check audit (literature, code, comp, PhD-defensibility) and got Section C audit blocks. Per the LOC-unlimited rule, these are big patches that were previously deferred — now first-class C90 build candidates.
Total world-novel candidates queued: 23 → 29 (2 already shipped today as patches 26+27; 27 is still untested in the C90 pipeline). Spend $0 (research only). Push: TBD.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 8, 2026
…s PD8 idle-CPU gap
- runpod_tests/loop/cpu_workers.py: spawns N-2 (default cpu_count-2) workers
per pod via multiprocessing. Each worker loops: pull job from
data/cpu_jobs/pending/, dispatch by type, write result to done/. Atomic
rename for exclusivity. PID-file guard so run_forever.sh can call it on
every loop iter without fork-bombing.
- Job handlers shipped:
brotli_sweep — 0..11 brotli levels on int8.ptz files (feeds L10)
ngram_table_inspect — nnz/sparsity/mean/max on .npy ngram tables (feeds L09)
noop — smoke test
- runpod_tests/loop/cpu_jobs_emitter.py: idempotent job emitter, called once
per run_forever.sh outer loop iter; queues brotli_sweep on the most-recent
3 .ptz checkpoints + ngram_inspect on every .npy table.
- run_forever.sh preflight() launches the worker pool + emitter on first call,
guarded by PID file so re-launches are no-ops.
- data/cpu_jobs/{pending,in_progress,done} dirs gitkeep'd; queue contents
gitignored (per-pod state, not part of repo).
Smoke test (Mac): both scripts import cleanly; emitter dry-run queued 21
ngram_inspect jobs into pending/. Cleared after smoke test.
Addresses gap openai#2 from the 0648Z status report ("CPU sitting idle while GPU
trains") and PD8 directive ("max out CPU+RAM, not just GPU"). Pods have
8-16 vCPUs sitting at <10% during training; this puts them to work on
useful brotli/ngram analysis that feeds back into L09 + L10 design.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
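Editor's note: a minimal sketch (ours, not the commit's cpu_workers.py) of the atomic-rename exclusivity claim described above — os.rename within one filesystem is atomic, so exactly one worker can move a job file from pending/ to in_progress/. Paths and the JSON job format are illustrative.

```python
import os, json

def claim_job(pending_dir="data/cpu_jobs/pending",
              in_progress_dir="data/cpu_jobs/in_progress"):
    for name in sorted(os.listdir(pending_dir)):
        src = os.path.join(pending_dir, name)
        dst = os.path.join(in_progress_dir, name)
        try:
            os.rename(src, dst)   # atomic: exactly one worker wins this job
        except OSError:
            continue              # another worker claimed it first
        with open(dst) as f:
            return json.load(f)   # the claimed job spec
    return None                   # queue empty
```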
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 8, 2026
…ld-novel openai#4)
# C5 status
5 pods alive (B/C/E/F/G), all SWEEP_DONE clean. Pod D is still in a network outage.
# Patch 41: DYN_LYAPUNOV_CLIP_MARKER (world-novel L11 openai#4)
- Adaptive gradient clipping driven by Lyapunov-exponent estimation from a rolling 20-step grad_norm history
- Estimate λ₁ ≈ avg(log(g[i+1]/g[i])) over the window
- When λ₁ > threshold (default 0.05 = 5% per-step growth), tighten the clip from args.grad_clip_norm to (clip * exp(-λ₁ * 5)), bringing the trajectory back to a stable basin
- Anchor: line 1030 grad_clip_norm_ call. Default OFF = bit-exact baseline.
- World-novel: Oseledec's multiplicative ergodic theorem applied to LM training is unpublished. AdaGC/AGGC use frequency-based clipping. 0 hits in arXiv/Scholar/GitHub for "lyapunov exponent gradient clip language model".
- Stacks with all optimizer patches (NORMUON, MUONEQ_R, MOUSSE, OPT_CHEBYSHEV_NS, PER_PROJ_LR_SPLIT, WEIGHT_EMA_SWA) — the clip acts on the grad before opt.step().
- Win mechanism: -0.008 to -0.015 train_loss via stability preserving step effectiveness (no oscillatory bifurcation episodes wasting gradient signal).
- 2 test entries queued: L11_lyapclip_seed42/1337 → pod B
- EXPECTED_MARKERS now 41 in both 08_patch and gate_check.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
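Editor's note: a minimal sketch (names ours) of the clipping rule quoted above — λ₁ estimated as the mean per-step log growth rate of the grad norm over a rolling window, with the clip tightened when growth exceeds the threshold.

```python
import math
from collections import deque

class LyapunovClip:
    def __init__(self, base_clip=1.0, window=20, threshold=0.05):
        self.base_clip, self.threshold = base_clip, threshold
        self.history = deque(maxlen=window)   # rolling grad_norm history

    def clip_value(self, grad_norm: float) -> float:
        self.history.append(max(grad_norm, 1e-12))
        if len(self.history) < 2:
            return self.base_clip
        g = list(self.history)
        # λ₁ ≈ avg(log(g[i+1]/g[i])) over the window
        lam = sum(math.log(g[i + 1] / g[i]) for i in range(len(g) - 1)) / (len(g) - 1)
        if lam > self.threshold:
            return self.base_clip * math.exp(-lam * 5)  # tighten toward stability
        return self.base_clip
```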
jayzuccarelli added a commit to jayzuccarelli/parameter-golf that referenced this pull request on Apr 16, 2026
- Late QAT: QAT is now disabled from the start and enabled only when the LR scale drops below 0.15 (during warmdown). This avoids quantization noise during the main training phase.
- Partial RoPE: rotate only the first 16 of 64 head_dim dims; the remaining 48 dims are position-free. Matches PR openai#315 in the leaderboard openai#4 entry.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
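Editor's note: a minimal sketch (assumed, not the commit's code) of partial RoPE as described above — the rotation is applied to the first `rot_dims` of head_dim and the rest passes through unrotated.

```python
import torch

def partial_rope(x, cos, sin, rot_dims=16):
    # x: (B, H, T, head_dim); cos/sin: (T, rot_dims // 2) precomputed tables
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot.chunk(2, dim=-1)
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)  # last 48 dims untouched
```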
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 17, 2026
1. memmap zero-copy: load_data_shard now returns a torch view of the
memmap (no np.array copy). Multiple DDP ranks share OS page cache.
2. FP32 eval accumulators: unrolled DEQ solver uses FP32 instead of FP64
at eval time. FP64 only needed inside RevDEQFunction for reversibility.
3. K-sweep extended to K=128: {4,8,16,32,64,128} with fast eval.
4. Fixed stale test: test_gate_init_defaults expected gg_gate.bias=0.0,
now expects 1.5 (matching current init).
5. Graceful compile fallback: torch.compile on NS functions and
shared_block wrapped in try/except for version robustness.
Deferred (valid but larger scope):
- openai#2 (wall-clock budget enforcement including compile+eval time)
- openai#4 (final expert health assertion after int6 roundtrip)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
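Editor's note: a minimal sketch (ours, not the PR's code) of the zero-copy shard load from item 1. torch.from_numpy wraps the memmap without copying, so DDP ranks share the OS page cache. The path and dtype are illustrative; a read-only mapping triggers a harmless non-writable warning from PyTorch.

```python
import numpy as np
import torch

def load_data_shard(path="data/shard_000.bin"):
    arr = np.memmap(path, dtype=np.int32, mode="r")
    return torch.from_numpy(np.asarray(arr))  # view of the memmap, no np.array copy
```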
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 17, 2026
Phase 5e-1 position sweep openai#4. Mirror image of 27c (which injected router-only and regressed catastrophically: val_bpb 2.09, K=128 Δ=3.31). 27c's failure: experts got zero x0 signal → the FP became x0-independent past K=16.
27d: routers read clean z_in; experts receive z_in + g_inj·x0. Hypothesis: experts need x0 for feature extraction to preserve the FP's x0-dependence; routing decisions should be state-only.
Block.forward:
    x = z_in                          # router input: clean
    x_expert_in = z_in + g_inj·x0     # expert input: injected
    x_attn_router = attn_norm(x)
    w_attn = attn_router(x_attn_router, pre_normed=True)
    x_attn = attn_norm(x_expert_in)
    y_shared = attn._attn_shared_from_normed(x_attn)
    attn_mix = mix_experts_from_shared(y_shared, w_attn, inj_term=None)
    # same split for mlp
    raw_out = 0.5 * z2
Baseline: iter 27b-pos-expert-out (9edc6af, val_bpb=1.902, K=128 Δ=0.039). KEEP if val_bpb ≤ 1.917 AND K=128 Δ ≤ 0.5. Smoke: loss 7.33 → 4.35 (delta -2.98, healthy).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 17, 2026
openai#4 FREEZE router_gate — unconstrained W_g made the router's Lipschitz constant unbounded, invalidating L_w in the τ_max = c/L_G formula. Fix: weight=0, bias=5.0, requires_grad=False.
openai#5 SpectralNormCap on ALL attention shared linear maps — c_q, c_kv_down, c_k_nope, c_v, c_k_rope were unconstrained CastedLinear. Doc §6.1 requires ‖W‖₂ ≤ 1 for the L_attn bound to hold.
openai#6 BOUND q_gain via sigmoid reparameterization — q_gain scales q_full, expanding the effective query radius R_q = q_gain_max · R. Unbounded q_gain makes L_attn = 1 + 4γ(R_q + R)² unbounded. Fix: q_gain = q_gain_max · sigmoid(raw).
openai#8 RECOMPUTE L_G from actual bounds — L_G is now computed from named terms: L_attn (attention with q_gain), L_ffn (gated MLP, conservative), L_w_l1 (router with √E norm conversion), B_tok (per-token output). All components are logged.
Honest τ_max: with q_gain_max=3.0, R=√d_head, γ=0.051: L_attn=314, L_G=354, τ_max=0.0027. Very small but RIGOROUSLY guaranteed: Lip(T_x) ≤ 0.95 < 1.
38/38 tests pass. Smoke deferred (GPUs busy with iter 39b take 3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
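Editor's note: a minimal sketch (names ours) of the openai#6 bounding trick above — an unconstrained raw parameter mapped through a sigmoid so the effective gain stays in (0, q_gain_max) no matter what the optimizer does.

```python
import torch
import torch.nn as nn

class BoundedGain(nn.Module):
    def __init__(self, q_gain_max=3.0, init=1.0):
        super().__init__()
        # invert the sigmoid so training starts exactly at `init`
        p = torch.tensor(init / q_gain_max)
        self.raw = nn.Parameter(torch.log(p / (1 - p)))
        self.q_gain_max = q_gain_max

    def forward(self):
        return self.q_gain_max * torch.sigmoid(self.raw)  # always in (0, q_gain_max)
```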
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 17, 2026
…hogonal parameterization
Replace all PerExpertSpectralNormCap (σ_max ≤ 1) and SpectralNormCap with OrthogonalParametrization using Newton-Schulz iteration (20 iters). All expert banks (5) and the injection U now have ALL singular values = 1 (exact isometry), not just σ_max ≤ 1.
Key design:
- Stateless: no buffers, inherently RevDEQ-compatible (Permanent Protocol openai#4)
- Deterministic: power-iteration scaling uses ones() init (not randn)
- Cached: a data_ptr()-based cache avoids redundant NS within a DEQ solve
- σ_max scaling: 3 power iterations estimate σ_max for safe NS convergence
- refresh_spectral_norms() invalidates the cache after optimizer.step()
Smoke test: loss 7.3 → 2.9, recon_err=2.3e-05, convergence=1.0. Base: iter 35 (6737bcb, val_bpb=1.9197).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
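Editor's note: a minimal sketch (assumed, not the PR's OrthogonalParametrization) of the cubic Newton-Schulz iteration X ← 1.5·X − 0.5·X·XᵀX, which drives every singular value toward 1 provided the input is pre-scaled so σ_max < √3. The PR pre-scales with 3 power iterations; Frobenius-norm scaling here is a cruder stand-in.

```python
import torch

def newton_schulz_orthogonalize(w: torch.Tensor, iters: int = 20) -> torch.Tensor:
    x = w / (w.norm() + 1e-8)        # puts all singular values in (0, 1]
    for _ in range(iters):
        x = 1.5 * x - 0.5 * x @ x.mT @ x
    return x                         # x @ x.mT ≈ I for wide/square w
```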
johnlennyt5 added a commit to johnlennyt5/parameter-golf that referenced this pull request on Apr 22, 2026
CRITICAL ISSUE: BigramHash-guided quantization caused a +0.264 BPB degradation
- Mixed int5/int6/int7 allocation was worse than uniform int6
- The BigramHash sensitivity calculation was theoretically unsound
- Treated the global hash table as layer-specific (nonsense)
FIX: Revert to the baseline's proven uniform int6 quantization
- gptq_mixed_precision = False (uniform int6 for all weights)
- Remove Innovation openai#4 (BigramHash quant) — it doesn't work
- Keep Innovations openai#1, openai#2, openai#3, openai#5
Expected result: minimal quantization degradation (+0.005 to +0.01 BPB). SP8192 SOTA achieves 1.0855 BPB with uniform int6 (35.9M params).
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request on Apr 28, 2026
Synthesis of (a) a deep records-folder pass, (b) the modded-nanogpt record openai#80 gold standard, and (c) FP8 / CUDA Graphs / distillation literature. Key findings:
1. The leaderboard has converged on gradient-quality + quantization tricks while leaving raw throughput largely unexplored. Modded-nanogpt has absorbed multiple compute-maxing techniques that haven't crossed into PG.
2. NEVER-TRIED on the leaderboard (open territory):
   - CUDA Graphs (record openai#80 of modded-nanogpt uses them heavily)
   - Multiple parallel training rounds in unused VRAM
   - Multiple EMAs / Polyak averaging
   - Distillation initialization
   - Larger GPTQ calibration set (>64 batches)
   - Sequence-length warmup
3. Top-8 ranked actionable items (CUDA Graphs #1, batch-size sweep #2, FP8 head openai#3, multi-EMA openai#4), with cost estimates and confidence per item.
4. Modded-nanogpt techniques NOT in our SOTA: FP8 head + asymmetric rescale, fused softcapped CE, Cautious Weight Decay, "Adam every other step", paired-head Q/K orthogonalization, attention window warmup, MTP.
5. TRIED-AND-DROPPED on PG (don't waste compute): seq_len=4096, parallel residual MLP-skip, 3-loop mini-recurrence, ternary, YaRN, NeoMuon, hash embeddings, etc. Verbatim quotes from the records folder for each.
6. FP8 honest analysis: a 1.6x typical training speedup (not 3x), with documented loss-spike instability. FP8 only on lm_head + tok_emb is the right initial bet (small surface, well-conditioned matmuls).
Decision rules tied to the Phase 3 outcome:
- Phase 2 mean > 1.0780: prioritize the throughput stack (CUDA Graphs + batch sweep + FP8 head) plus Newton-Muon as a gradient-quality lever.
- Phase 2 mean 1.0760-1.0780: just CUDA Graphs + LR follow-on + Newton-Muon.
- Phase 2 mean clears 1.0760: ship; none of this matters this cycle.
Still-research items: torch.compile(mode='reduce-overhead'), MTP re-test, qTTT paper body, Cautious WD diff from modded-nanogpt. None spend GPU.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Meirzhan05 added a commit to Meirzhan05/parameter-golf that referenced this pull request on Apr 28, 2026
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 29, 2026
…stic gates + AdaSplash kernel wrap
Applies the principled remediation for the remaining recompile vectors found in profile_v2 (post commit 7343c06):
Fix openai#3 — _ROUTER_DIAGNOSTICS_ACTIVE flag-flip recompiles. The flag toggles at every diagnostic emission (~every 10 train steps). dynamo guards on the global inside the compiled forward → each toggle invalidates the cache slot. Solution: hoist each gated diagnostic block into a `@dynamo_disable`-decorated helper method on the owning class. dynamo treats the call as an opaque op (one fixed graph break, no internal-state guards). Sites refactored:
- `_should_diag` (L111) — function itself decorated
- `SoftDenseRouter._maybe_record_diag` (L1548-1568 inline → helper at L1657)
- `CausalSelfAttention._capture_attn_out_ortho` (L1889 inline → helper)
- `MLP._capture_mlp_out_ortho` (L1985 inline → helper)
- `MoSLowRankOutputHead._capture_mos_diagnostics` (L2300-2332 inline → helper)
The `is_master` computation (dist.get_rank()) is moved inside the MoS helper so the dist call is also out of the compiled mos_head.forward graph.
Fix openai#4 — iter 104 AdaSplash kernel SIGABRT under compile+DDP+RevDEQ. The Triton kernel + dynamo's stream/context management + DDP all-reduce + RevDEQ custom autograd interact at the C level, producing signal 6 (no Python traceback) when α first exceeded 1.0 post-warmup. Solution: factor the kernel call into `_adasplash_kernel_call`, decorated with `@dynamo_disable`. The outer dispatch function keeps the alpha <= 1.0 dense fallback compile-traceable; only the Triton invocation runs in pure eager. Adds a head_dim ∈ {16,32,64,128,256} guard with dense-SDPA fallback (the kernel asserts on other dims, e.g. 96 = 768/8).
Result (verification profile v3, 15 iters, dev 2× L40S):
    step_avg 28.7s → 23.45s = -5.25s (-18.3%)
    recompile 6+ → 4 (residual is a list-length guard from K-jitter, not the flag)
    graph_break N → 0
    smoke_test PASS (loss 7.04 → 4.67 over 300 steps)
Cumulative throughput vs original: 28.7s → 23.45s (1000-step training: 8.0h → 6.5h, ~1.5h saved). The iter 104 path is now SAFE: the AdaSplash kernel runs without SIGABRT under compile+DDP+RevDEQ, and the head_dim guard prevents a kernel assert at d_head=96. To actually fire AdaSplash, num_heads must be set so head_dim ∈ {16,32,64,128,256}.
Also adds a GPU-preflight rule (per user directive 2026-04-28): future sessions must `pgrep` + `nvidia-smi` before any GPU launch — silent contention masked prior speedup measurements. CLAUDE.md §0 + memory MEMORY.md indexed.
Residual recompile vector (deferred to Fix #3b, task openai#107): list-length guards on `self._mlp_expert_weights_per_iter` at L2509 from K-jitter × incremental append inside the compiled K-loop. Predicted incremental: 23.45s → 18-20s (~10-15% additional) via a dynamo_disabled append helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
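Editor's note: a minimal sketch (names ours, not the PR's code) of the hoist-into-a-disabled-helper pattern from Fix openai#3 — the diagnostic runs in eager behind `torch.compiler.disable` (the public equivalent of the `@dynamo_disable` alias above), so flipping the global flag never guards or invalidates the compiled forward graph.

```python
import torch
import torch.nn as nn

_ROUTER_DIAGNOSTICS_ACTIVE = False  # toggled by a profiling hook elsewhere

class SoftDenseRouter(nn.Module):
    def __init__(self, dim, n_experts):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)

    @torch.compiler.disable  # dynamo sees one opaque call, no internal guards
    def _maybe_record_diag(self, weights):
        if _ROUTER_DIAGNOSTICS_ACTIVE:
            p = weights.clamp_min(1e-9)
            self.last_entropy = -(p * p.log()).sum(-1).mean().item()

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)
        self._maybe_record_diag(weights)  # opaque to the compiled graph
        return weights
```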
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 29, 2026
…ate-call appends
Profile_v3 (post commit d7996da) at 23.45s/step still hit recompile_limit (16) four times. Reason: the `len(self._mlp_expert_weights_per_iter) == N` guard at the inline `if self._diag_track_enabled: ... .append()` blocks inside the compiled Block.forward (L2499-2510 and L2566-2587). With K-jitter (8,12,20) and incremental list growth per iter (lengths 0,1,2,...,2K), dynamo sees up to 2*8 + 2*12 + 2*20 = 80 unique list-length values across the K-loop — even with `recompile_limit=16`, the cache thrashes.
Fix: extract both diag-track blocks into `@dynamo_disable`-decorated helper methods on `Block`:
- `_maybe_track_expert_weights(w_attn, w_mlp)` — per-iter expert weights
- `_maybe_track_gate_calls()` — attn/router gate stats (reads `_router_gate_last_mean` from the router instead of capturing across blocks)
dynamo treats each helper call as opaque (one fixed graph break per forward, no list-length guards on internal state). Same pattern as the diag helpers landed in d7996da.
Result (verification profile v4, 15 iters, dev 2× L40S):
    step_avg 23.45s → 20.79s = -2.66s (-11.3% incremental, -27.6% cumulative vs the 28.7s pre-fix baseline)
    recompile 4 → 2 (residual: `GLOBAL_STATE changed: grad_mode` — the train↔eval toggle, fundamentally unavoidable since the eval forward must use no_grad)
    graph_break 0
    smoke_test PASS (loss 7.03 → 4.48 over 300 steps)
Cumulative throughput trajectory (1000-step training on dev):
    baseline (commit pre-7343c06): 28.7s/step → 8.0h training
    +Fix#1+openai#2+openai#4 (commit 7343c06): 27.4s → 7.6h
    +Fix#3+openai#4 (commit d7996da): 23.45s → 6.5h
    +Fix#3b (this commit): 20.79s → 5.8h ← NOW
    total saved: ~2.2h per 1000-step training run
The Tier 1 throughput foundation is COMPLETE. The 2 remaining recompile triggers are an unavoidable engineering reality (grad_mode toggles on every eval); the cost is bounded (~0.5s/step amortized) and not worth further refactoring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 29, 2026
Two-value K-jitter set with both values above iter 87's old max (16). Trades training-step throughput (~21-29s/step depending on K) for deeper-FP quality at training time. The narrowed set keeps compile-cache pressure low (2 Ks × 3 graph types × 2 grad modes = 12 slots, comfortably under recompile_limit=16). Eval still uses deq_k_eval=16 (matches the K-jitter min).
Verification profile v6 (15 iters, dev 2× L40S):
    step_avg 20.79s → 25.4s (deeper Ks; FP-quality-driven, not throughput)
    recompile 2 → 0 ★ (the residual grad_mode toggle no longer overflows the cache)
    graph_break 0
    smoke_test PASS (loss 7.03 → 4.63 over 300 steps)
    K-cycle cleanly alternates K=16 (~21s) ↔ K=24 (~29s)
Risk: a small val_bpb cost from dropping K=8 (the shallow regime). Principled rescue if observed: add K=8 back as deq_k_jitter_set=(8,16,24) — the cache budget allows up to 3 values now that all guard axes are stabilized.
Cumulative throughput trajectory (1000-step training on dev):
    baseline 28.7s/step → 8.0h
    +Fix#1+openai#2+openai#4 27.4s → 7.6h
    +Fix#3+openai#4 23.45s → 6.5h
    +Fix#3b 20.79s → 5.8h
    +Fix#5a (this) ~25.4s → 7.1h ← deeper Ks, +22% step cost vs Fix#3b
    net vs baseline: ~12% step-time reduction; 0 recompile_limit hits
Tier 1 throughput foundation status: COMPLETE. The 0 recompile_limit hits post-Fix#5a confirm the cache pressure model (12 slots ≤ 16 limit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 29, 2026
… + post-mortem)
User directive 2026-04-28: disable K-jitter and fix K=16 to isolate the v8 OOM root cause in a controlled way and verify the O(1) memory claim of RevDEQ.
Verification profile v9 (15 iters, K=16 fixed, dev 2× L40S):
    step_avg ~21.5s/step steady-state (best uniform throughput yet)
    recompile 2 hits (residual grad_mode toggle, same as v6, bounded)
    graph_break 0
    peak_vram_mb 35,697 MiB ≈ 35.7 GiB — WELL under the 44 GiB cap
    smoke PASS (loss 7.00 → 4.28 over 300 steps)
    OOM NONE at training (the 4 K-sweep probe OOMs are pre-existing task openai#102)
ROOT CAUSE of the v8 OOM (post-mortem, confirmed by v9): RevDEQ.forward (L2674-2719) IS correctly O(1) in K_fwd:
- The K-loop runs under `with torch.no_grad():` — no autograd activations
- `ctx.save_for_backward(x0, y_state, z_state, z_prev)` saves only 4 state tensors of size B×T×D, regardless of K_fwd
RevDEQ.backward (L2722-2858) iterates `K_bwd = min(K_fwd, bptt_k) = 2` times (TBPTT pinned at 2 by Fix openai#4):
- Each iter does ONE block.forward call with autograd, releasing activations after `torch.autograd.grad`
- Peak memory ≈ 1-2 block.forward activations, K-INDEPENDENT
v6 (K=(16,24) jitter): peak_vram fit comfortably (~36 GiB).
v8 (same K=24 + Option G2): OOM at step 1 backward. Cause: G2's `torch._dynamo.config.disable=True` toggle around val disrupted dynamo's compile-cache state across val→train transitions. Step 1 had to RE-COMPILE block.forward fresh, allocating ~3 GiB of workspace ON TOP of the still-resident prior compile state → a transient peak above 44 GiB. K=24's slightly-larger seq-internal tensors pushed it over the edge; K=16 happened to fit.
v9 (K=16 fixed, no G2): peak 35.7 GiB, runs clean → confirms the theory.
Cumulative trajectory (1000-step training on dev):
    baseline 28.7s/step → 8.0h
    +Fix#1+openai#2+openai#4 27.4s → 7.6h
    +Fix#3+openai#4 23.45s → 6.5h
    +Fix#3b 20.79s → 5.8h (K=(8,12,20) jitter, fast K=8 dominated)
    +Fix#5a K=(16,24) 25.4s → 7.1h (deeper Ks)
    +Option A FAILED, reverted (dynamic=True + DDP crash)
    +Option G2 FAILED, reverted (config toggle → OOM regression)
    +Fix#5b K=16 fixed ~21.5s → 6.0h ← NOW (uniform, deterministic, O(1) verified)
Tradeoff vs Fix#3b's 20.79s blend: K=16 is uniform but slightly slower than the K=8 in Fix#3b's mix. Wins: deterministic step times, a single cache slot per graph type, O(1) memory empirically verified, no train↔eval K-distribution shift (deq_k_eval=16 matches deq_k_max=16). Loses: H12 K-jitter granularity. If the H12 jitter benefit is needed for val_bpb, re-enable it as deq_k_jitter_set=(8, 16, 24) — the cache budget allows up to 3 values now that all other guard axes are stabilized post-Fix#3+#3b.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
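Editor's note: a simplified stand-in (ours, not the PR's RevDEQ class) illustrating the O(1)-in-K pattern the post-mortem verifies: the forward fixed-point loop stores no activations, and backward replays only a bounded K_bwd block calls, so peak memory is independent of K_fwd. The gradient is a truncated Neumann-series estimate, not the exact implicit gradient.

```python
import torch

class FixedPointO1(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x0, block, k_fwd=16, k_bwd=2):
        z = torch.zeros_like(x0)
        with torch.no_grad():              # K-loop: no autograd activations
            for _ in range(k_fwd):
                z = block(z + x0)
        ctx.save_for_backward(x0, z)       # state tensors only — K_fwd-independent
        ctx.block, ctx.k_bwd = block, k_bwd
        return z

    @staticmethod
    def backward(ctx, grad_out):
        x0, z = ctx.saved_tensors
        grad_z, grad_x0 = grad_out, torch.zeros_like(x0)
        for _ in range(ctx.k_bwd):         # truncated BPTT through the fixed point
            with torch.enable_grad():
                zi = z.detach().requires_grad_(True)
                out = ctx.block(zi + x0)   # ONE block forward with autograd
            grad_z = torch.autograd.grad(out, zi, grad_outputs=grad_z)[0]
            grad_x0 = grad_x0 + grad_z     # activations freed after grad()
        return grad_x0, None, None, None
```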
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 29, 2026
… sites
Profile_v12 identified ~31% of CUDA time spent in 6+ distinct triton_per_fused__to_copy_add_mean_mul_pow_rsqrt kernel variants — the inline `x.pow(2).mean(-1).add(eps).rsqrt() * x * weight` pattern fused with surrounding ops differently per call site. Each variant compiled separately; none was particularly well-tuned.
Replaces all 5 manual RMSNorm sites with `F.rms_norm` (PyTorch 2.5+ fused kernel, a single well-tuned implementation):
- CausalSelfAttention._rms_scale (L1828) — used 7 places per forward
- CausalSelfAttention.forward_experts kv_flat normalize (L1878)
- MLP.mix_experts h normalize (L2061)
- MoSHead u_all normalize after FSQ (L2253)
- Block.forward x0 Parcae injection (L2558)
Per-expert weights of shape (E, D) don't fit F.rms_norm's `weight` arg (it must broadcast to `normalized_shape=(D,)`), so each such site applies F.rms_norm WITHOUT a weight and then multiplies the per-expert weight as a separate elementwise op. This still uses the single fused norm kernel for the expensive reduction; the elementwise mul is cheap.
Verification (profile_v13, 15 iters, dev 2× L40S):
    step_avg 21.5s → 18.5s (steady-state, -14%)
    CUDA total 166.6s → 142.7s (-14.4%, 8 active steps)
    recompile 0
    graph_break 0
    smoke PASS (loss 7.01 → 4.48 over 300 steps)
Top remaining bottlenecks (post-Fix openai#1):
- aten::bmm 18.73% — per-expert linears, compute-bound
- flash_fwd_kernel 15.56% — FlashAttention forward (Fix openai#2 candidate: AdaSplash)
- new fused F.rms_norm variants 3.40% + 2.64% + 2.01% + 2.01% ≈ 10% (down from 20-30%)
Cumulative trajectory (1000-step training on dev):
    baseline 28.7s/step → 8.0h
    Fix#1+openai#2+openai#4 (7343c06) 27.4s → 7.6h
    Fix#3+openai#4 (d7996da) 23.45s → 6.5h
    Fix#3b (477c13d) 20.79s → 5.8h
    Fix#5a/#5b (cc0329b) ~21.5s → 6.0h (K=16 fixed, O(1) verified)
    +Fix#1 (this commit) ~18.5s → 5.1h ← NOW
Total cumulative speedup vs baseline: 28.7 → 18.5 = -36%.
Also includes a profile_train.py PROFILE_SKIP_KSWEEP harness improvement + a train_gpt.py post-train env-var exit hook (L4694) so future profile runs skip the OOM-prone Hutchinson/Lipschitz K-sweep probes (task openai#102).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
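Editor's note: a minimal sketch (shapes ours) of the per-expert rewrite described above — F.rms_norm handles the fused reduction, and the (E, D) per-expert weight, which cannot broadcast to `normalized_shape=(D,)`, becomes a separate elementwise multiply.

```python
import torch
import torch.nn.functional as F

def per_expert_rms_norm(h, expert_weight, eps=1e-6):
    # h: (E, B, T, D) per-expert activations; expert_weight: (E, D)
    h = F.rms_norm(h, (h.size(-1),), eps=eps)    # single fused kernel, no weight
    return h * expert_weight[:, None, None, :]   # cheap elementwise mul
```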
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on Apr 29, 2026
…cleanup-only
Fix openai#2 (AdaSplash α-entmax + num_heads=12, num_kv_heads=6, attn_alpha_target=1.5) is NOT VIABLE — SIGABRT at step 1 under compile+DDP+RevDEQ, even with the `_adasplash_kernel_call` `@dynamo_disable` wrapper from commit d7996da. The dynamo-disable makes the call opaque to dynamo's TRACER but does NOT isolate Triton's CUDA stream/context from DDP's NCCL streams or RevDEQ's autograd backward replay.
Path forward (deferred):
- Register AdaSplash via torch.library.custom_op with FakeTensor + Meta backend so DDP/AOTAutograd treat it as a first-class op (more invasive).
- OR validate AdaSplash standalone in single-GPU + non-RevDEQ before introducing it back into the full pipeline (smaller blast radius).
Reverted: num_heads 12 → 8, num_kv_heads 6 → 4, attn_alpha_target 1.5 → 1.0, attn_alpha_warmup_delay_frac 0.0 → 0.3 (the 0.0 was a profile-only override to make AdaSplash fire from step 1 — not committed for production training).
Fix openai#3 (drop 2 redundant `.to(dtype=z_in.dtype)` casts in Block.forward) IS LANDED but THROUGHPUT-NEUTRAL — verification profile v15 measured CUDA total 142.69s (Fix openai#1 baseline) → 143.14s (Fix openai#1+openai#3) = +0.3% (noise). The casts were no-ops (input dtypes already matched z_in.dtype via upstream RMSNorm/F.rms_norm dtype preservation); Inductor had already optimized them away. Removing them is honest code hygiene but produces no measurable throughput gain. Kept for code clarity.
Cumulative trajectory unchanged from the Fix openai#1 landing:
    baseline 28.7s/step → 8.0h
    Fix#1+openai#2+openai#4 (7343c06) 27.4s → 7.6h
    Fix#3+openai#4 (d7996da) 23.45s → 6.5h
    Fix#3b (477c13d) 20.79s → 5.8h
    Fix#5a/#5b (cc0329b) ~21.5s → 6.0h (K=16 fixed, O(1) verified)
    Fix#1 (68b0983) ~18.5s → 5.1h ← cumulative -36% vs baseline
Top-3 fix attempt summary (per profile_v12 ROI ranking):
    Fix openai#1 (RMSNorm → F.rms_norm) ✓ LANDED (-14% steady-state)
    Fix openai#2 (AdaSplash α-entmax) ✗ NOT VIABLE (kernel SIGABRT)
    Fix openai#3 (drop no-op casts) ✓ LANDED (throughput-neutral cleanup)
Net: Fix openai#1 captured the highest-ROI bottleneck. Fix openai#2 is deferred until torch.library.custom_op registration is built. Fix openai#3 is now-honest code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gHashTag pushed a commit to gHashTag/parameter-golf that referenced this pull request on Apr 30, 2026
- Cargo workspace with 3 crates + bin/tri-railway
- trios-railway-core: ProjectId/EnvironmentId/ServiceId/DeployId newtypes, RailwayHash::seal (R7 audit triplet), Client over Railway GraphQL v2
- trios-railway-audit: DriftCode D1..D7 + DriftEvent + verdict (Gate-2 PASS criterion), idempotent Neon DDL (railway_projects, railway_services, railway_audit_runs, railway_audit_events, v_railway_drift_open)
- trios-railway-experience: append-only L7 writer to .trinity/experience/<YYYYMMDD>.trinity (L21-safe, no truncation)
- bin/tri-railway: clap CLI with 'version', 'audit migrate-sql', 'experience append' (mutating verbs deferred to issues openai#4..openai#9)
- LICENSE Apache-2.0, README, AGENTS.md, TASK.md per crate
- CI: fmt --check + clippy -D warnings + build + test
- Neon DDL applied to neondb (5 objects verified)
- 16 unit tests passing (6 audit + 8 core + 2 experience)
- ascii-only sources, R1 (no .sh / no Python in scripts/)
Anchor: phi^2 + phi^-2 = 3. Closes #1
Agent: GENERAL
gHashTag pushed a commit to gHashTag/parameter-golf that referenced this pull request on Apr 30, 2026
…penai#3 openai#4 openai#5)
- AuthMode enum: auto-detects a UUID-shaped Project-Access-Token
- queries: project_view, recent_deployments, service_variables, latest_deploy_id
- mutations: service_create, service_instance_set_image, variable_upsert, service_redeploy, service_delete
- bin/tri-railway: 'service list', 'service deploy', 'service redeploy', 'service delete' verbs (R7 audit triplet appended on deploy)
- Live-tested against the Railway IGLA project (verified service list + redeploy on seed-43 SUCCESS digest e53ade00)
- Fixed clippy items_after_test_module + needless_lifetimes
- 16 unit tests still green; build green
Closes openai#3 Closes openai#4 Closes openai#5
Refs: L-T5 (trainer-igla-sot), Gate-2 deadline 2026-04-30T23:59Z
Agent: GENERAL
gHashTag pushed a commit to gHashTag/parameter-golf that referenced this pull request on Apr 30, 2026
…time
The workspace Cargo.toml still referenced openssl/postgres-openssl while crates/trios-railway-audit already used rustls via workspace=true — causing Railway's cargo build --locked to fail at dependency resolution.
- Replace openssl/postgres-openssl with tokio-postgres-rustls/rustls/webpki-roots
- Remove libssl3 from the Dockerfile.mcp runtime (pure-Rust TLS, no system lib needed)
Closes openai#4
Agent: GENERAL
gHashTag pushed a commit to gHashTag/parameter-golf that referenced this pull request on Apr 30, 2026
…tartCommand
The previous Dockerfile used USER trios with a /usr/sbin/nologin shell and ENTRYPOINT. Railway's startCommand override may have conflicted with the ENTRYPOINT, or the nologin shell may have prevented proper process execution. Changes:
- Remove USER trios (run as root, standard for Railway services)
- Remove the Docker HEALTHCHECK (curl is not in the slim image; Railway uses its own)
- Use CMD instead of ENTRYPOINT (allows Railway's startCommand override)
- Remove startCommand from railway.json (let the Dockerfile CMD handle it)
- Add healthcheckTimeout: 300 to railway.json
Closes openai#4
Agent: GENERAL
gHashTag pushed a commit to gHashTag/parameter-golf that referenced this pull request on Apr 30, 2026
The build stage uses rust:1.91-slim (Debian Trixie, GLIBC 2.39) but the runtime used debian:bookworm-slim (GLIBC 2.36), so the binary crashes on boot with 'GLIBC_2.39 not found'. Fix: use debian:trixie-slim for the runtime to match the build stage's GLIBC version.
Wire evidence: trios-train-gate2-ONE-acc1-s1597-gf16 status=CRASHED within 2 minutes of deploy, confirming the GLIBC mismatch as the root cause.
Closes openai#4
Agent: GENERAL
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on May 3, 2026
… coef sweeps
Per user directive 2026-04-30 (feedback_throughput_priority.md): throughput-bearing iters (Triton kernels, sparse dispatch, sparse attention) take queue priority over coef-sweep follow-ups for iter 112's Gram penalty. Throughput compounds research velocity — a faster step rate means more iters per unit time.
Tier 1 reordered:
    openai#1 (DROPPED) iter 113
    openai#2 iter 112 — IN FLIGHT
    openai#3 iter 117b-2 — Triton entmax (THROUGHPUT)
    openai#4 iter 117b-3 — Sparse MoE dispatch (THROUGHPUT, biggest win)
    openai#5 iter 117b-3b — Sparse-Q attention (THROUGHPUT, promoted from Tier 2)
    openai#6 iter 120 — RRAttention (THROUGHPUT, promoted from Tier 2)
    openai#7 iter 108 — k_eval=10 throughput
    openai#8 iter 110 — refinement re-enable (last)
Deferred coef sweeps (post-throughput): iters 112b/c/d. These remain conditional on iter 112's promotion AND will only run after the throughput iters are exhausted.
Anti-pattern explicitly avoided: chasing diminishing val_bpb gains via hyperparameter tuning while a 1.5-4x wallclock improvement sits unmerged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on May 3, 2026
…2048
Per user challenge 2026-04-30: "Would RRAttention hurt throughput as the optimized SDPA is replaced?" — answered yes. RRAttention is in the same SDPA-replacement class as iter 106 NSA, which was DROPPED 2026-04-29 because NSA = 0.42x FlashAttention at T=2048 (official fla-org Triton benchmark). At T=2048 we're in the FA-fusion + tensorcore-saturation regime; manual sparse attention loses on memory traffic, kernel launch overhead, and tensorcore utilization simultaneously. The component file's "8/8 PASS, tau=1.0 bit-identical" claim is a correctness check, NOT a throughput check. A pure-PyTorch component cannot compete with F.scaled_dot_product_attention at T=2048.
Re-queue paths:
- flex_attention (PyTorch 2.5+) with score_mod/block_mask
- a custom Triton kernel with selection inside the FA tile
- defer until the T-scaling phase (T=4096+)
Tier 1 reordered:
    openai#1 (DROPPED) iter 113
    openai#2 iter 112 — IN FLIGHT
    openai#3 iter 117b-2 — Triton entmax (kernel-only, doesn't replace SDPA)
    openai#4 iter 117b-3 — Sparse MoE dispatch (replaces the MLP path, not SDPA)
    openai#5 iter 117b-3b — Sparse-Q attention (smaller-Q gather; the SDPA call is preserved)
    openai#6 iter 108 — k_eval=10 (one-line config)
    openai#7 iter 110 — refinement re-enable
DEMOTED to Deferred: iter 120 (RRAttention).
New durable rule: feedback_sdpa_replacement_at_T2048.md — never queue sparse-attention iters that REPLACE F.scaled_dot_product_attention at T=2048 without a fused implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request on May 3, 2026
Problem-driven audit of the records/track_10min_16mb/ 2026-04-27 SOTA stack (val_bpb=1.0611). For each records feature, identify the underlying problem and check whether iter 117 v5 has it. Items where our model addresses the problem differently (RevDEQ, Parcae, soft-MoE) are NOT queued.
H91 — Phased TTT (eval-only LoRA per-doc adapter). LARGEST single gain in the records corpus: -0.05 to -0.10 BPB. We have zero TTT infrastructure. Tier 4 priority openai#2.
H92/H93 — Logit softcap (Gemma2-style). Trivial 1-liner. -0.005 to -0.015 BPB. Tier 4 priority openai#1 (cheapest first).
H94 — GPTQ + LQER int4-rank4 quantization. Replaces our per-row int6 with Hessian-aware quant + a low-rank correction on the worst-3 tensors. Affects val_bpb_int6 (the promotion gate) directly. -0.02 to -0.04 BPB on int6. Tier 4 priority openai#3.
H95 — SP1024 → SP8192 + CaseOps. Tokenizer upgrade + lossless case preprocessor. -0.02 to -0.04 BPB. CONDITIONAL on H96 (artifact budget). Tier 4 priority openai#5.
H96 — Per-group lrzip+brotli compression. Frees ~280 KB of artifact, 0 BPB direct. Enables H95. Tier 4 priority openai#4.
H97 — attn-gate int8-per-row quant. Marginal artifact win.
H98 — Sparse attention head-output gate (window=12). -0.005 to -0.015 BPB. Composes with our gated-attn structure.
H99 — SmearGate (BOS-fixed). A position-mixing memory channel orthogonal to DEQ temporal mixing. -0.005 to -0.015 BPB.
NOT queued (architecturally subsumed):
- U-Net skips (RevDEQ shared-block + x_0 injection)
- Depth recurrence (the RevDEQ FP iteration IS this)
- Parallel decoder (soft-dense MoE has E=16 parallel paths)
- LN scale 1/sqrt(layer+1) (Parcae per-dim A_bar damping)
- LeakyReLU(0.5)^2 in the gated MLP (iter 83, REVERTED)
- qk_gain init=5.0 (already at L250)
- EMA, partial RoPE (already present)
Records-derived priority order (within Tier 4): openai#1 H93 logit softcap (cheapest) → openai#2 H91 TTT (largest gain) → openai#3 H94 GPTQ+LQER → openai#4 H96 compression → openai#5 H95 tokenizer → openai#6/7 H98/H99 small adds → openai#8 H97 attn-gate quant. Lands AFTER the Tier 1 throughput iters (117b-2/3/3b) complete so each TTT trial is fast.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
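Editor's note: the Gemma2-style logit softcap called a "trivial 1-liner" above, sketched here for reference (the cap value is illustrative) — tanh squashes logits smoothly into (-cap, +cap) while preserving their ordering.

```python
import torch

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    return cap * torch.tanh(logits / cap)  # smooth, bounded, order-preserving
```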
Research is starting on local MLX (Mac M3), benchmarking architectures for the 16MB limit using Muon and muP.