
[WIP] Recurrent MQA Transformer — depth recurrence + weight tying (nidhilak-Aquarius)#29

Closed
nidhilak-Aquarius wants to merge 5 commits into openai:main from nidhilak-Aquarius:main

Conversation

@nidhilak-Aquarius

Recurrent MQA Transformer — WIP Submission

My approach draws from two ideas separated by 2,000 years.

The Chakravyuha in the Mahabharata achieves depth through repetition —
one structural unit looping inward, creating power far beyond its apparent
size. Kalaripayattu, Kerala's martial art, teaches that maximum force comes
from finding the exact marma point, not from raw strength.

Core innovation: one shared TransformerBlock looped 12 times instead
of 9 unique blocks. Deeper computation (12 passes vs. 9) with 9x fewer
unique block parameters.

The marma insight: weight sharing acts as a regularizer — the same weights
must generalize across ALL depths simultaneously, forcing more robust
representations than unique per-layer weights ever could.

Architecture:

  • Depth recurrence: 1 shared block × 12 loops (Universal Transformer style)
  • Weight-tied embeddings: zero-parameter output projection
  • Multi-Query Attention: 8Q / 1KV heads (43% fewer attention params)
  • SwiGLU FFN: outperforms GELU at identical parameter count (Shazeer 2020)
  • RoPE: zero learned positional parameters
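The MQA savings in the list above can be sanity-checked with back-of-the-envelope arithmetic. This is an illustrative sketch only: the PR does not state the hidden size, so d_model=512 with 8 heads is an assumption.

```python
# Attention projection parameter count: standard MHA vs. MQA (8Q / 1KV).
# d_model = 512 and n_heads = 8 are assumed values for illustration;
# the PR does not state the actual hidden size.
d_model, n_heads = 512, 8
d_head = d_model // n_heads

# MHA: Q, K, V, O projections are each d_model x d_model.
mha = 4 * d_model * d_model

# MQA: full Q and O projections, but K and V project to a single head.
mqa = 2 * d_model * d_model + 2 * d_model * d_head

savings = 1 - mqa / mha
print(f"MHA: {mha:,}  MQA: {mqa:,}  savings: {savings:.1%}")  # savings: 43.8%
```

The 43.8% reduction is independent of d_model (it is 1 - 2.25/4 whenever n_heads = 8), which is consistent with the "43% fewer attention params" figure.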

Results so far:

  • Unique parameters: ~3.5M
  • Compressed artifact: ~5.2MB (32.5% of 16MB limit)
  • Unused budget: 10.8MB
  • val_bpb on FineWeb: pending GPU compute grant

Hypothesis: Recurrence depth N=12 outperforms N=8 at identical
parameter count, with diminishing returns beyond N=16. The compute
grant will map this curve empirically.

Phase 2: BitNet ternary weights {-1,0,+1} at log2(3)=1.58 bits vs
16 bits = ~10x more effective parameters within the same 16MB artifact.

@nidhilak-Aquarius
Author

Recurrent MQA Transformer — Core Logic

This submission focuses on maximizing effective model capacity under a strict artifact constraint through parameter sharing and architectural efficiency:

  • Depth Recurrence — a single TransformerBlock is looped 12x, achieving deep computation with ~3.28M unique parameters (~39M effective).
  • Weight-Tied Embeddings — input embeddings are reused for output projection, eliminating additional parameters.
  • Multi-Query Attention (8Q / 1KV) — shared KV heads reduce attention parameters and memory overhead.
  • SwiGLU FFN — improved efficiency over GELU at the same parameter count.
  • RoPE — parameter-free positional encoding.
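A minimal NumPy sketch of the two headline mechanisms above, depth recurrence and weight-tied output projection. The block body here is a stand-in (a residual tanh layer), not the actual MQA + SwiGLU block, and all sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_loops = 100, 32, 12   # toy sizes for illustration

# One set of weights, reused at every depth.
embed = rng.normal(0, 0.02, (vocab, d_model))
w_block = rng.normal(0, 0.02, (d_model, d_model))  # stand-in for the real block

def shared_block(h):
    # Placeholder for the MQA + SwiGLU block: a residual tanh layer.
    return h + np.tanh(h @ w_block)

tokens = np.array([3, 14, 15])
h = embed[tokens]
for _ in range(n_loops):                # depth recurrence: same weights, 12 passes
    h = shared_block(h)

# Weight-tied output projection: reuse the embedding matrix, zero new params.
logits = h @ embed.T
print(logits.shape)                     # (3, 100)
```

The unique parameter count is just embed plus one block, while the effective depth is n_loops blocks, which is the 12x reuse the submission relies on.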

Artifact size (measured): ~2.82MB (int8 + zlib, smoke test) — well under the 16MB constraint, leaving substantial headroom for further optimization.
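The "int8 + zlib" measurement can be reproduced in outline. The quantization scheme below (per-tensor absmax int8) is an assumption — the comment only says "int8 + zlib" — and random weights are used in place of trained ones, so the exact size will differ:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the trained weights: ~3.28M float32 parameters.
weights = rng.normal(0, 0.02, 3_280_000).astype(np.float32)

# Per-tensor absmax int8 quantization (assumed scheme).
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# zlib over the raw int8 bytes, as in the reported smoke test.
blob = zlib.compress(q.tobytes(), level=9)
print(f"compressed: {len(blob) / 1e6:.2f} MB of 16 MB budget")
```

Real trained weights typically compress better than this Gaussian stand-in, since their quantized distribution is narrower and more structured.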

Hypothesis: Increasing recurrence depth (N=12) improves performance over shallower configurations at fixed parameter count, with diminishing returns beyond N~16.

Local smoke test completed successfully; full GPU evaluation (val_bpb) pending compute grant.

@0hq 0hq closed this Mar 19, 2026
gHashTag added a commit to gHashTag/parameter-golf that referenced this pull request Apr 30, 2026
openai#143) (openai#29)

* feat(audit): tri-railway audit run + verdict CLI (closes openai#9, refs openai#143)

Anchor: phi^2 + phi^-2 = 3.

Adds the online-audit subcommands that close L-R14 (Gate-2 verdict)
formally:

  tri-railway audit run     --project <UUID> --target <BPB> [--ledger PATH] [--json]
  tri-railway audit verdict --ledger PATH    --target <BPB>

Behaviour:
  - audit run lists Railway services for a project (Q::project_view),
    converts them to RealService (with seed parsed from name like
    'trios-train-seed-43' or 'igla-final-seed-44'), optionally loads
    a JSONL ledger, calls trios_railway_audit::detect to produce the
    full D1..D7 drift event set, runs verdict() to compute Gate2Pass /
    NotYet / Drift, prints a text summary, optionally JSON, then
    seals one R7 audit triplet to .trinity/experience via the
    existing experience writer. Exit codes:
        0 = GATE-2 PASS   (>= 3 services with bpb < target, no error drift)
        1 = DRIFT         (any error-severity event)
        2 = NOT YET       (no errors, target not yet met)
  - audit verdict is the offline form for cron/CI: takes a JSONL
    ledger snapshot already serialized from Neon, computes the same
    verdict against synthetic services, prints one line, exits with
    the same codes.

Auth fix in trios-railway-core: RAILWAY_TOKEN_AUTH env var allows
forcing 'team' (Bearer) vs 'project' (Project-Access-Token) when the
UUID-shape heuristic guesses wrong. Personal API tokens are also
UUID-shaped but require Bearer; without this override, authenticating
to backboard.railway.com returned 'Not Authorized'. Verified with
both curl variants against the live IGLA project.

R5-honest verification (logs in PR body):
  cargo build --bin tri-railway --locked        : OK
  cargo test  --workspace --locked              : 22 passed, 0 failed
  cargo clippy -D warnings                      : 0
  Live smoke against IGLA (e4fe33bb-...):
    18 services, 16 D1_ORPHAN warnings, NOT YET, exit=2,
    R7 triplet sealed at /tmp/audit-smoke/.trinity/experience/<date>.trinity
  Synthetic ledger smoke:
    3 seeds bpb<1.85 -> GATE-2 PASS, exit=0
    3 seeds bpb<1.85 vs target=1.50 -> NOT YET, exit=2
    1 seed bpb=3.5e38 -> DRIFT (D5_OVERFLOW), exit=1

Closes openai#9. Refs openai#143 (IGLA RACE Gate-2 / L-R14).

* style(audit): cargo fmt --all (CI format-check fix)

---------

Co-authored-by: Perplexity Computer <computer@perplexity.ai>