
[WIP] Recurrent MQA Transformer — depth recurrence + weight tying (nidhilak-Aquarius)#29

Closed
nidhilak-Aquarius wants to merge 5 commits into openai:main from nidhilak-Aquarius:main

Conversation

@nidhilak-Aquarius

Recurrent MQA Transformer — WIP Submission

My approach draws from two ideas separated by 2,000 years.

The Chakravyuha in the Mahabharata achieves depth through repetition —
one structural unit looping inward, creating power far beyond its apparent
size. Kalaripayattu, Kerala's martial art, teaches that maximum force comes
from finding the exact marma point, not from raw strength.

Core innovation: one shared TransformerBlock looped 12 times instead
of 9 unique blocks. Deeper computation (12 passes vs. 9) with 9x fewer
unique block parameters.

The marma insight: weight sharing acts as a regularizer — the same weights
must generalize across ALL depths simultaneously, forcing more robust
representations than unique per-layer weights ever could.

Architecture:

  • Depth recurrence: 1 shared block × 12 loops (Universal Transformer style)
  • Weight-tied embeddings: zero-parameter output projection
  • Multi-Query Attention: 8Q / 1KV heads (43% fewer attention params)
  • SwiGLU FFN: outperforms GELU at identical parameter count (Shazeer 2020)
  • RoPE: zero learned positional parameters
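The MQA savings in the list above can be sanity-checked with back-of-the-envelope arithmetic. This is an illustrative sketch only: the PR does not state the hidden size, so d_model=512 with 8 heads is an assumption.

```python
# Attention projection parameter count: standard MHA vs. MQA (8Q / 1KV).
# d_model = 512 and n_heads = 8 are assumed values for illustration;
# the PR does not state the actual hidden size.
d_model, n_heads = 512, 8
d_head = d_model // n_heads

# MHA: Q, K, V, O projections are each d_model x d_model.
mha = 4 * d_model * d_model

# MQA: full Q and O projections, but K and V project to a single head.
mqa = 2 * d_model * d_model + 2 * d_model * d_head

savings = 1 - mqa / mha
print(f"MHA: {mha:,}  MQA: {mqa:,}  savings: {savings:.1%}")  # savings: 43.8%
```

The 43.8% reduction is independent of d_model (it is 1 - 2.25/4 whenever n_heads = 8), which is consistent with the "43% fewer attention params" figure.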

Results so far:

  • Unique parameters: ~3.5M
  • Compressed artifact: ~5.2MB (32.5% of 16MB limit)
  • Unused budget: 10.8MB
  • val_bpb on FineWeb: pending GPU compute grant

Hypothesis: Recurrence depth N=12 outperforms N=8 at identical
parameter count, with diminishing returns beyond N=16. The compute
grant will map this curve empirically.

Phase 2: BitNet ternary weights {-1,0,+1} at log2(3)=1.58 bits vs
16 bits = ~10x more effective parameters within the same 16MB artifact.

@nidhilak-Aquarius
Author

Recurrent MQA Transformer — Core Logic

This submission focuses on maximizing effective model capacity under a strict artifact constraint through parameter sharing and architectural efficiency:

  • Depth Recurrence — a single TransformerBlock is looped 12x, achieving deep computation with ~3.28M unique parameters (~39M effective).
  • Weight-Tied Embeddings — input embeddings are reused for output projection, eliminating additional parameters.
  • Multi-Query Attention (8Q / 1KV) — shared KV heads reduce attention parameters and memory overhead.
  • SwiGLU FFN — improved efficiency over GELU at the same parameter count.
  • RoPE — parameter-free positional encoding.
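A minimal NumPy sketch of the two headline mechanisms above, depth recurrence and weight-tied output projection. The block body here is a stand-in (a residual tanh layer), not the actual MQA + SwiGLU block, and all sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_loops = 100, 32, 12   # toy sizes for illustration

# One set of weights, reused at every depth.
embed = rng.normal(0, 0.02, (vocab, d_model))
w_block = rng.normal(0, 0.02, (d_model, d_model))  # stand-in for the real block

def shared_block(h):
    # Placeholder for the MQA + SwiGLU block: a residual tanh layer.
    return h + np.tanh(h @ w_block)

tokens = np.array([3, 14, 15])
h = embed[tokens]
for _ in range(n_loops):                # depth recurrence: same weights, 12 passes
    h = shared_block(h)

# Weight-tied output projection: reuse the embedding matrix, zero new params.
logits = h @ embed.T
print(logits.shape)                     # (3, 100)
```

The unique parameter count is just embed plus one block, while the effective depth is n_loops blocks, which is the 12x reuse the submission relies on.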

Artifact size (measured): ~2.82MB (int8 + zlib, smoke test) — well under the 16MB constraint, leaving substantial headroom for further optimization.
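The "int8 + zlib" measurement can be reproduced in outline. The quantization scheme below (per-tensor absmax int8) is an assumption — the comment only says "int8 + zlib" — and random weights are used in place of trained ones, so the exact size will differ:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the trained weights: ~3.28M float32 parameters.
weights = rng.normal(0, 0.02, 3_280_000).astype(np.float32)

# Per-tensor absmax int8 quantization (assumed scheme).
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# zlib over the raw int8 bytes, as in the reported smoke test.
blob = zlib.compress(q.tobytes(), level=9)
print(f"compressed: {len(blob) / 1e6:.2f} MB of 16 MB budget")
```

Real trained weights typically compress better than this Gaussian stand-in, since their quantized distribution is narrower and more structured.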

Hypothesis: Increasing recurrence depth (N=12) improves performance over shallower configurations at fixed parameter count, with diminishing returns beyond N~16.

Local smoke test completed successfully; full GPU evaluation (val_bpb) pending compute grant.

@0hq 0hq closed this Mar 19, 2026
gHashTag added a commit to gHashTag/parameter-golf that referenced this pull request Apr 30, 2026
openai#143) (openai#29)

* feat(audit): tri-railway audit run + verdict CLI (closes openai#9, refs openai#143)

Anchor: phi^2 + phi^-2 = 3.

Adds the online-audit subcommands that close L-R14 (Gate-2 verdict)
formally:

  tri-railway audit run     --project <UUID> --target <BPB> [--ledger PATH] [--json]
  tri-railway audit verdict --ledger PATH    --target <BPB>

Behaviour:
  - audit run lists Railway services for a project (Q::project_view),
    converts them to RealService (with seed parsed from name like
    'trios-train-seed-43' or 'igla-final-seed-44'), optionally loads
    a JSONL ledger, calls trios_railway_audit::detect to produce the
    full D1..D7 drift event set, runs verdict() to compute Gate2Pass /
    NotYet / Drift, prints a text summary, optionally JSON, then
    seals one R7 audit triplet to .trinity/experience via the
    existing experience writer. Exit codes:
        0 = GATE-2 PASS   (>= 3 services with bpb < target, no error drift)
        1 = DRIFT         (any error-severity event)
        2 = NOT YET       (no errors, target not yet met)
  - audit verdict is the offline form for cron/CI: takes a JSONL
    ledger snapshot already serialized from Neon, computes the same
    verdict against synthetic services, prints one line, exits with
    the same codes.

Auth fix in trios-railway-core: RAILWAY_TOKEN_AUTH env var allows
forcing 'team' (Bearer) vs 'project' (Project-Access-Token) when the
UUID-shape heuristic guesses wrong. Personal API tokens are also
UUID-shaped but require Bearer; without this override, authenticating
to backboard.railway.com returned 'Not Authorized'. Verified with
both curl variants against the live IGLA project.

R5-honest verification (logs in PR body):
  cargo build --bin tri-railway --locked        : OK
  cargo test  --workspace --locked              : 22 passed, 0 failed
  cargo clippy -D warnings                      : 0
  Live smoke against IGLA (e4fe33bb-...):
    18 services, 16 D1_ORPHAN warnings, NOT YET, exit=2,
    R7 triplet sealed at /tmp/audit-smoke/.trinity/experience/<date>.trinity
  Synthetic ledger smoke:
    3 seeds bpb<1.85 -> GATE-2 PASS, exit=0
    3 seeds bpb<1.85 vs target=1.50 -> NOT YET, exit=2
    1 seed bpb=3.5e38 -> DRIFT (D5_OVERFLOW), exit=1

Closes openai#9. Refs openai#143 (IGLA RACE Gate-2 / L-R14).

* style(audit): cargo fmt --all (CI format-check fix)

---------

Co-authored-by: Perplexity Computer <computer@perplexity.ai>