
Record: 11L Int6 + SmearGate + BigramHash + Depth Recurrence #268

Open
brn-mwai wants to merge 2 commits into openai:main from brn-mwai:submission/brn-mwai-v1

Conversation

@brn-mwai

Summary

Competitive recipe with a novel depth recurrence option for the 10-min 16MB track.

Architecture

  • 11 layers, 512 dim, 3x MLP expansion (hidden=1536)
  • 8 attention heads, 4 KV heads (GQA)
  • ~27M parameters, quantized to int6 + zstd-22 (layout summarized in the config sketch below)
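
In code form, the bullets above correspond roughly to the configuration below; a sketch only, with assumed field names rather than the PR's actual hyperparameter object.

# Assumed configuration mirroring the architecture bullets; field names are illustrative.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layer: int = 11        # transformer blocks
    n_embd: int = 512        # model width
    mlp_hidden: int = 1536   # 3x MLP expansion
    n_head: int = 8          # query heads
    n_kv_head: int = 4       # KV heads shared across query groups (GQA)
    head_dim: int = 64       # n_embd // n_head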

Techniques

Technique | Source
Int6 per-row quantization + zstd-22 | Competition meta
11 layers with 3x MLP | Funded by int6 byte savings
SmearGate (learned token blending) | PR #135
BigramHash (4096x128 hash table) | PR #162
Muon optimizer with weight decay (0.03) | PR #179
Stochastic Weight Averaging (last 50%) | PR #162
Sliding window eval (stride=64) | PR #56
FP16 embedding passthrough | PR #42
Orthogonal initialization | PR #135
Depth recurrence (optional mode) | Novel

Novel: Depth Recurrence

Optional mode that shares transformer blocks across multiple loops (see the sketch after this list):

  • 2 prelude + 1 shared (looped 7x) + 2 coda = 5 unique blocks, 11 effective depth
  • Per-iteration embeddings tell the shared block which pass it's on
  • Freed parameter budget can go to wider model (640+ dim)
  • Motivated by MDL theory: lower L(model) frees bits for L(data|model)
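
A minimal sketch of this mode, assuming PyTorch and a generic block constructor make_block(dim); how the per-iteration embedding is injected is an assumption, not necessarily the PR's exact mechanism.

# Illustrative depth recurrence: 2 prelude + 1 shared block looped 7x + 2 coda
# = 5 unique blocks, 11 effective layers. Names here are assumptions.
import torch
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    def __init__(self, make_block, dim=512, n_prelude=2, n_loops=7, n_coda=2):
        super().__init__()
        self.prelude = nn.ModuleList(make_block(dim) for _ in range(n_prelude))
        self.shared = make_block(dim)                 # one set of weights reused every loop
        self.coda = nn.ModuleList(make_block(dim) for _ in range(n_coda))
        self.iter_emb = nn.Embedding(n_loops, dim)    # tells the shared block which pass it is on
        self.n_loops = n_loops

    def forward(self, x):                             # x: (batch, seq, dim)
        for block in self.prelude:
            x = block(x)
        for i in range(self.n_loops):
            x = self.shared(x + self.iter_emb.weight[i])   # inject the loop index
        for block in self.coda:
            x = block(x)
        return x

With 5 unique blocks instead of 11, the parameter budget saved on layers is what can be spent on a wider model, as noted above.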

Vectorized Int6 Packing

Rewrote the int6 bit-packing from Python loops to vectorized NumPy ops. 27M params pack in ~90ms instead of minutes.
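
A minimal sketch of what vectorized int6 packing can look like in NumPy, assuming per-row symmetric quantization and a flat code array; function and variable names are illustrative, not the PR's actual API.

# Illustrative int6 quantize-and-pack sketch (assumed layout, not the PR's exact code).
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Row-wise symmetric quantization of a 2-D float matrix to 6-bit codes."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0        # one scale per row
    scale = np.where(scale == 0, 1.0, scale)                   # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return (q + 31).astype(np.uint8), scale.astype(np.float16)

def pack_int6(codes: np.ndarray) -> np.ndarray:
    """Pack 6-bit codes (0..63) into bytes, 4 codes -> 3 bytes, no Python-level loop."""
    flat = codes.reshape(-1)
    pad = (-flat.size) % 4                                      # pad to a multiple of 4 codes
    if pad:
        flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    v = flat.reshape(-1, 4).astype(np.uint16)
    out = np.empty((v.shape[0], 3), dtype=np.uint8)
    out[:, 0] = (v[:, 0] << 2) | (v[:, 1] >> 4)
    out[:, 1] = ((v[:, 1] & 0xF) << 4) | (v[:, 2] >> 2)
    out[:, 2] = ((v[:, 2] & 0x3) << 6) | v[:, 3]
    return out.reshape(-1)

The packed byte stream is what zstd level 22 compresses afterwards (e.g. zstandard.ZstdCompressor(level=22).compress(packed.tobytes())), and a single pass of array arithmetic over ~27M codes is consistent with the ~90ms figure above.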

Validation

  • BPB: Pending 8xH100 validation (applying for compute grant)
  • Code has been syntax-verified and int6 roundtrip tested locally

Checklist

  • Submission folder in records/track_10min_16mb/
  • README.md with approach description
  • submission.json with metadata
  • train_gpt.py (single file, self-contained)
  • Training log (pending compute)
  • BPB score (pending compute)

… pending)

Competitive recipe with novel depth recurrence option:
- 11 layers, 512 dim, 3x MLP, Int6 quant + zstd-22
- SmearGate, BigramHash, Muon WD, SWA, sliding window eval
- Optional depth recurrence: 5 unique blocks, 11 effective depth
- Vectorized int6 packing, FP16 embedding passthrough

BPB pending 8xH100 validation run.
@mohosy

mohosy commented Mar 21, 2026

yoo depth recurrence is lowkey smart, sharing blocks like that frees up so much param budget. lmk when you get a score on this im curious how it stacks up

@MatoTeziTanka

Community Review — Record: 11L Int6 + SmearGate + BigramHash + Depth Recurrence

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache


PR #268 — Brian Mwai (@brn-mwai)
Title: Record: 11L Int6 + SmearGate + BigramHash + Depth Recurrence
Head SHA: 498bec4
Score: Not submitted (all null in submission.json)


Check 1: N-gram Family Bug (CLOSE trigger)

Result: CLEAN

BigramHash.forward (lines 682–688):

prev = F.pad(input_ids[:, :-1], (1, 0))   # prev[t] = input_ids[t-1]
h = (prev.long() * 2654435761 + input_ids.long()) % self.table_size

input_ids is the input context sequence (x). At position t, input_ids[t] is the current context token (known), not the target. The target is y[t] = input_ids[t+1]. The hash key is (context[t-1], context[t]) — a bigram of two already-seen tokens. No target token leakage. This is the legal BigramHash pattern, not the CLOSE variant.
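
To make the alignment argument concrete, here is a standalone re-run of the quoted two lines on a toy sequence (table size 4096 as in the PR; the token values are made up).

# Toy check of the quoted hashing pattern: the key at position t uses only x[t-1] and x[t].
import torch
import torch.nn.functional as F

input_ids = torch.tensor([[5, 9, 2, 7]])                   # context tokens x[0..3]
prev = F.pad(input_ids[:, :-1], (1, 0))                    # [[0, 5, 9, 2]], i.e. prev[t] = x[t-1]
h = (prev.long() * 2654435761 + input_ids.long()) % 4096   # one bigram key per position
print(h)                                                   # shape (1, 4), values in [0, 4096)
# The target y[t] = x[t+1] never enters the hash; the 4096x128 table is then just an
# embedding lookup on h (e.g. nn.Embedding(4096, 128)(h)), which is the legal pattern.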


Check 2: Pre-Quant TTT (CLOSE trigger)

Result: CLEAN — no TTT present

There is no test-time training in this submission. val_tokens is only ever consumed in eval_val and eval_val_standard, both of which run under torch.inference_mode() with model.eval(). No gradient step is taken on val data at any point. No AdamW, no fine-tuning loop on val tokens. The submission uses SWA (stochastic weight averaging over train checkpoints) and sliding-window evaluation — both legal.
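
For reference, the legal sliding-window evaluation pattern described here looks roughly like the sketch below, assuming a causal LM that returns per-token logits and a 1-D tensor of validation token IDs; names and the block size are illustrative, and converting bits per token into bits per byte still needs the tokens-to-bytes ratio.

# Illustrative sliding-window eval (stride=64): every target token is scored exactly once,
# with up to block_size tokens of preceding context, and no gradient is ever taken.
import math
import torch
import torch.nn.functional as F

@torch.inference_mode()
def sliding_window_bits_per_token(model, val_tokens, block_size=512, stride=64):
    model.eval()
    n = val_tokens.numel()
    total_nll, total_targets, prev_end = 0.0, 0, 0
    for begin in range(0, n - 1, stride):
        end = min(begin + block_size, n - 1)
        x = val_tokens[begin:end].unsqueeze(0)          # context window
        y = val_tokens[begin + 1:end + 1].unsqueeze(0)  # next-token targets
        logits = model(x)                               # (1, T, vocab)
        nll = F.cross_entropy(logits[0], y[0], reduction="none")
        new = end - prev_end                            # targets not scored by an earlier window
        total_nll += nll[-new:].sum().item()
        total_targets += new
        prev_end = end
        if end == n - 1:
            break
    return total_nll / total_targets / math.log(2)      # nats -> bits per token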

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

