
Record: 10L d=512 Int5-MLP Int6-Attn sp1024 (val_bpb=1.1508) #465

Open
LoquiAuris wants to merge 1 commit into openai:main from LoquiAuris:submission/dense-10L-sp1024

Conversation

@LoquiAuris

Record Submission: 10L d=512 Int5-MLP + Int6-Attn + BigramHash + SmearGate

Author: Loqui Auris (@LoquiAuris)
val_bpb: 1.1508 (mean of 3 seeds, std=0.00012)
Artifact size: 15,680,288 bytes (15.68 MB)
Training time: ~10 minutes on 8×H100

Results

| Seed | Post-quant val_bpb | Artifact (bytes) |
| --- | --- | --- |
| 42 | 1.15097 | 15,680,288 |
| 1337 | 1.15077 | 15,654,632 |
| 2024 | 1.15074 | 15,639,761 |
| Mean | 1.15083 ±0.00012 | — |

Approach

Architecture

Standard PR #162 transformer stack with the following configuration:

  • 10 layers, d_model=512, 8 attention heads, 4 KV heads (GQA)
  • 3× FFN expansion (hidden=1536) with ReLU² activation
  • SmearGate: learned blend with previous token representation
  • BigramHash: 4096 buckets, dim=128, projected to 512
  • U-Net skip connections between symmetric layer pairs
  • RMSNorm, logit softcap=30.0, orthogonal initialization
  • RoPE positional encoding (persistent=False)
  • Tied embeddings via F.linear(x, tok_emb.weight)
  • Vocabulary: sp1024 (1,024 BPE tokens)
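SmearGate and BigramHash are named above but their implementations aren't shown; a rough PyTorch sketch of what such modules could look like (the gating form, the hash function, and the module shapes are assumptions, not the submission's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Blend each position with the previous token's representation via a learned gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)  # per-channel gate (assumed parameterization)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); shift right so position t mixes in token t-1
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
        return x + torch.sigmoid(self.gate(x)) * prev

class BigramHash(nn.Module):
    """Hashed bigram embedding: bucket (prev, cur) token pairs, embed, project to d_model."""
    def __init__(self, n_buckets: int = 4096, bigram_dim: int = 128, d_model: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.emb = nn.Embedding(n_buckets, bigram_dim)
        self.proj = nn.Linear(bigram_dim, d_model)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch, seq) token ids; hash each (previous, current) pair into a bucket
        prev = F.pad(ids, (1, 0))[:, :-1]
        buckets = (prev * 1000003 + ids) % self.n_buckets  # simple multiplicative hash (assumption)
        return self.proj(self.emb(buckets))
```

Both modules add O(d²) or less to the parameter count, which is consistent with fitting 10 full-width layers into the 16 MB artifact budget.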

Training

  • Optimizer: Muon (matrix_lr=0.02, momentum=0.99 with warmup from 0.92 over 1500 steps) + AdamW for embeddings and scalars
  • Weight decay: 0.04 (Muon), 0.01 (AdamW)
  • Gradient clipping: 0.3
  • Sequence length: 2048
  • Batch size: 786,432 tokens
  • Warmup: 20 steps
  • Warmdown: 3000 iterations (cosine schedule)
  • SWA: start_frac=0.5, checkpoint every 50 steps, 29 checkpoints averaged
  • Steps completed: ~7,600 in 10 minutes
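With SWA snapshots taken every 50 steps from the halfway point, the final weights reduce to a uniform average of the saved checkpoints; a minimal sketch of that averaging step (not the submission's code):

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average a list of model state_dicts (stochastic weight averaging)."""
    avg = {k: v.clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    for k in avg:
        avg[k] /= len(state_dicts)
    return avg
```

In practice the average is often accumulated online during training rather than from saved files, but the result is the same uniform mean.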

Quantization & Compression

  • MLP weights: Int5 per-row symmetric (clip=15)
  • Attention weights: Int6 per-row symmetric (clip=31)
  • Embeddings: FP16 passthrough
  • Norms, gates, control tensors: FP16 passthrough
  • Compression: zstd level 22
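The per-row symmetric scheme above can be sketched as follows; clip=15 yields Int5 levels and clip=31 yields Int6 (bit-packing and the zstd stage are omitted, and this is a sketch rather than the submission's code):

```python
import torch

def quantize_per_row_symmetric(w: torch.Tensor, clip: int):
    """Per-row symmetric quantization: scale each row so its max |value| maps to `clip`.

    clip=15 gives Int5 levels in [-15, 15]; clip=31 gives Int6 levels in [-31, 31].
    """
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / clip
    q = torch.round(w / scale).clamp(-clip, clip).to(torch.int8)
    return q, scale  # int8 container for the levels, plus per-row FP scales

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```

Per-row (rather than per-tensor) scales keep the rounding error proportional to each row's own magnitude, which is why aggressive 5- and 6-bit settings stay within noise here.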

Evaluation

Sliding window with stride=64, seq_len=2048
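The stride-64 sliding-window evaluation amounts to scoring every token exactly once while giving it up to seq_len-1 tokens of context; a hedged sketch, assuming the model returns logits of shape (batch, seq, vocab) (not the submission's actual eval code):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, total_bytes, seq_len=2048, stride=64):
    """Score a 1-D LongTensor of tokens with overlapping windows, counting each
    target once; bpb = total nats / (ln 2 * total bytes)."""
    nats = 0.0
    scored_until = 0  # absolute index of the last target position already scored
    for start in range(0, len(tokens) - 1, stride):
        window = tokens[start : start + seq_len + 1]
        if len(window) < 2:
            break
        inp, tgt = window[:-1], window[1:]
        logits = model(inp.unsqueeze(0))[0]            # (T, vocab)
        nll = F.cross_entropy(logits, tgt, reduction="none")
        first_new = max(scored_until + 1, start + 1)   # skip targets a prior window covered
        nats += nll[first_new - (start + 1):].sum().item()
        scored_until = start + len(tgt)
        if scored_until >= len(tokens) - 1:
            break
    return nats / (math.log(2) * total_bytes)

# quick check with a uniform-logit stand-in model: every token costs ln(vocab) nats
class _Uniform:
    def __call__(self, x):
        return torch.zeros(x.shape[0], x.shape[1], 16)

bpb = sliding_window_bpb(_Uniform(), torch.arange(10), total_bytes=20, seq_len=4, stride=2)
# 9 targets * ln(16) / (ln(2) * 20 bytes) = 1.8
```

A small stride improves quality (more context per scored token) at the cost of proportionally more forward passes.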

Key Finding: Int6 Embedding Quantization
During development, we explored using sp8192 (8,192-token vocabulary) to improve tokenizer efficiency. The sp8192 tokenizer encodes at 3.79 bytes/token vs sp1024's 2.44 — a 55% improvement that directly reduces bits-per-byte.
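Bytes per token enters bits-per-byte directly: at a fixed per-token loss, bpb = NLL(nats) / (ln 2 × bytes per token). A quick illustration with a made-up loss value (not a figure from this PR):

```python
import math

def bits_per_byte(mean_nll_nats: float, bytes_per_token: float) -> float:
    """Convert mean per-token NLL (in nats) to bits per byte."""
    return mean_nll_nats / (math.log(2) * bytes_per_token)

# Same hypothetical per-token loss under the two tokenizers:
print(bits_per_byte(2.0, 2.44))  # sp1024-style encoding
print(bits_per_byte(2.0, 3.79))  # sp8192-style encoding: lower bpb at equal loss
```

In practice the larger vocabulary also raises per-token loss, so the net effect depends on model capacity, which is exactly the trade-off explored below.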
The challenge: sp8192's embedding table at d=512 costs 8.39 MB in FP16, consuming over half the 16 MB budget and limiting the model to 6-8 layers.
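The 8.39 MB figure checks out directly from the table dimensions (decimal megabytes; per-row quantization scales and compression ignored):

```python
vocab, d_model = 8192, 512
fp16_bytes = vocab * d_model * 2        # 2 bytes per FP16 weight
int6_bytes = vocab * d_model * 6 // 8   # 6 bits per weight, tightly packed
print(fp16_bytes, int6_bytes)           # 8388608 and 3145728 bytes
```

So FP16 costs 8,388,608 bytes ≈ 8.39 MB, while Int6 packing cuts the table to ≈ 3.15 MB, freeing roughly 5 MB of the 16 MB budget for more layers.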
We discovered that embedding tables can be quantized to Int6 (6-bit per-row symmetric) with negligible quality loss:
| Embed quantization | val_bpb | Penalty vs FP16 |
| --- | --- | --- |
| FP16 (baseline) | 2.2352 | — |
| Int8 | 2.2354 | +0.0002 |
| Int6 | 2.2357 | +0.0005 |
A penalty of +0.0005 bpb is within noise. This enabled sp8192 at d=512 — a combination previously considered impossible under the 16 MB constraint.
sp8192 + Int6 Embed Results (H100)
| Config | Post-quant bpb | Artifact | Headroom |
| --- | --- | --- | --- |
| sp8192 d=512 6L Int6-embed | 1.2010 | 11.97 MB | 4.0 MB |
| sp8192 d=512 7L Int6-embed | 1.1863 | 13.57 MB | 2.4 MB |
| sp8192 d=512 8L Int6-embed | 1.1794 | 14.99 MB | 1.0 MB |
| sp8192 d=384 9L FP16-embed | 1.1889 | 12.63 MB | 3.4 MB |
| sp1024 d=512 10L (this submission) | 1.1510 | 15.68 MB | 0.3 MB |
Despite the tokenizer efficiency advantage, sp1024 with 10 layers at full d=512 width outperformed all sp8192 configurations. The layer count advantage (10L vs 6-8L) at d=512 exceeds the tokenizer efficiency gain on H100 with full training.
However, the Int6 embedding finding remains significant: it enables large-vocabulary models within severe artifact constraints and may prove valuable as quantization techniques improve and more layers become feasible at larger vocab sizes.
Development Process
This submission was developed through systematic architecture search:

  1. Tokenizer exploration: Tested sp1024, sp2048, sp4096, sp8192 — identified the embedding size vs model capacity trade-off as the key constraint
  2. Width vs depth analysis: Confirmed d=512 (width) > d=384/448 (narrower + deeper) across all tokenizer sizes at this parameter budget
  3. Int6 embedding discovery: Found that embedding quantization to 6-bit has negligible quality impact (+0.0005 bpb), unlocking large vocabularies at full model width
  4. 8 H100 configurations tested across 2 pod sessions, plus extensive local testing on Apple Silicon (500-step ablations)
  5. Final result: sp1024 d=512 10L produces the best bpb by maximizing layer count at full width within the 16 MB budget

Local Testing Methodology
All architecture decisions were validated through 500-step local runs on Apple Silicon (MPS backend) using AdamW, then confirmed on 8×H100 with the full Muon + SWA + PR #162 stack. Local-to-H100 scaling ratio was approximately 1.85-1.95×.
Hardware & Cost

  • Training: 8×H100 SXM (RunPod)
  • Local testing: Apple Silicon (MPS)
  • Total H100 time: ~2.5 hours across 2 pod sessions
  • Estimated cost: ~$65 in RunPod credits

Files

  • train_gpt.py — Complete training script with environment variable configuration
  • train.log — Training log from seed 42 (primary submission)
  • submission.json — Submission metadata
  • README.md — This file

@MatoTeziTanka

Community Review — Record: 10L d=512 Int5-MLP Int6-Attn sp1024 (val_bpb=1.1508)

BPB: 1.1508 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 4e781492461a, file records/track_10min_16mb/2026-03-22_DenseTransformer_sp1024_10L/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.10s, dim=512, layers=9, vocab=8192, code=57747 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
