
Non-record: Turbo-Muon + EngramLite(10240) + VE(8,9,10) — val_bpb 1.1431 #1205

Open

SergheiBrinza wants to merge 2 commits into openai:main from SergheiBrinza:submission/2026-04-01_TurboMuon_EngramLite_Improved

Conversation


@SergheiBrinza commented Apr 1, 2026

Summary

Non-record submission based on the PR #1089 Turbo-Muon + EngramLite stack with hyperparameter tuning.

val_bpb: 1.1431 (3-seed mean, std 0.0007)

Seed   val_bpb (sliding)
1337   1.1425
42     1.1438
2024   1.1431
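
For reference, the reported mean and spread follow directly from the table above (a quick check, assuming the quoted std is the sample standard deviation):

```python
import statistics

seed_bpb = {1337: 1.1425, 42: 1.1438, 2024: 1.1431}  # per-seed sliding val_bpb
mean = statistics.mean(seed_bpb.values())
std = statistics.stdev(seed_bpb.values())
print(f"mean={mean:.4f} std={std:.4f}")  # mean=1.1431 std=0.0007
```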

Changes from PR #1089

  • Higher LR (0.030 vs 0.025) for faster convergence
  • Wider EngramLite (10240x48 vs 8192x32) for more n-gram coverage
  • VE on layers 8,9,10 (vs 9,10) for additional token identity injection
  • Warmdown 4500 (vs 3500) for smoother weight averaging
  • Muon momentum warmup 1000 steps (vs 1500)
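
Expressed as a config delta (hypothetical parameter names; the actual flags in train_gpt.py may be spelled differently), the tuning amounts to:

```python
# Hypothetical parameter names; the real train_gpt.py config may differ.
pr_1089 = dict(
    muon_lr=0.025,
    engram_vocab=8192,  engram_dim=32,
    ve_layers=(9, 10),
    warmdown_steps=3500,
    muon_momentum_warmup_steps=1500,
)

this_pr = dict(
    pr_1089,
    muon_lr=0.030,                       # faster convergence
    engram_vocab=10240, engram_dim=48,   # wider n-gram coverage
    ve_layers=(8, 9, 10),                # one more VE layer
    warmdown_steps=4500,                 # smoother weight averaging
    muon_momentum_warmup_steps=1000,     # shorter momentum warmup
)
```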

Key Finding

The increased model size (~31.6M vs 30.7M params) pushed the artifact to 16.36MB pre-compression, forcing all 66 weight groups into int5 with 0 promotions to int6/int7 and 20.5% selective pruning. This aggressive quantization likely offset the architectural gains. The 16MB budget is extremely tight — even small parameter increases can cascade into significant quality loss through the quantization pipeline.
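
A back-of-envelope illustration of that cascade (raw weight bits only; the real pipeline's codebooks, group metadata, and index overhead are not modeled here):

```python
def artifact_mb(n_params, bits, prune_frac=0.0):
    """Rough artifact size in MB from raw quantized weights alone."""
    return n_params * (1.0 - prune_frac) * bits / 8 / 1e6

print(artifact_mb(31.6e6, 5))                    # ~19.8 MB: int5 alone is still over budget
print(artifact_mb(31.6e6, 5, prune_frac=0.205))  # ~15.7 MB: int5 + 20.5% pruning squeezes under 16 MB
```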

Hardware

8xH100 80GB SXM, 600s training, ~5550 steps at 106ms/step.
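
The step count is consistent with the wall-clock budget (simple arithmetic, assuming the quoted 106 ms/step average):

```python
budget_s, step_s, steps = 600, 0.106, 5550
print(steps * step_s)          # ~588 s of pure stepping
print(int(budget_s / step_s))  # ~5660-step ceiling if all 600 s went to stepping
```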

… 1.1431

Based on PR openai#1089 stack with hyperparameter tuning:
- Higher LR (0.030 vs 0.025) for faster convergence
- Wider EngramLite (10240x48 vs 8192x32)
- VE on layers 8,9,10 (vs 9,10)
- Warmdown 4500 (vs 3500)
- Muon momentum warmup 1000 steps (vs 1500)

3-seed mean: 1.1431 (std 0.0007)
Seeds: 1337=1.1425, 42=1.1438, 2024=1.1431
@SergheiBrinza force-pushed the submission/2026-04-01_TurboMuon_EngramLite_Improved branch from 2d2f0d7 to 974948e on April 1, 2026 at 01:21
@MatoTeziTanka

Community Review — Non-record: Turbo-Muon + EngramLite(10240) + VE(8,9,10) — val_bpb 1.1431

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #1205 Audit — Two Submissions
Head SHA: 974948e

---

## Submission 1: 2026-03-21_MixedQuant_BigramHash_SWA (val_bpb: 1.2421)

BigramHash implementation (lines 525–527):

```python
prev = F.pad(input_ids[:, :-1], (1, 0), value=0)
bh = (prev * 7919 + input_ids) % self.bigram_hash_size
x = self.tok_emb(input_ids) + self.bigram_proj(self.bigram_embed(bh))
```

Hash key uses prev (context shift of input) and input_ids (current input token). Target IDs (target_ids) are NOT XOR'd into the hash key. This is the correct BigramHash pattern — no illegal target-leakage into the hash. No XOR anywhere in the hash construction.

No TTT: eval_val() (lines 179–211) runs entirely under torch.inference_mode(), calls no optimizer, performs no backward pass. No TTT variables or functions present in submission 1. No scored-region SLOT detected. No multi-epoch val training.

VERDICT: PURE_NEURAL_CLEAN

---

## Submission 2: 2026-04-01_TurboMuon_EngramLite_Improved (val_bpb: 1.1431)

EngramLite n-gram hash (lines 887–907): Multi-head bigram+trigram hashing over input_ids and prev_ids (shifted context). Example:

```python
bi_h0 = (prev_ids * 1009 + input_ids) % B
tri_h0 = ((pp_ids * 36313) ^ (prev_ids * 27191) ^ (input_ids * 4903)) % B
```

XOR is used for hash mixing but only among context tokens (pp_ids, prev_ids, input_ids). target_ids is never incorporated into the hash key. No illegal target-leakage.

TTT (lines 1261–1562): eval_val_sliding_ttt() implements score-first TTT. Structure is the canonical PR #1413 pattern:

- PHASE 1 (lines 1384–1456): Score each chunk under torch.no_grad() / train(False) — loss accumulated before any weight update.
- PHASE 2 (lines 1469–1535): Train on scored chunk only; guarded by is_last = ci == num_chunks - 1 / `if not is_last...

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk is scored under torch.no_grad() before optimizer.step(), with an is_last guard preventing adaptation on the final scored chunk.
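
For readers unfamiliar with the pattern, a minimal sketch of the score-first-per-chunk discipline (illustrative only, not the submission's eval_val_sliding_ttt(); it assumes a model call that returns the mean loss for a chunk):

```python
import torch

def score_first_ttt_sketch(model, chunks, optimizer):
    """Each chunk is scored before any optimizer step that could have seen it;
    no adaptation happens on the final scored chunk."""
    total_loss, total_tokens = 0.0, 0
    num_chunks = len(chunks)
    for ci, (inputs, targets) in enumerate(chunks):
        # PHASE 1: score the chunk with frozen weights.
        model.eval()
        with torch.no_grad():
            loss = model(inputs, targets)  # assumed: returns mean loss
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()

        # PHASE 2: adapt on the chunk just scored, skipping the last one.
        is_last = ci == num_chunks - 1
        if not is_last:
            model.train()
            optimizer.zero_grad(set_to_none=True)
            model(inputs, targets).backward()
            optimizer.step()
    return total_loss / total_tokens
```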

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
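
For transparency, a stripped-down illustration of the kind of check involved (illustrative only, not the actual classifier): flag any modulo-hash expression whose subtree mentions the target tensor name.

```python
import ast

def flags_target_leak(source: str, target_name: str = "target_ids") -> list[int]:
    """Return line numbers of hash-style expressions (anything reduced with %)
    whose subtree references the target tensor name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mod):
            names = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
            if target_name in names:
                hits.append(node.lineno)
    return hits

# The legal EngramLite hash produces no hits; mixing target_ids in would be flagged.
legal = "bi_h0 = (prev_ids * 1009 + input_ids) % B"
leaky = "bi_h0 = ((prev_ids * 1009) ^ target_ids) % B"
print(flags_target_leak(legal), flags_target_leak(leaky))  # [] [1]
```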

SergheiBrinza added a commit to SergheiBrinza/parameter-golf that referenced this pull request Apr 24, 2026
A personal case study of my participation in the OpenAI Model Craft
Challenge, plus the April Turbo-Muon submission brought to main so
internal links resolve.

Contents:
- README.md: personal narrative and results tables
- docs/METHODS.md: technical breakdown of each technique used
- docs/EXPERIMENTS.md: verified runs and post-mortem of 020_ultimate
- docs/UPSTREAM_README.md: original OpenAI README preserved for context
- scripts/plot_curves.py: build training curves from train_*.log
- assets/loss_curves.png: training dynamics of both submissions
- Rewritten README for the 2026-03-21 submission
- Full 2026-04-01 Turbo-Muon submission ported from the PR branch:
  README, submission.json, train_gpt.py, three seed logs

Results on main:
- 2026-03-21 Mixed Quantization + BigramHash + SWA: val_bpb 1.2421
- 2026-04-01 Turbo-Muon + EngramLite (3 seeds, std 0.0007): val_bpb 1.1431

Upstream PRs:
- openai#370
- openai#1205