Non-record: MoE exploration + multi-bit quantization analysis by imyesung · Pull Request #480 · openai/parameter-golf

imyesung · 2026-03-23T00:25:02Z

Summary

Non-record submission with two negative results under the 16MB artifact cap:

Preliminary MoE negative result: a 2-expert soft-routing MoE (2 × 1.5x MLP) underperforms the dense control throughout the observed training window. I added moe_train_partial.log, the surviving partial 8xH100 SXM log; the RunPod pod died at step 2000, so the MoE conclusion should be interpreted as preliminary rather than a fully converged final result.
Leaderboard-relevant multi-bit quantization comparison: the dense control reaches 1.1456 val_bpb, which is within 0.0028 BPB of the March 20, 2026 leaderboard leader (1.1428). On that same trained dense model, int5 MLP costs +0.0068 BPB while int4 MLP costs +0.0655 BPB, making aggressive quantization unattractive for MoE expansion at this scale.

Included evidence

README.md with updated explanation and MoE-vs-dense checkpoint table
submission.json with updated metadata
train.log for the dense control / quantization comparison
moe_train_partial.log for the surviving MoE run
train_gpt.py
quant_comparison.png

Quantization Comparison Results

Config	Attn	MLP	Artifact	val_bpb	vs baseline
attn6_mlp6	int6	int6	15.14 MB	1.1456	baseline
attn6_mlp5	int6	int5	13.39 MB	1.1524	+0.0068
attn6_mlp4	int6	int4	11.51 MB	1.2111	+0.0655
attn5_mlp5	int5	int5	13.05 MB	1.1559	+0.0103
attn5_mlp4	int5	int4	11.29 MB	1.2183	+0.0727

MoE Observed Checkpoints

Step	Dense control val_bpb	MoE val_bpb	Delta
500	1.4058	1.4115	+0.0057
1000	1.3286	1.3386	+0.0100
1500	1.3024	1.3163	+0.0139
2000	1.2709	1.2866	+0.0157

…n analysis Negative result showing MoE is structurally disadvantaged below 500M params under 16MB constraint. Multi-bit quantization comparison (int4/5/6) on same trained dense model demonstrates int4 MLP incurs +0.065 BPB degradation, closing the MoE parameter expansion path.

MatoTeziTanka · 2026-04-11T20:02:34Z

Community Review — Non-record: MoE exploration + multi-bit quantization analysis

BPB: 0.0028 (cache parse — may be delta/std, not val_bpb; check PR title) | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 85f3399ec31e, file records/track_non_record_16mb/2026-03-23_MoE_MultibitQuant_Analysis/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=55906 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=55906 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

imyesung added 2 commits March 23, 2026 09:17

fix: Update README format and chart styling

d7da256

notapplica mentioned this pull request Mar 23, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

fix: Add MoE evidence and leaderboard relevance

85f3399

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: MoE exploration + multi-bit quantization analysis#480

Non-record: MoE exploration + multi-bit quantization analysis#480
imyesung wants to merge 3 commits intoopenai:mainfrom
imyesung:moe-quant-analysis

imyesung commented Mar 23, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

imyesung commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Included evidence

Quantization Comparison Results

MoE Observed Checkpoints

Uh oh!

MatoTeziTanka commented Apr 11, 2026

Community Review — Non-record: MoE exploration + multi-bit quantization analysis

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

imyesung commented Mar 23, 2026 •

edited

Loading