Non-record: MoE exploration + multi-bit quantization analysis #480

Open

imyesung wants to merge 3 commits into openai:main from imyesung:moe-quant-analysis

Conversation


@imyesung imyesung commented Mar 23, 2026

Summary

Non-record submission with two negative results under the 16MB artifact cap:

  • Preliminary MoE negative result: a 2-expert soft-routing MoE (2 × 1.5x MLP) underperforms the dense control throughout the observed training window. I added moe_train_partial.log, the surviving partial 8xH100 SXM log; the RunPod pod died at step 2000, so the MoE conclusion should be read as preliminary rather than fully converged. (An illustrative sketch of the routing scheme follows the MoE checkpoint table below.)
  • Leaderboard-relevant multi-bit quantization comparison: the dense control reaches 1.1456 val_bpb, within 0.0028 BPB of the March 20, 2026 leaderboard leader (1.1428). Quantizing that same trained dense model, int5 MLP weights cost +0.0068 BPB while int4 MLP weights cost +0.0655 BPB, which makes aggressive quantization an unattractive way to free up parameter budget for MoE expansion at this scale. (A sketch of the quantization scheme follows this list.)
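
For readers skimming the artifacts, here is a minimal sketch of the kind of per-channel symmetric int-N quantization this comparison measures, assuming round-to-nearest with one scale per output row. The submission's actual scheme lives in train_gpt.py and may differ in details (grouping, zero points):

```python
import torch

def quantize_intn(w: torch.Tensor, bits: int):
    """Quantize a 2D weight matrix to signed int-N with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                       # 7 / 15 / 31 for int4 / int5 / int6
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-12)
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Compare reconstruction error across the bit widths used in the table below.
w = torch.randn(2048, 512)
for bits in (6, 5, 4):
    q, s = quantize_intn(w, bits)
    rms = (dequantize(q, s) - w).pow(2).mean().sqrt()
    print(f"int{bits}: rms error {rms:.4f}")
```

Under this scheme the quantization noise roughly doubles with each bit removed; the measured BPB penalty in the table further down grows far faster than that from int5 to int4, which is the cliff the submission documents.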

Included evidence

  • README.md with updated explanation and MoE-vs-dense checkpoint table
  • submission.json with updated metadata
  • train.log for the dense control / quantization comparison
  • moe_train_partial.log for the surviving MoE run
  • train_gpt.py
  • quant_comparison.png

Quantization Comparison Results

| Config | Attn | MLP | Artifact | val_bpb | vs baseline |
| --- | --- | --- | --- | --- | --- |
| attn6_mlp6 | int6 | int6 | 15.14 MB | 1.1456 | baseline |
| attn6_mlp5 | int6 | int5 | 13.39 MB | 1.1524 | +0.0068 |
| attn6_mlp4 | int6 | int4 | 11.51 MB | 1.2111 | +0.0655 |
| attn5_mlp5 | int5 | int5 | 13.05 MB | 1.1559 | +0.0103 |
| attn5_mlp4 | int5 | int4 | 11.29 MB | 1.2183 | +0.0727 |
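
The Artifact column tracks bit width roughly linearly because sub-byte integers can be packed into a dense bitstream. A minimal packing sketch, assuming quantized values have already been offset to non-negative integers; the submission's actual serialization format is not specified in this PR:

```python
import numpy as np

def pack_bits(values: np.ndarray, bits: int) -> bytes:
    """Pack ints in [0, 2**bits) into a dense little-endian bitstream,
    so N weights at b bits cost ~N*b/8 bytes plus scales and metadata."""
    assert values.min() >= 0 and values.max() < (1 << bits)
    # Expand each value to 8 LSB-first bits, keep the low `bits`, re-pack densely.
    expanded = np.unpackbits(values.astype(np.uint8)[:, None], axis=1, bitorder="little")
    return np.packbits(expanded[:, :bits].ravel(), bitorder="little").tobytes()

vals = np.random.randint(0, 32, size=1000)
print(len(pack_bits(vals, 5)), "bytes for 1000 int5 values")  # 625 = 1000 * 5 / 8
```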

MoE Observed Checkpoints

| Step | Dense control val_bpb | MoE val_bpb | Delta |
| --- | --- | --- | --- |
| 500 | 1.4058 | 1.4115 | +0.0057 |
| 1000 | 1.3286 | 1.3386 | +0.0100 |
| 1500 | 1.3024 | 1.3163 | +0.0139 |
| 2000 | 1.2709 | 1.2866 | +0.0157 |
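
For concreteness, a minimal sketch of 2-expert soft routing as described in the summary, assuming each expert is a standard 4x MLP widened 1.5x and gated by a per-token softmax; the PR's train_gpt.py is the authoritative implementation and may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoEMLP(nn.Module):
    """Soft routing: every token is processed by all experts, and expert
    outputs are blended by per-token softmax gate weights (no top-k)."""
    def __init__(self, dim: int, n_experts: int = 2, expert_mult: float = 1.5):
        super().__init__()
        hidden = int(4 * dim * expert_mult)  # assumes a 4x dense MLP widened 1.5x per expert
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.gate(x), dim=-1)                    # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, D, E)
        return (outs * gates.unsqueeze(2)).sum(dim=-1)             # (B, T, D)
```

Note that with soft routing every token pays every expert's FLOPs, so the dense-vs-MoE comparison above is about parameter efficiency under the artifact cap, not compute savings.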

…n analysis

Negative result showing MoE is structurally disadvantaged below 500M params
under 16MB constraint. Multi-bit quantization comparison (int4/5/6) on same
trained dense model demonstrates int4 MLP incurs +0.065 BPB degradation,
closing the MoE parameter expansion path.
@MatoTeziTanka

Community Review — Non-record: MoE exploration + multi-bit quantization analysis

BPB: 0.0028 (cache parse — may be delta/std, not val_bpb; check PR title) | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 85f3399ec31e, file records/track_non_record_16mb/2026-03-23_MoE_MultibitQuant_Analysis/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
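
For reference, the "sliding-window stride-64" eval pattern typically looks like the sketch below: overlapping windows where each window only scores the targets the previous window has not, so every token is predicted with near-full left context. Context length, model output shape, and loss reduction here are assumptions, not the submission's code:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, n_bytes: int,
                       ctx: int = 1024, stride: int = 64) -> float:
    """Score a token stream with overlapping windows; each window contributes
    only the targets not already scored by the previous window."""
    total_nll, prev_end = 0.0, 0
    for begin in range(0, tokens.numel(), stride):
        end = min(begin + ctx, tokens.numel())
        ids = tokens[begin:end].unsqueeze(0)                    # (1, <=ctx)
        logits = model(ids[:, :-1])                             # assumed (1, T-1, vocab)
        loss = F.cross_entropy(logits[0], ids[0, 1:], reduction="none")
        n_new = min(end - prev_end, loss.numel())               # unseen targets only
        total_nll += loss[-n_new:].sum().item()
        prev_end = end
        if end == tokens.numel():
            break
    return total_nll / (n_bytes * math.log(2))                  # nats -> bits per byte
```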

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=9, vocab=1024, code=55906 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.
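
To make the caveat concrete, here is a hypothetical sketch of what a deterministic AST pass in the spirit of classify_prs.py can see; the real script and pattern bank are not shown in this PR, so every name below is illustrative:

```python
import ast

# Illustrative pattern bank; the real classify_prs.py bank is not public here.
FLAGGED = ("ttt", "slot", "ngram_cache", "test_time")

def classify_source(path: str) -> list[str]:
    """Flag function/class definitions whose names match the pattern bank.
    Note this sees only definition names, not behavior hidden in helpers."""
    tree = ast.parse(open(path, encoding="utf-8").read())
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if any(p in node.name.lower() for p in FLAGGED):
                hits.append(f"{type(node).__name__} '{node.name}' at line {node.lineno}")
    return hits  # empty -> "LOOKS CLEAN" under this deliberately coarse heuristic

if __name__ == "__main__":
    print(classify_source("train_gpt.py") or "LOOKS CLEAN")
```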


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate on the classifier.
