
Non-record: Systematic Hyperparameter Search (val_bpb=1.2075)#141

Open
nglain wants to merge 1 commit into openai:main from nglain:submission/systematic-search

Conversation

@nglain

@nglain nglain commented Mar 20, 2026

Summary

Metric                 Value
Post-quant val_bpb     1.2075
Pre-quant val_bpb      1.2008
Compressed artifact    ~15.2 MB
Training steps         7,390
Training time          600 s (8×H100 SXM)

Approach

Methodical hyperparameter search through 33 experiments across three GPU tiers (A40 → 1×H100 → 8×H100), using fixed-seed paired comparisons (SEED=1337) so that deltas of roughly ±0.001 BPB are reliably measurable.
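The paired-comparison protocol can be sketched as below; `seed_everything` and `paired_delta` are illustrative names (not functions from this PR's `train_gpt.py`), and the baseline score in the example is hypothetical:

```python
import random

SEED = 1337  # the fixed seed used for every paired comparison


def seed_everything(seed: int = SEED) -> None:
    """Pin the RNG so a baseline run and a variant run see identical
    weight init and data order. A real training script would also pin
    numpy and torch (np.random.seed, torch.manual_seed)."""
    random.seed(seed)


def paired_delta(baseline_bpb: float, variant_bpb: float) -> float:
    """val_bpb delta for one A/B pair; negative means the variant wins.
    With matched seeds, deltas of about +/-0.001 BPB are resolvable."""
    return variant_bpb - baseline_bpb


# Hypothetical pair: a variant scoring 1.2075 against a 1.2125 baseline
# reproduces the -0.005 BPB delta reported for the Muon settings.
print(round(paired_delta(1.2125, 1.2075), 4))  # -0.005
```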

What works

  • Muon optimizer (lr=0.02, momentum=0.99, warmdown=3000): -0.005 BPB
  • ROPE_BASE=200000: -0.003 BPB
  • seq_len=4096: -0.006 BPB

What doesn't work

  • int6 STE + Muon: the straight-through estimator conflicts with Muon's update (+0.007 BPB worse)
  • 12 layers: slower per step, so fewer steps fit the wallclock budget
  • Larger batch (786K tokens): the loss from fewer steps outweighs the per-step quality gain
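The wallclock tradeoff behind the last two bullets can be made concrete. Only the 600 s budget and 7,390-step count come from the PR; the step times below are hypothetical:

```python
WALLCLOCK_S = 600  # fixed training budget from the PR


def steps_in_budget(step_time_s: float, budget_s: float = WALLCLOCK_S) -> int:
    """Optimizer steps that fit in a fixed wallclock budget."""
    return int(budget_s / step_time_s)


# Hypothetical 250 ms/step baseline vs. a config (deeper model or
# larger batch) whose steps take twice as long on a saturated GPU:
baseline_steps = steps_in_budget(0.25)  # 2400
doubled_steps = steps_in_budget(0.50)   # 1200

# The halved step count must be paid for by per-step quality gains;
# per the search, neither 12 layers nor the 786K batch broke even.
```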

Key insight

Optimal hyperparameters differ dramatically across compute budgets. The optimal LR on A40/2min (0.10) is 5× the optimal on 8×H100/10min (0.02). Parameters must be re-validated at target compute scale.
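As a concrete check of that ratio (both endpoint values are reported above; the tier labels are just illustrative keys):

```python
# Best MATRIX_LR found at each compute tier in this search; only the
# two endpoint tiers are reported in the PR.
BEST_LR = {
    "A40/2min": 0.10,
    "8xH100/10min": 0.02,
}

ratio = BEST_LR["A40/2min"] / BEST_LR["8xH100/10min"]
print(round(ratio, 2))  # 5.0 -- the 5x gap that motivates re-validation
```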

Changes from baseline

Only hyperparameters: MATRIX_LR=0.02, MUON_MOMENTUM=0.99, WARMDOWN_ITERS=3000, ROPE_BASE=200000, TRAIN_SEQ_LEN=4096. No architectural changes.
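The full diff from the baseline, expressed as the module-level constants the PR names (whether `train_gpt.py` actually stores them as plain constants is an assumption):

```python
# All five hyperparameter overrides from this PR; no architectural changes.
MATRIX_LR = 0.02       # Muon learning rate for matrix (2D) parameters
MUON_MOMENTUM = 0.99   # Muon momentum
WARMDOWN_ITERS = 3000  # LR warmdown span at the end of training
ROPE_BASE = 200_000    # RoPE base frequency
TRAIN_SEQ_LEN = 4096   # training sequence length
```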

Test plan

  • Trained on 8×H100 SXM, 600s wallclock
  • final_int8_zlib_roundtrip val_bpb: 1.2075
  • Artifact under 16,000,000 bytes
  • train_gpt.py compiles and runs from records folder
  • train.log included
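A minimal sketch of the size gate only, assuming symmetric per-tensor int8 quantization before zlib; the actual `final_int8_zlib_roundtrip` harness may quantize and pack differently:

```python
import zlib

CAP_BYTES = 16_000_000  # artifact cap from the test plan


def int8_zlib_size(weights: list[float]) -> int:
    """Quantize one weight tensor to int8 with a symmetric per-tensor
    scale, zlib-compress it, and return the compressed size in bytes."""
    scale = max((abs(w) for w in weights), default=1.0) / 127.0 or 1.0
    q = bytes(max(-127, min(127, round(w / scale))) & 0xFF for w in weights)
    return len(zlib.compress(q, 9))


# Example: 10,000 small repetitive weights compress far below the cap.
size = int8_zlib_size([0.001 * (i % 7 - 3) for i in range(10_000)])
assert size < CAP_BYTES
```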

Methodical search through 33 experiments across A40, 1xH100, 8xH100.
Fixed-seed paired comparison (SEED=1337) for reliable delta measurement.

Key findings:
- Muon optimizer (lr=0.02, momentum=0.99, warmdown=3000): -0.005 BPB
- ROPE_BASE=200000: -0.003 BPB
- seq_len=4096: -0.006 BPB
- int6 STE conflicts with Muon optimizer (+0.007 worse)
- Hyperparameter transfer across compute scales is unreliable

val_bpb: 1.2075 (post-quant roundtrip)
Artifact: ~15.2 MB (under 16 MB cap)
Trained on 8xH100 SXM, 600s wallclock, 7390 steps
@MatoTeziTanka

PR #141 Review

Title: Non-record: Systematic Hyperparameter Search (val_bpb=1.2075)
State: open
Date Reviewed: 2026-04-11

Code Analysis

train_gpt.py Checks

  • target-in-key pattern: not found
  • TTT (Temporal Token Tagging): not found
  • SLOT (Slot MoE): not found
  • Custom Tokenizer: not found

Verdict

Classification: PURE_NEURAL_CLEAN

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: ✓ MERGE


Draft by @MatoTeziTanka for parameter-golf review sweep (2026-04-11)


Reviewed by @MatoTeziTanka / The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

