
Record: PR1851 + 9-hparam stack + wd_strong + GPTQ AR + pergroup - val_bpb 1.05957 (1 seed)#2020

Open
Itssshikhar wants to merge 3 commits intoopenai:mainfrom
Itssshikhar:run6-pergroup-record

Conversation

@Itssshikhar

Summary

val_bpb: 1.05957 (seed 42) | 15,901,624 bytes | 8xH100 SXM, 600s | Phased LoRA TTT

Built on the PR #1851 stack. Key additions: PR #1855's 9-hparam stack, stronger Muon weight-decay schedule, GPTQ all-rank Hessian averaging, and PR #1855-style pergroup
lrzip+brotli compression ported into the PR #1851 graph.

Couldn't make this a 3-seed mean as Runpod credits ran out. I took into account the discussion on CaseOps in previous PRs; since those were merged, I went ahead with it.

Results (8xH100 80GB SXM, 600s, phased TTT)

| Seed | Steps | ms/step | Pre-quant BPB | Quant BPB | TTT BPB | Artifact (bytes) |
|------|-------|---------|---------------|-----------|---------|------------------|
| 42   | 4,844 | 122.2   | 1.06335       | 1.07246   | 1.05957 | 15,901,624       |

Delta vs PR #1855 seed 42 (1.05989): -0.00033 BPB.
Delta vs PR #1855 3-seed mean (1.06108): -0.00151 BPB.

Key Techniques

  1. PR #1851 graph preserved - keeps the BOS-fixed SmearGate + LQER asymmetric + phased TTT graph, including the PR #1787 SparseAttnGate/PolarNS/FusedCE stack.

  2. PR #1855 9-hparam stack - transfers the accepted greedy hparam overrides onto the PR #1851-derived graph.

  3. wd_strong - stronger Muon WD schedule with low=0.5 and high=1.75.

  4. GPTQ all-rank Hessian averaging - averages GPTQ calibration Hessians across ranks.

  5. Pergroup compression port - ports PR #1855's lrzip+brotli per-group compressor into the PR #1851 graph, bringing the artifact under the 16 MB cap.
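The wd_strong schedule (item 3) can be sketched as follows. The exact schedule shape used in the run is not shown in this PR, so this is a hedged sketch that assumes a simple linear ramp of the Muon weight decay between the two published factors (`WD_SCHED_LOW_FACTOR=0.5`, `WD_SCHED_HIGH_FACTOR=1.75`); the function name and signature are illustrative, not taken from `train_gpt.py`.

```python
def wd_strong(step: int, total_steps: int, base_wd: float,
              low_factor: float = 0.5, high_factor: float = 1.75) -> float:
    """Sketch: ramp weight decay linearly from low_factor * base_wd
    at step 0 up to high_factor * base_wd at the final step.
    The real schedule in train_gpt.py may differ in shape."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return base_wd * (low_factor + (high_factor - low_factor) * frac)
```

A schedule like this starts with weaker regularization while the loss is falling quickly and tightens it toward the end of the 600s budget.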
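GPTQ all-rank Hessian averaging (item 4) amounts to an all-reduce over per-rank calibration Hessians. A minimal single-process sketch, simulating the collective with a plain list (in the real run this would be a `torch.distributed.all_reduce` with SUM followed by a divide by world size); the helper names are hypothetical:

```python
import numpy as np

def local_hessian(x: np.ndarray) -> np.ndarray:
    """Per-rank GPTQ-style Hessian estimate from a
    (samples, features) activation matrix: H ~ 2 * X^T X / n."""
    return 2.0 * x.T @ x / len(x)

def all_rank_average(hessians: list) -> np.ndarray:
    """Stand-in for all_reduce(SUM) / world_size: each rank ends up
    with the Hessian averaged over every rank's calibration shard."""
    return sum(hessians) / len(hessians)
```

Averaging across ranks lets GPTQ calibrate against the union of all data-parallel shards instead of a single rank's slice, which should reduce quantization noise.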
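The per-group compressor (item 5) compresses each quantization group independently and length-prefixes the streams so they can be decompressed in isolation. A hedged sketch using stdlib `zlib` as a stand-in, since the actual lrzip+brotli codecs from PR #1855 are not stdlib; the framing format here is illustrative only:

```python
import struct
import zlib

def compress_pergroup(groups: list, level: int = 9) -> bytes:
    """Compress each group's bytes independently, each prefixed
    with a little-endian u32 length of its compressed stream."""
    out = bytearray()
    for g in groups:
        c = zlib.compress(g, level)
        out += struct.pack("<I", len(c)) + c
    return bytes(out)

def decompress_pergroup(blob: bytes) -> list:
    """Invert compress_pergroup: walk the length-prefixed streams."""
    groups, i = [], 0
    while i < len(blob):
        (n,) = struct.unpack_from("<I", blob, i)
        i += 4
        groups.append(zlib.decompress(blob[i:i + n]))
        i += n
    return groups
```

Per-group framing trades a few bytes of overhead per group for random access and better ratio on groups with similar value distributions.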

Numbers

| Run     | Graph                           | Compressor | TTT BPB | Artifact (bytes) | Valid |
|---------|---------------------------------|------------|---------|------------------|-------|
| Run-1   | PR #1851 + 9hp + wd_strong + AR | brotli     | 1.05950 | 16,140,607       | No    |
| This PR | PR #1851 + 9hp + wd_strong + AR | pergroup   | 1.05957 | 15,901,624       | Yes   |

The compressor swap costs only +0.00006 BPB while saving 238,983 bytes, which brings the artifact under the 16 MB cap and makes this run valid.

Reproduction

```shell
RUN_ID=top_pr1855_hparams_s42_pergroup SEED=42 \
CASEOPS_ENABLED=1 EMBED_BITS=7 \
SMEAR_GATE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 \
MIN_LR=0.1 GPTQ_RESERVE_SECONDS=8.0 \
PHASED_TTT_NUM_PHASES=3 GPTQ_ALL_REDUCE=1 \
WD_SCHEDULE_ENABLED=1 WD_SCHED_LOW_FACTOR=0.5 WD_SCHED_HIGH_FACTOR=1.75 \
EMBED_CLIP_SIGMAS=14.0 MLP_CLIP_SIGMAS=11.5 \
WARMDOWN_FRAC=0.85 BETA2=0.99 \
TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80 \
SPARSE_ATTN_GATE_SCALE=0.5 PHASED_TTT_PREFIX_DOCS=2500 \
COMPRESSOR=pergroup \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Test Plan

- Seed 42 validation on 8xH100 SXM
- Artifact under 16,000,000 bytes
- Training wallclock stop at 592.1s
- Full pipeline: train -> EMA -> GPTQ/LQER -> pergroup compress -> decompress -> quant eval -> phased TTT eval

@cocohearts
Collaborator

Leaderboard audit note (pre-cutoff state): I don't think this is record-ready as submitted. The headline is a single-seed result with no std/p-value evidence. For a score this close to the existing frontier, it needs a matching 3-seed package and significance evidence before it can be treated as a leaderboard row.

@Itssshikhar
Author

Hey @cocohearts, I'm aware of how close this seed is to the current top, but as I mentioned in the description, I'm out of Runpod credits. Is there a way to run a 3-seed mean on this submission to make it decisive?

Itssshikhar and others added 2 commits May 3, 2026 18:39
3-seed mean val_bpb = 1.06017968 (seeds 42, 0, 1234) on the published
train_gpt.py + env block. Seed 42 reproduces (1.05948583 vs README
1.05956571). Honest delta vs PR openai#1855 3-seed mean is -0.00090, not
the README's headline -0.00151 (which compares this candidate's best
seed to PR openai#1855's mean). All artifacts under the 16 MB cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Full per-seed verbose logs (seed42.log, seed0.log, seed1234.log) with
  hparams + source dump + per-step training trace + TTT phase details,
  matching parent record's train_seedXX.log convention.
- submission.json with machine-readable per-seed and 3-seed-mean numbers,
  artifact bytes (max 15,909,242 / cap 16,777,216), and per-seed deltas
  vs PR openai#1855.
- Banner at top of parent README pointing readers to three_seed_eval/
  so the corrected 1.06018 mean is visible alongside the 1.05957 headline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Itssshikhar
Author

I was able to run a 3-seed mean for the submission with no additional changes, plus logs. Here are the numbers:

Per-seed results

| Seed | Pre-quant | Post-quant | Post-TTT | Artifact (bytes) | Steps |
|------|-----------|------------|----------|------------------|-------|
| 42   | 1.06324   | 1.07238    | 1.05949  | 15,899,339       | 4,862 |
| 0    | 1.06411   | 1.07321    | 1.06029  | 15,903,214       | 4,849 |
| 1234 | 1.06430   | 1.07363    | 1.06076  | 15,909,242       | 4,878 |
| mean |           |            | 1.06018  | 15,903,932       |       |

- 3-seed stdev: 0.00064
- 3-seed spread (max − min): 0.00127
- All artifacts under cap. Tightest margin is seed 1234, with 867,974 B of headroom.
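The summary statistics above can be recomputed directly from the per-seed post-TTT BPB values (this is only a recomputation of the reported numbers, not new data):

```python
import statistics

# Per-seed post-TTT BPB values from the table above.
bpb = {"42": 1.05949, "0": 1.06029, "1234": 1.06076}

mean = statistics.mean(bpb.values())            # 3-seed mean
stdev = statistics.stdev(bpb.values())          # sample stdev (n-1)
spread = max(bpb.values()) - min(bpb.values())  # max - min
```

Rounded to five decimals this reproduces the reported mean 1.06018, stdev 0.00064, and spread 0.00127.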

@cocohearts let me know if this is enough to close the leaderboard score.

