Record: ImprovedParallelResiduals, 1.0758 BPB / 2.7789 nats, -0.0020 BPB / -0.0052 nats vs PR #1523 (#1529)
msisovic wants to merge 12 commits into openai:main from
Conversation
Awesome submission! I was just looking through your seed1337 log and I noticed your gptq_reserve_seconds is set to 12.0 in row 21 but the hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see if it happened in all 3 runs.
Awesome, thanks for noticing! I reran it with 13s reserved, and as expected it didn't noticeably change the score. However, I noticed that I had accidentally run all three runs with seed 1337, so I corrected that as well. That was a bit of a hit on the score, but it still clears the bar.
…1.0752 Systems-level optimizations (fused Muon, EMA foreach, loader prealloc) on PR openai#1529's dual-lane parallel residual architecture. Identical ML; faster step time yields extra training steps. 3-seed mean: 1.0752 BPB / 2.7773 nats. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…0639 Systems-level optimizations (fused Muon, EMA foreach, loader prealloc) on PR openai#1578's casefold tokenizer + PR openai#1529's parallel residuals. Identical ML; faster step time yields extra training steps. 3-seed mean: 1.0639 BPB / 3.0705 nats. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I've inlined the custom CUTLASS kernel from the base PR into my train_gpt script to avoid any compliance issues. The core logic remains unchanged. To keep it fair to submissions that came in the meantime, the results are from a new run; they are even a bit worse, probably down to GPU cluster variance.
I hope this helps get it approved. Good luck!
First lever layered on the new openai#1736 baseline. Hadamard rotation of weight matrices before GPTQ quantization, hotstarted off spec 008's pre_gptq.pt FP checkpoint. No retraining. Witnessed at claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a openai#1529-adjacent base; expected to compose cleanly with openai#1736 since the quant stage is orthogonal to CaseOps / attention gates / phased TTT. Rotation is a post-training transform with three classes (residual-stream, per-layer attn, per-layer MLP); FP forward pass is invariant by construction, only quantization error drops. Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full retrain. Same hotstart checkpoint reused by future quant experiments (per-group bit, AR-selfgen calib, AWQ). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
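The invariance claim in that commit (an orthogonal rotation of the weights leaves the FP forward pass unchanged, so only quantization error drops) can be illustrated with a small NumPy sketch. This is a generic Hadamard-rotation demo under made-up shapes, not the submitted implementation:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

n = 8
H = hadamard(n)
rng = np.random.default_rng(0)
W = rng.standard_normal((16, n))  # a weight matrix about to be quantized
x = rng.standard_normal(n)        # an incoming activation

# Rotate the weight's input dimension and counter-rotate the activation:
# since H is orthogonal, (W @ H.T) @ (H @ x) == W @ x, so the FP forward
# pass is invariant by construction. Quantizing W @ H.T instead of W is
# what reduces the GPTQ quantization error.
assert np.allclose((W @ H.T) @ (H @ x), W @ x)
```

In a real pipeline the rotation would be folded into the checkpoint before GPTQ, so inference code sees ordinary weight matrices and needs no changes.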
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Codex <noreply@openai.com>
* Update parameter golf leaderboard with BOS fix
* Credit PR 1797 in leaderboard update
* Credit CaseOps and PR 1787 leaderboard rows
* Apply p-value progression leaderboard cutoff
* Address leaderboard review comments
* Clarify BOS fix leaderboard evidence
* Shorten leaderboard p-value notes
* Remove non-frontier leaderboard rows
* Clarify SmearGate BOS fix attribution
* Exclude #1518 from chronological frontier
* Use submitted #1855 score
* Restore #1529 chronological frontier
* Restore #1529 chronological frontier

---------

Co-authored-by: Codex <noreply@openai.com>
cocohearts
left a comment
This submission is accepted on substance, but please do a format-only cleanup before we merge it. The record directory is currently records/track_10min_16mb/2026-04-11_ImrpovedParallelResiduals; please rename it to fix the typo and use a descriptive standard name such as records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT_LegalTTT. Please also remove train_gpt_human.py unless it is required for the submitted artifact; the mergeable record package should be the scored train_gpt.py, README.md, submission.json, requirements if needed, and the seed logs.
Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.
Thanks for the review, will address this shortly.
|
@cocohearts Comment addressed, should be ready for merge now.
- Parallel residuals: credit PR openai#1204 (first) and PR openai#1529 (layer 8)
- Depth recurrence: clarify 2-pass origin (openai#1344), 3-pass is final config
- Loop curriculum frac=0.35: correct PR to openai#1420 (not openai#1344)
- SmearGate: note original PR openai#162, SP8192 reintroduction in openai#1667
- Attention output gate: PR openai#1667 first, PR openai#1787 narrowed it
- LeakyReLU²: replaced relu², not GELU
- EMA decay: remove specific wrong value, defer to PR openai#287
- GPTQ first: correct to PR openai#535 (not openai#1019)
- Global SGD phase: 2000 docs (openai#1610) refined to 2500 (openai#1626)
- Training closing line: "halfway point" → "35% mark"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Record: Improved Parallel Residuals
val_bpb: 1.07578747 (3-seed mean, std 0.0007) | 2.77887078 nats | ~15.98 MB | 8xH100 SXM, 600s | Legal TTT
This submission starts from PR #1523. Most of the newer submissions moved away from my fuller parallel-residual formulation and settled on a simpler GPT-J-style split-lane decoder. This version keeps the strong parts of that newer baseline and reintroduces the useful parts of my parallel residual implementation.
The key architectural change relative to PR #1523 is in the decoder after the split point. Attention and MLP read from different lanes, but neither sublayer writes back immediately. Instead, both outputs are accumulated into the two lanes together at the end of the block:
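A minimal sketch of that accumulation pattern, with illustrative scalar routing weights standing in for the learned routing and plain callables standing in for the attention and MLP sublayers (this is not the submitted train_gpt.py code):

```python
import numpy as np

def dual_lane_block(lane0, lane1, attn, mlp, w00=0.7, w01=0.3, w10=0.3, w11=0.7):
    # Attention and MLP read from different lanes; neither sublayer
    # writes back immediately (GPT-J-style parallel-in-time update).
    attn_out = attn(lane0)  # attention reads lane0
    mlp_out = mlp(lane1)    # MLP reads lane1
    # Both outputs are accumulated into the two lanes together at the
    # end of the block; the w.. scalars are placeholders for the learned
    # routing between the attention and MLP lanes.
    return (lane0 + w00 * attn_out + w01 * mlp_out,
            lane1 + w10 * attn_out + w11 * mlp_out)
```

With identity sublayers this just mixes the two lane states, which makes the cross-lane routing easy to see in isolation.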
That keeps the GPT-J-style parallel-in-time update, while restoring the richer learned routing between the attention and MLP lanes. The other important part is that decoder U-Net skips are still written only into `lane0`, which preserves the cheaper and more stable skip path from the newer baseline. Attention reads the mixed `lane0`/`x0` path, while MLP reads the raw `lane1`. Final output uses the mean of the two lanes.

In practice, that is pretty much the only modeling change here versus PR #1523, together with moving `PARALLEL_RESIDUAL_START` from the baseline's 7 to 8. I ablated that start-layer change separately on top of the plain PR #1523 baseline, without my fuller parallel residual routing changes, and it gave a mild regression on its own. The other notable requirement is that I needed the CUTLASS EVT path to recover the full throughput. In this iteration the CUDA/C++ source is inlined into the training script itself and built against a standard `/opt/cutlass` checkout rather than shipping a separate prebuilt `.so`.

Results (8xH100 80GB SXM, 600s)
Reproducibility