BPB-weighted training loss: align training objective with eval metric#1519
elliottdehn wants to merge 1 commit into openai:main
Conversation
Tagging myself, as @definenoob is my alt account

Apologies, I had an out of date repo. Ignore this PR.
Thanks for writing this up @elliottdehn — a three-line training-time loss reweight tied to the eval metric is exactly the kind of clean, composable technique proposal worth landing on its own merits, even before an official 8xH100 run. A few notes from reading the diff.

What the change actually is

Three edits to train_gpt.py.
On the BPB numbers

Just to clear up the numbers for anyone skimming (I had to squint at them too). A couple of questions on that front:
Compliance audit
CPU gauntlet (CT2038 proteus-engine, 2026-04-11)

Import clean, param count identical to baseline (17,059,912), random-init forward loss 6.9354 — all consistent with "no architectural change, training-side loss reweight only." The wrapper's artifact check reports FAIL at 68MB because the CPU pre-flight dumps raw fp32 state without running the int8+zlib path — this is the standard baseline behavior on the CPU gauntlet and not a PR-specific regression.

Verdict

This looks like a clean, small, composable technique contribution. Happy to see the 3-seed 8xH100 numbers land; the preliminary delta is large enough that even a significant haircut on the official run would still make this worth keeping as an orthogonal training-side lever. The framing of aligning the training objective with the eval metric is the right way to motivate it, and the three-line implementation (plus one strict-mode relaxation) is minimally invasive.

Non-blocking suggestions if you do a follow-up push:
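For readers not familiar with the gauntlet's pre-flight: the reported random-init forward loss of 6.9354 is about what you would expect from near-uniform predictions over a 1024-token vocab, since ln(1024) ≈ 6.93. A self-contained sketch of that style of sanity check, using a toy stand-in model rather than the repo's GPT:

```python
# Sketch of a CPU pre-flight sanity check: count parameters and confirm that a
# randomly initialized model's cross-entropy sits near ln(vocab_size).
# The tiny model below is a stand-in for illustration, NOT the repo's GPT.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 1024
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

n_params = sum(p.numel() for p in model.parameters())
print(f"param count: {n_params}")

tokens = torch.randint(0, vocab_size, (4, 128))
logits = model(tokens)  # (B, T, vocab)
loss = F.cross_entropy(logits.view(-1, vocab_size), tokens.view(-1))
print(f"random-init CE: {loss.item():.4f}  (ln(vocab) = {math.log(vocab_size):.4f})")
```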
Reviewed by @MatoTeziTanka — The Agora. CPU gauntlet (CT2038 proteus-engine, 2026-04-11): 17,059,912 params, forward loss 6.9354, import+forward clean. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at the reviewed SHA.
Community Review — BPB-weighted training loss: align training objective with eval metric

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache. PR #1519 modifies train_gpt.py only.

- N-gram / XOR family bug check — CLEAR
- TTT / Pre-Quant TTT check — CLEAR
- Scored-region SLOT check — CLEAR

What the PR actually does:
No rule violations found. The submission is a pure neural training improvement. Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission. Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually. |
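The deterministic AST classifier itself isn't shown in this thread; purely as an illustration of that flavor of check, a short `ast` walk over `train_gpt.py` flagging disallowed identifiers could look like the following (the banned-substring list and file path are hypothetical, not the reviewers' actual classifier):

```python
# Illustrative sketch of a deterministic AST-based compliance scan.
# The banned-identifier substrings below are hypothetical examples.
import ast

BANNED_SUBSTRINGS = ("ngram_cache", "test_time_train", "slot_lookup")  # hypothetical

def scan(path: str) -> list[str]:
    tree = ast.parse(open(path, encoding="utf-8").read(), filename=path)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Name, ast.Attribute, ast.FunctionDef)):
            name = getattr(node, "id", None) or getattr(node, "attr", None) or getattr(node, "name", "")
            if any(b in name.lower() for b in BANNED_SUBSTRINGS):
                hits.append(f"{path}:{node.lineno}: suspicious identifier '{name}'")
    return hits

if __name__ == "__main__":
    hits = scan("train_gpt.py")
    print("\n".join(hits) if hits else "no flagged identifiers")
```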
…1716)

Two orthogonal training-time levers queued behind spec 011:

- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token. Aligns training objective with eval metric. Risk: SP8192 vocab destabilization (author warns on large vocabs) + CaseOps byte LUT accounting (~1hr of careful code).
- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed added to token embedding pre-block-0. ~540K params / ~400KB artifact. openai#1736 genuinely lacks this despite prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk) → 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
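For context on the second lever above, a rough sketch of what a 16384×32 hash-table bigram embedding added to the token embedding before block 0 could look like (hash constant, module names, and the output projection are assumptions, not openai#1716's actual code):

```python
# Rough sketch of a hash-table bigram embedding (16384 x 32 table, ~524K params)
# added to the token embedding before the first transformer block.
import torch
import torch.nn as nn

class BigramHashEmbed(nn.Module):
    def __init__(self, n_buckets: int = 16384, dim: int = 32, d_model: int = 768):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, dim)        # 16384 * 32 = 524,288 params
        self.proj = nn.Linear(dim, d_model, bias=False)  # lift to model width

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) token ids. Build (prev, current) bigram ids, hash into buckets.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                       # no predecessor at position 0
        bigram = prev * 1_000_003 + tokens   # cheap multiplicative mixing; illustrative
        bucket = bigram % self.n_buckets
        return self.proj(self.table(bucket)) # (B, T, d_model), added to token embedding

# usage (inside the model's forward, before block 0):
#   x = self.wte(tokens) + self.bigram_embed(tokens)
```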
Endpoint val_bpb +0.0025 vs spec 008 on seed 42. Outside ±0.0015 single-seed 95% CI on unfavorable side. Confounded by RNG-stream drift (+0.0323 train_loss gap at step 500 — ~50× larger than pre-registered expectation for a zero-init projection). Mid-training gap-closes to +0.0021 by step 3500 but endpoint remains unfavorable. Decision: shelve for this push. RNG-control retry doesn't reflect shipping reality (authors don't RNG-control either). 3-seed confirmation (~$60) is 40% of remaining budget — not warranted for a lever whose single-seed point is already on the wrong side. Next: spec 014 (BPB-weighted CE, port openai#1519) moves to front of queue. Cost: ~$5. Running total ~$133 remaining. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pinned to commit ab6a131 on exp/bpb-weighted. Single-seed screening on 8×H100 after required 2H smoke (SP8192 destabilization risk is real per openai#1519's explicit warning — no skip-smoke gamble this time). Uses base_bytes_lut (surface-piece bytes) as CaseOps approximation. TTT path left untouched. Expected Δ: −0.002 to −0.005 if transfers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Weight each token's cross-entropy loss by the number of UTF-8 bytes it encodes, directly aligning the training objective with the BPB evaluation metric. Three lines changed in train_gpt.py.

The Change
`_byte_weights` is `base_bytes_lut.float().clamp(min=1.0)`, registered as a buffer after model creation. `base_bytes_lut` already exists in the codebase for BPB evaluation — we just reuse it during training.
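As a minimal sketch of how the described reweighting can be wired up, assuming `base_bytes_lut` holds per-token UTF-8 byte counts (the helper names below are illustrative, not the PR's actual function names):

```python
# Minimal sketch of a BPB-weighted cross-entropy. Only the clamped byte-weight
# buffer and the weighted reduction correspond to what the PR describes;
# everything else here is scaffolding for illustration.
import torch
import torch.nn.functional as F

def register_byte_weights(model: torch.nn.Module, base_bytes_lut: torch.Tensor) -> None:
    # Registered as a buffer so it moves with the model (device/dtype) but isn't trained.
    model.register_buffer("_byte_weights", base_bytes_lut.float().clamp(min=1.0))

def bpb_weighted_ce(model: torch.nn.Module, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, V), targets: (B, T). Per-token CE, weighted by target byte count.
    per_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    w = model._byte_weights[targets.view(-1)]
    # Normalize by total bytes so the loss scale stays comparable to unweighted CE.
    return (per_tok * w).sum() / w.sum()
```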
The eval metric is bits per byte, but standard CE weights all tokens equally. A token encoding "the " (4 bytes) accounts for 4x as many bytes as a single-byte token when BPB is computed, yet gets the same weight in training. BPB-weighting shifts gradient toward the multi-byte tokens that matter most for the eval metric.
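A toy computation, with made-up per-token losses, shows the effect of the reweighting:

```python
# Toy illustration of byte-weighted vs. unweighted CE (numbers are arbitrary).
import torch

ce     = torch.tensor([3.0, 1.0])  # per-token cross-entropy, nats
nbytes = torch.tensor([4.0, 1.0])  # UTF-8 byte counts of the target tokens

unweighted = ce.mean()                           # (3 + 1) / 2  = 2.0
weighted   = (ce * nbytes).sum() / nbytes.sum()  # (12 + 1) / 5 = 2.6
# The 4-byte token's loss now carries 4/5 of the objective instead of 1/2,
# which is the gradient shift toward multi-byte tokens described above.
print(unweighted.item(), weighted.item())
```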
This works specifically because the 1024-token SentencePiece vocab has gentle byte-length variance (1-8x). We verified it does NOT work with large vocabularies (GPT-2's 50K), where extreme byte lengths destabilize training.
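If you want to check whether a tokenizer is in that gentle-variance regime before enabling the weighting, a quick sketch using the sentencepiece Python API (the model path is a placeholder, and piece bytes only approximate surface bytes):

```python
# Sketch: inspect the byte-length spread of a SentencePiece vocab before turning
# on byte-weighted CE. "▁" is SentencePiece's word-boundary marker, treated here
# as a single space byte.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # placeholder path
lengths = []
for i in range(sp.get_piece_size()):
    piece = sp.id_to_piece(i).replace("\u2581", " ")
    lengths.append(max(1, len(piece.encode("utf-8"))))

print(f"min/max byte length: {min(lengths)} / {max(lengths)} "
      f"(ratio {max(lengths) / min(lengths):.1f}x)")
```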
Preliminary Results (2x RTX 5090, SDPA fallback)
Compared against unmodified baseline on same hardware, same seed:
Post-EMA: 1.1146 vs current record's post-EMA 1.1340. Delta: -0.0194 bpb.
Gap widens monotonically through training.
Status
Preliminary / non-record submission. Results are on 2x RTX 5090, not the official 8xH100 environment. Awaiting compute grant for:
Will update this PR with official results once H100 runs are complete.
🤖 Generated with Claude Code