
Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460 #1233

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-h

Conversation


ibarrajo commented Apr 1, 2026

Summary

Tests focal loss (gamma=2.0) in place of standard cross-entropy on the Approach B base (Int5 GPTQ, 33.6M params). The result is a regression: val_bpb 1.1460 vs the 1.1179 baseline, so this is filed as a non-record negative result.

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | No focal loss |
| Approach H (focal, gamma=2.0) + TTT | 1.1460 | TTT s_0 score |
| Approach H (focal, gamma=2.0) base | 1.1537 | Before TTT |

Delta: +0.028 BPB vs baseline — focal loss hurts at gamma=2.0.

Analysis: Why Focal Loss Hurts

Focal loss at gamma=2.0 over-suppresses gradients from well-predicted tokens. In language modeling (unlike object detection, where focal loss originated), even "easy" tokens carry useful distributional signal. The (1-p)^2 factor reduces their gradient contribution too aggressively, slowing overall learning. A lower gamma (0.5-1.0) or curriculum-style scheduling might work better, but this was not explored.
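To make the suppression concrete, here is a standalone numeric illustration (not from the PR) of the focal weight (1-p)^gamma at a few correct-token probabilities:

```python
# Standalone illustration (not from the PR): the focal factor (1 - p)^gamma
# that multiplies each token's cross-entropy term.
for p in (0.9, 0.5, 0.1):
    for gamma in (0.5, 1.0, 2.0):
        print(f"p={p:.1f}  gamma={gamma:.1f}  weight={(1 - p) ** gamma:.3f}")

# At gamma=2.0, a well-predicted token (p=0.9) keeps only 1% of its loss
# weight ((0.1)^2 = 0.01), while a hard token (p=0.1) keeps 81%.
```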

Key Changes

  1. Single-line change in `forward()`: `loss = ((1 - (-ce).exp()).pow(gamma) * ce).mean()` (a fuller sketch follows this list)
  2. `FOCAL_GAMMA` env var (default 2.0; set to 0.0 for standard CE)
  3. No architecture, eval, or artifact size changes
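For context, a minimal self-contained sketch of how that one-liner slots into a loss computation. This is an illustration assuming per-token CE with reduction="none", not the PR's actual train_gpt.py code; the focal_loss name and tensor shapes are invented here.

```python
import os

import torch
import torch.nn.functional as F

# Illustrative sketch, not the PR's code: focal loss built around the
# one-liner above. At gamma=0 the (1 - p)^gamma factor is 1 everywhere,
# so this reduces exactly to standard mean cross-entropy.
FOCAL_GAMMA = float(os.environ.get("FOCAL_GAMMA", "2.0"))

def focal_loss(logits, targets, gamma=FOCAL_GAMMA):
    # Per-token cross-entropy (no reduction), shape (batch * seq,).
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    # ce = -log p, so (-ce).exp() recovers p, the probability assigned to
    # the correct token; (1 - p)^gamma then down-weights easy tokens.
    return ((1 - (-ce).exp()).pow(gamma) * ce).mean()
```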

Rule Compliance

  • Training <= 600s on 8xH100
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified focal loss implementation matches standard CE when gamma=0 (a minimal version of this check is sketched after this list)
  • Confirmed artifact size unchanged from baseline
  • Full 8xH100 training run completed within time budget
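A minimal version of that first check, assuming the illustrative focal_loss sketch above (the tolerance and tensor sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

# gamma=0 equivalence check, assuming the focal_loss sketch above.
torch.manual_seed(0)
logits = torch.randn(4, 16, 32)          # (batch, seq, vocab)
targets = torch.randint(0, 32, (4, 16))

ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
assert torch.allclose(focal_loss(logits, targets, gamma=0.0), ce, atol=1e-6)
```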

🤖 Generated with Claude Code

Commit message:

Replaces standard cross-entropy with focal loss (1-p)^2 * CE during training
to down-weight easy tokens and focus gradient on hard tokens. Built on
Approach B (Int5 GPTQ + 33.6M params). Focal loss at gamma=2.0 hurts BPB
by +0.028 vs baseline, suggesting the technique over-suppresses gradients
from well-predicted tokens that still carry useful signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

### PR #1233 — Approach H: Focal Loss + Int5 GPTQ + 33.6M params

Head SHA: 65c71c5

---

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAR — Not present. Lines 495–501 define BigramHashEmbedding.bigram_hash(). The XOR at line 500 is:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

Here t[..., 1:] is the current token and t[..., :-1] is the previous token — both are input tokens. The hash is computed inside forward(token_ids) (line 503), which receives input_ids, not targets. No target token is XOR'd into the hash key. This is a standard causal bigram hash. No CLOSE here.

---

### Check 2: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

CLEAR — Not present. The only TTT path is eval_val_sliding_ttt() (lines 767–937). The docstring (line 776) explicitly states "Legal score-first TTT: score each chunk, then train on it." The implementation enforces this in two phases per chunk:

  • Phase 1 (Score): lines 843–878 — runs under torch.inference_mode(), scores all windows in chunk ci and accumulates loss/token/byte counts.
  • Phase 2 (Train): lines 880–914 — only runs if not is_last_chunk (line 882), training on the chunk's tokens after they have already been scored.

No pre-quant pattern found.

---

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

PRESENT and CORRECT.

  • is_last_chunk = (ci == num_chunks - 1) at line 881.
  • Training gate: if not is_last_chunk and ttt_epochs > 0: at line 882 —...
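For readers unfamiliar with the pattern, here is a schematic sketch of the score-first-per-chunk TTT loop described above. The names (model, chunks, the score and train_loss helpers) are hypothetical; this is not the PR's eval_val_sliding_ttt() code.

```python
import torch

# Schematic of score-first-per-chunk TTT (illustrative only): every chunk
# is scored BEFORE the model is ever trained on it, and the final chunk is
# never trained on at all (is_last_chunk guard).
def score_first_ttt(model, optimizer, chunks, ttt_epochs=1):
    total_loss, total_tokens = 0.0, 0
    num_chunks = len(chunks)
    for ci, chunk in enumerate(chunks):
        # Phase 1 (Score): evaluate under inference_mode, no gradient flow.
        with torch.inference_mode():
            loss, n_tokens = model.score(chunk)  # hypothetical helper
            total_loss += loss * n_tokens
            total_tokens += n_tokens
        # Phase 2 (Train): adapt on the chunk only after it was scored,
        # and never on the last chunk.
        is_last_chunk = (ci == num_chunks - 1)
        if not is_last_chunk and ttt_epochs > 0:
            for _ in range(ttt_epochs):
                optimizer.zero_grad()
                model.train_loss(chunk).backward()  # hypothetical helper
                optimizer.step()
    return total_loss / total_tokens
```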

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 26, 2026
Per-token NLL rescaled by detached, clipped, mean-1-normalized
ratio of own NLL to batch-mean NLL, raised to alpha (warmup-ramped).
Bit-identical to PR openai#1413 (1.0810 main frontier) when LOSS_REWEIGHT_ALPHA=0.

Patch is 4 surgical edits to PR openai#1413 train_gpt.py: hyperparameters
(+4 env vars), GPT.__init__ (+_train_step buffer), GPT.forward
(constant-branch on alpha==0 else weighted CE), step_fn (fill _train_step
each step). Wrapped LZMA script grew 308 bytes; tightest base seed
keeps ~7.7KB headroom under 16MB cap.

README acknowledges prior negative results (PR openai#1360 Gaussian reweight,
PR openai#1233 focal gamma=2, PR openai#1380 focal investigation) and frames this
as replication on a stronger TTT-heavy base where train-time hardness
focus could interact with eval-time TTT in ways the older bases can't show.
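A sketch of the described reweighting under stated assumptions: the clip bound, the exact order of clip/normalize/pow, and the helper name are invented here; only the alpha==0 constant branch and the general recipe (detached, clipped, mean-1-normalized hardness ratio raised to alpha) come from the commit message. Alpha itself would be warmup-ramped by the training loop.

```python
import torch

# Illustrative sketch of the described per-token NLL reweighting, not the
# actual 4-edit patch: each token's NLL is rescaled by a detached, clipped,
# mean-1-normalized ratio of its own NLL to the batch-mean NLL, raised to
# alpha. The clip bound and operation order are assumptions.
def reweighted_nll(nll: torch.Tensor, alpha: float, clip: float = 5.0) -> torch.Tensor:
    if alpha == 0.0:
        return nll.mean()  # constant branch: identical to plain mean NLL
    ratio = (nll / nll.mean()).detach()   # per-token hardness, no gradient
    w = ratio.clamp(max=clip).pow(alpha)  # clip outliers, ramp by alpha
    w = w / w.mean()                      # renormalize weights to mean 1
    return (w * nll).mean()
```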
