
Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460 #1233

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-h

Conversation


ibarrajo commented Apr 1, 2026

Summary

Tests focal loss (gamma=2.0) in place of standard cross-entropy on the Approach B base (Int5 GPTQ, 33.6M params). The result is a regression: val_bpb 1.1460 vs the 1.1179 baseline, so this is filed as a non-record negative result.

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | No focal loss |
| Approach H (focal, gamma=2.0) + TTT | 1.1460 | TTT s_0 score |
| Approach H (focal, gamma=2.0) base | 1.1537 | Before TTT |

Delta: +0.028 BPB vs baseline — focal loss hurts at gamma=2.0.

Analysis: Why Focal Loss Hurts

Focal loss at gamma=2.0 over-suppresses gradients from well-predicted tokens. In language modeling (unlike object detection, where focal loss originated), even "easy" tokens carry useful distributional signal. The (1-p)^2 factor reduces their gradient contribution too aggressively, slowing overall learning. A lower gamma (0.5-1.0) or curriculum-style scheduling might work better, but this was not explored.
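To make the suppression concrete, here is a standalone numeric illustration (not from the PR) of the focal weight (1-p)^gamma at a few correct-token probabilities:

```python
# Standalone illustration (not from the PR): the focal factor (1 - p)^gamma
# that multiplies each token's cross-entropy term.
for p in (0.9, 0.5, 0.1):
    for gamma in (0.5, 1.0, 2.0):
        print(f"p={p:.1f}  gamma={gamma:.1f}  weight={(1 - p) ** gamma:.3f}")

# At gamma=2.0, a well-predicted token (p=0.9) keeps only 1% of its loss
# weight ((0.1)^2 = 0.01), while a hard token (p=0.1) keeps 81%.
```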

Key Changes

  1. Single-line change in `forward()`: `loss = ((1 - (-ce).exp()).pow(gamma) * ce).mean()` (a fuller sketch follows this list)
  2. `FOCAL_GAMMA` env var (default 2.0; set to 0.0 for standard CE)
  3. No architecture, eval, or artifact size changes
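For context, a minimal self-contained sketch of how that one-liner slots into a loss computation. This is an illustration assuming per-token CE with reduction="none", not the PR's actual train_gpt.py code; the focal_loss name and tensor shapes are invented here.

```python
import os

import torch
import torch.nn.functional as F

# Illustrative sketch, not the PR's code: focal loss built around the
# one-liner above. At gamma=0 the (1 - p)^gamma factor is 1 everywhere,
# so this reduces exactly to standard mean cross-entropy.
FOCAL_GAMMA = float(os.environ.get("FOCAL_GAMMA", "2.0"))

def focal_loss(logits, targets, gamma=FOCAL_GAMMA):
    # Per-token cross-entropy (no reduction), shape (batch * seq,).
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    # ce = -log p, so (-ce).exp() recovers p, the probability assigned to
    # the correct token; (1 - p)^gamma then down-weights easy tokens.
    return ((1 - (-ce).exp()).pow(gamma) * ce).mean()
```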

Rule Compliance

  • Training <= 600s on 8xH100
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified focal loss implementation matches standard CE when gamma=0 (a minimal version of this check is sketched after this list)
  • Confirmed artifact size unchanged from baseline
  • Full 8xH100 training run completed within time budget
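A minimal version of that first check, assuming the illustrative focal_loss sketch above (the tolerance and tensor sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

# gamma=0 equivalence check, assuming the focal_loss sketch above.
torch.manual_seed(0)
logits = torch.randn(4, 16, 32)          # (batch, seq, vocab)
targets = torch.randint(0, 32, (4, 16))

ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
assert torch.allclose(focal_loss(logits, targets, gamma=0.0), ce, atol=1e-6)
```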

🤖 Generated with Claude Code

Commit message:

Replaces standard cross-entropy with focal loss (1-p)^2 * CE during training
to down-weight easy tokens and focus gradient on hard tokens. Built on
Approach B (Int5 GPTQ + 33.6M params). Focal loss at gamma=2.0 hurts BPB
by +0.028 vs baseline, suggesting the technique over-suppresses gradients
from well-predicted tokens that still carry useful signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis

### PR #1233 — Approach H: Focal Loss + Int5 GPTQ + 33.6M params

Head SHA: 65c71c5

---

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAR — Not present. Lines 495–501 define BigramHashEmbedding.bigram_hash(). The XOR at line 500 is:

out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod

Here t[..., 1:] is the current token and t[..., :-1] is the previous token — both are input tokens. The hash is computed inside forward(token_ids) (line 503), which receives input_ids, not targets. No target token is XOR'd into the hash key. This is a standard causal bigram hash. No CLOSE here.

---

### Check 2: ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first)

CLEAR — Not present. The only TTT path is eval_val_sliding_ttt() (lines 767–937). The docstring (line 776) explicitly states "Legal score-first TTT: score each chunk, then train on it." The implementation enforces this in two phases per chunk:

  • Phase 1 (Score): lines 843–878 — runs under torch.inference_mode(), scores all windows in chunk ci and accumulates loss/token/byte counts.
  • Phase 2 (Train): lines 880–914 — only runs if not is_last_chunk (line 882), training on the chunk's tokens after they have already been scored.

No pre-quant pattern found.

---

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

PRESENT and CORRECT.

  • is_last_chunk = (ci == num_chunks - 1) at line 881.
  • Training gate: if not is_last_chunk and ttt_epochs > 0: at line 882 —...
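For readers unfamiliar with the pattern, here is a schematic sketch of the score-first-per-chunk TTT loop described above. The names (model, chunks, the score and train_loss helpers) are hypothetical; this is not the PR's eval_val_sliding_ttt() code.

```python
import torch

# Schematic of score-first-per-chunk TTT (illustrative only): every chunk
# is scored BEFORE the model is ever trained on it, and the final chunk is
# never trained on at all (is_last_chunk guard).
def score_first_ttt(model, optimizer, chunks, ttt_epochs=1):
    total_loss, total_tokens = 0.0, 0
    num_chunks = len(chunks)
    for ci, chunk in enumerate(chunks):
        # Phase 1 (Score): evaluate under inference_mode, no gradient flow.
        with torch.inference_mode():
            loss, n_tokens = model.score(chunk)  # hypothetical helper
            total_loss += loss * n_tokens
            total_tokens += n_tokens
        # Phase 2 (Train): adapt on the chunk only after it was scored,
        # and never on the last chunk.
        is_last_chunk = (ci == num_chunks - 1)
        if not is_last_chunk and ttt_epochs > 0:
            for _ in range(ttt_epochs):
                optimizer.zero_grad()
                model.train_loss(chunk).backward()  # hypothetical helper
                optimizer.step()
    return total_loss / total_tokens
```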

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 26, 2026
Per-token NLL rescaled by detached, clipped, mean-1-normalized
ratio of own NLL to batch-mean NLL, raised to alpha (warmup-ramped).
Bit-identical to PR openai#1413 (1.0810 main frontier) when LOSS_REWEIGHT_ALPHA=0.

Patch is 4 surgical edits to PR openai#1413 train_gpt.py: hyperparameters
(+4 env vars), GPT.__init__ (+_train_step buffer), GPT.forward
(constant-branch on alpha==0 else weighted CE), step_fn (fill _train_step
each step). Wrapped LZMA script grew 308 bytes; tightest base seed
keeps ~7.7KB headroom under 16MB cap.

README acknowledges prior negative results (PR openai#1360 Gaussian reweight,
PR openai#1233 focal gamma=2, PR openai#1380 focal investigation) and frames this
as replication on a stronger TTT-heavy base where train-time hardness
focus could interact with eval-time TTT in ways the older bases can't show.
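A sketch of the described reweighting under stated assumptions: the clip bound, the exact order of clip/normalize/pow, and the helper name are invented here; only the alpha==0 constant branch and the general recipe (detached, clipped, mean-1-normalized hardness ratio raised to alpha) come from the commit message. Alpha itself would be warmup-ramped by the training loop.

```python
import torch

# Illustrative sketch of the described per-token NLL reweighting, not the
# actual 4-edit patch: each token's NLL is rescaled by a detached, clipped,
# mean-1-normalized ratio of its own NLL to the batch-mean NLL, raised to
# alpha. The clip bound and operation order are assumptions.
def reweighted_nll(nll: torch.Tensor, alpha: float, clip: float = 5.0) -> torch.Tensor:
    if alpha == 0.0:
        return nll.mean()  # constant branch: identical to plain mean NLL
    ratio = (nll / nll.mean()).detach()   # per-token hardness, no gradient
    w = ratio.clamp(max=clip).pow(alpha)  # clip outliers, ramp by alpha
    w = w / w.mean()                      # renormalize weights to mean 1
    return (w * nll).mean()
```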
