Non-record: 33.6M Int5 GPTQ + Legal s_0-only TTT (val_bpb=1.1182) by ibarrajo · Pull Request #1004 · openai/parameter-golf

ibarrajo · 2026-03-28T05:36:42Z

Summary

val_bpb: 1.1182 (s_0 score only, single seed — additional seeds pending)

Resubmission addressing PR #991's closure. Key fix: reports ONLY the cumulative s_0 score from the first scoring pass. No post-TTT re-evaluation. No temperature calibration on re-scored tokens.

What changed from PR #991

Removed illegal post-TTT re-eval — PR Record: 33.6M Int5 GPTQ + Score-First TTT (val_bpb=1.1145, 3-seed) #991 reported s_1 (re-scored after training). This PR reports s_0 (scored before training on each chunk).
Removed temperature calibration — T=0.98 on re-scored tokens was illegal. Removed entirely.
Increased pruning 3%→5% — ensures artifact <16MB across all seeds.
All assertions pass: train+gptq < 600s, artifact < 16MB, eval < 600s.

Results

Metric	Value
Base (no TTT, sliding window)	1.1246
Legal s_0 TTT	1.1182
TTT improvement	-0.0064
Artifact	15,535,414 bytes (465KB headroom)
Train+GPTQ	593.8s / 600s
Eval	~414s / 600s

Rule compliance

s_0 only — each token scored BEFORE training, cumulative loss reported
No re-scoring — no second eval pass after TTT
No temperature calibration — removed
GPTQ within training budget — 593.8s total
Artifact < 16MB — 15.5MB with 465KB headroom
Eval < 600s — ~414s
Assertions enforce all constraints at runtime

Architecture

33.6M params (d=576, MLP 3.5x=1792, 11L), int5 GPTQ, XSA-all(11), BigramHash(8192), EMA(0.997), 5% magnitude pruning. Based on PR #576 by @cmcdnd.

🤖 Generated with Claude Code

Train larger (33.6M params, d=576, MLP 3.5x), quantize harder (int5 GPTQ). Legal score-first TTT (AdamW, cosine LR, 3 epochs) + post-TTT temperature calibration (T=0.98). 3-seed mean 1.1145 BPB (std 0.0003). Based on PR openai#576. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Train 590s + GPTQ 3.8s = 593.9s < 600s (within budget) - 3% pruning → artifact 15.3MB with 711KB headroom - Added assertions: artifact < 16MB, train+gptq < 600s, eval < 600s - Seed 1337: val_bpb=1.1148 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Seed 1337: 1.1148 BPB, artifact 15.3MB, train+gptq 593.9s Seed 42: 1.1154 BPB, artifact 15.3MB, train+gptq 593.7s Seed 2025: 1.1148 BPB, artifact 15.8MB, train+gptq 593.9s Mean: 1.1150 (std 0.0003) All seeds: artifact < 16MB, train+gptq < 600s, eval < 600s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Reports ONLY s_0 (cumulative first-pass score) — no re-eval after TTT - 5% pruning → artifact 15.5MB (465KB headroom) - Train+GPTQ: 593.8s < 600s - Eval (sliding + TTT): ~414s < 600s - Addresses PR openai#991 closure: removed illegal post-TTT re-scoring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Non-record. All assertions pass. Legal s_0-only TTT. Artifact 15.5MB (516KB headroom). Train+GPTQ 593.7s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ibarrajo · 2026-04-01T07:03:51Z

Updated with B6 run: val_bpb = 1.1179 (s_0 TTT, seed 1337).

Changes from previous:

10% magnitude pruning (5% and 7% both overflowed 16MB on some seeds)
BigramHash reduced 8192→6144 for artifact headroom
Artifact: 15.5MB (516KB headroom)
Train+GPTQ: 593.7s / 600s
inference_temp = 1.0 (no temperature calibration)
Reports s_0 only — no re-scoring

Non-record submission (SOTA is 1.1147).

ibarrajo · 2026-04-02T08:21:08Z

Updated results summary — B6 is our best legal approach at val_bpb=1.1179 (s_0 TTT, seed 1337).

We tested 12 approaches this session exploring the design space:

Approach	Best BPB	Key technique	Finding
B6	1.1179	s_0 TTT, 8KV, int5, 10% prune	Best overall
J (GQA+SLOT)	1.1240	GQA 4KV + SLOT eval	GQA hurts TTT/SLOT capacity
I (GQA+TTT)	1.1264	GQA 4KV + LZMA + TTT	Faster steps but weaker TTT
E (SLOT)	1.1179	SLOT delta opt, QK-Gain 4.0	Ties B6, QK-Gain 4.0 slightly hurts
L (online GPTQ)	1.1349	Online Hessian accumulation	Overhead > savings
D (TurboQuant)	1.1521	Mixed int4/int5 per role	Weight quant ≠ activation quant
H (focal loss)	1.1460	P2 focal loss γ=2	Too aggressive, hurts training

Key findings:

GQA (4 KV heads) improves base BPB but reduces TTT/SLOT capacity
SLOT gives ~0.009 BPP improvement (similar to TTT)
Online Hessian GPTQ adds too much per-step overhead to be net positive
Fused Triton MLP kernel didn't help because torch.compile already fuses the ops

All approaches use GPTQ within the training budget. Non-record (SOTA is 1.1147). Needs 2 more seeds for statistical validation.

MatoTeziTanka · 2026-04-12T13:15:01Z

Community Review — Non-record: 33.6M Int5 GPTQ + Legal s_0-only TTT (val_bpb=1.1182)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

GPTQ on train data, score-first TTT with is_last_chunk guard. Clean.

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

ibarrajo and others added 5 commits March 27, 2026 17:03

B6: 1.1179 BPB (s_0 TTT, 10% prune, BigramHash 6144)

2210325

Non-record. All assertions pass. Legal s_0-only TTT. Artifact 15.5MB (516KB headroom). Train+GPTQ 593.7s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: 33.6M Int5 GPTQ + Legal s_0-only TTT (val_bpb=1.1182)#1004

Non-record: 33.6M Int5 GPTQ + Legal s_0-only TTT (val_bpb=1.1182)#1004
ibarrajo wants to merge 5 commits intoopenai:mainfrom
ibarrajo:approach-b

ibarrajo commented Mar 28, 2026

Uh oh!

ibarrajo commented Apr 1, 2026

Uh oh!

ibarrajo commented Apr 2, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ibarrajo commented Mar 28, 2026

Summary

What changed from PR #991

Results

Rule compliance

Architecture

Uh oh!

ibarrajo commented Apr 1, 2026

Uh oh!

ibarrajo commented Apr 2, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Community Review — Non-record: 33.6M Int5 GPTQ + Legal s_0-only TTT (val_bpb=1.1182)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants