
Non-record: AR Self-Generated GPTQ Calibration (val_bpb=1.1461)#1234

Open
ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-g

Conversation

@ibarrajo

@ibarrajo ibarrajo commented Apr 1, 2026

Summary

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | Training-data GPTQ calibration |
| Approach G (self-gen GPTQ) + TTT | 1.1461 | TTT s_0 score |
| Approach G (self-gen GPTQ) base | 1.1559 | Before TTT |

Delta: +0.028 BPB vs. baseline, i.e. self-gen GPTQ loses net.

Analysis: Why Self-Gen GPTQ Loses

The technique requires reserving ~210s for AR generation + Hessian collection, leaving only 390s for training (vs 590s baseline), which costs ~30% of the training steps. While self-generated calibration data better matches the model's inference-time activation distribution, the quantization improvement (~0.002-0.003 BPB) is far smaller than the loss from fewer training steps (~0.03 BPB). The technique would become net positive if:

  1. AR generation were faster (batched generation, shorter sequences)
  2. Training were more sample-efficient (higher LR, better schedule)
  3. The training budget were longer (giving a smaller relative time cost)
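The arithmetic behind this conclusion can be sketched directly from the numbers above. The ~0.0025 BPB quantization gain is the PR's own estimate; attributing the entire residual to lost training steps is an assumption of this sketch:

```python
# Back-of-the-envelope model of the time/quality tradeoff described above.
# Budgets and BPB values come from this PR's results table; the attribution
# of the residual delta to lost training steps is an assumption.

BASELINE_TRAIN_S = 590   # Approach B training time
SELFGEN_TRAIN_S  = 390   # Approach G training time (210s reserved for AR gen)
BASELINE_BPB     = 1.1179
SELFGEN_BPB      = 1.1461

lost_steps_frac = 1 - SELFGEN_TRAIN_S / BASELINE_TRAIN_S   # fraction of steps lost
quant_gain_bpb  = 0.0025          # midpoint of the PR's ~0.002-0.003 estimate
step_loss_bpb   = (SELFGEN_BPB - BASELINE_BPB) + quant_gain_bpb

print(f"training steps lost: {lost_steps_frac:.0%}")        # ~34%
print(f"net delta vs baseline: {SELFGEN_BPB - BASELINE_BPB:+.4f} BPB")
print(f"implied step-loss cost: ~{step_loss_bpb:.4f} BPB")  # ~0.03, as claimed
```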

Key Changes from Approach B

  1. generate_autoregressive_calib() — generates 64 sequences of 2048 tokens at temp=0.8
  2. collect_hessians_from_tokens() — collects H = X^T X from self-generated sequences
  3. Training budget reduced from 590s to 390s to accommodate generation time
  4. Int6 GPTQ with Cholesky error compensation and column reordering
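A minimal sketch of the two new functions, assuming the signatures and sampling settings named in the list above; the bodies are illustrative, not the PR's actual implementation:

```python
import torch

@torch.no_grad()
def generate_autoregressive_calib(model, bos_id, n_seqs=64, seq_len=2048, temp=0.8):
    """Self-generated calibration: the trained model samples its own token
    sequences, so no training data is needed at eval time. Signature and
    sampling settings follow the PR description; the body is an assumed
    minimal implementation."""
    seqs = torch.full((n_seqs, 1), bos_id, dtype=torch.long)
    for _ in range(seq_len - 1):
        logits = model(seqs)[:, -1, :]                  # next-token logits
        probs = torch.softmax(logits / temp, dim=-1)
        seqs = torch.cat([seqs, torch.multinomial(probs, 1)], dim=1)
    return seqs

def collect_hessians_from_tokens(layer_inputs):
    """Accumulate H = X^T X over the self-generated calibration activations.
    layer_inputs: iterable of (n_tokens, d_in) activation matrices for one
    quantized layer."""
    H = None
    for X in layer_inputs:
        H = X.T @ X if H is None else H + X.T @ X
    return H
```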

Architecture

  • 11 layers, dim=512, 8 heads, 8 KV heads
  • BigramHash embedding (6144 x 128), Value embeddings
  • XSA on all layers, SmearGate, U-Net skip connections
  • ReLU^2 MLP (3.5x width)
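For concreteness, the architecture bullets above can be collected into a config sketch (field names are hypothetical, not the PR's actual hyperparameter names):

```python
from dataclasses import dataclass

@dataclass
class ApproachGConfig:
    # Values taken from the architecture list in this PR; field names are
    # illustrative only.
    n_layers: int = 11
    dim: int = 512
    n_heads: int = 8
    n_kv_heads: int = 8
    bigram_hash_vocab: int = 6144   # BigramHash embedding rows
    bigram_hash_dim: int = 128      # BigramHash embedding width
    mlp_mult: float = 3.5           # ReLU^2 MLP width multiplier
    use_value_embeds: bool = True
    use_smear_gate: bool = True
    use_unet_skips: bool = True     # U-Net skip connections, XSA on all layers

    @property
    def mlp_hidden(self) -> int:
        return int(self.dim * self.mlp_mult)
```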

Rule Compliance

  • Training <= 600s on 8xH100 (390s train + 210s AR gen + GPTQ)
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • No training data accessed during quantization (self-generated only)
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified AR generation produces coherent text (not garbage)
  • Confirmed no training/val data accessed during GPTQ calibration
  • Full 8xH100 training run completed within time budget
  • Artifact fits under 16MB

🤖 Generated with Claude Code

Model generates its own GPTQ calibration data (64 seqs x 2048 tokens,
temp=0.8) after training, eliminating the need for training data at eval
time. Built on the Approach B base. Cutting the training budget to 390s
(vs 590s) to reserve time for AR generation loses more from fewer training
steps than it gains from the better-matched calibration distribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: AR Self-Generated GPTQ Calibration (val_bpb=1.1461)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

**PR #1234 — AR Self-Gen GPTQ + Int6 + XSA + TTT**
Head SHA: 8c3f820
File audited: `records/track_10min_16mb/2026-04-01_ApproachG_SelfGenGPTQ/train_gpt.py` (1626 lines)

---

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAR. The BigramHash at lines 494–500 hashes using only input tokens:

```python
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
```

`t` is `tokens` (the input token sequence), not targets. Targets are not involved in the hash at any point. `forward()` (line 502) receives `token_ids` (input), not `target_ids`. This is standard BigramHash — no ILLEGAL pattern.

---

### Check 2: ILLEGAL pre-quant TTT (multi-epoch on val_tokens without score-first)

CLEAR. There is no TTT applied before quantization. The TTT function `eval_val_sliding_ttt` is only called after quantization (line 1609), on the final INT6 model. The quantization step uses AR self-generated calibration tokens only (lines 1024–1042): the model generates its own sequences via `torch.multinomial`; no val_tokens are accessed during GPTQ.

---

### Check 3: LEGAL score-first TTT (PR #1413 pattern, `is_last_chunk` guard)

CONFIRMED PRESENT AND CORRECTLY IMPLEMENTED. `eval_val_sliding_ttt` (line 758) implements the PR #1413 pattern faithfully:

- Phase 1 (line 828): `torch.inference_mode()` scoring block — all tokens in a chunk are scored with no gradient.
- Phase 2 (line 872): Training only begins after scoring completes, gated by the `is_last_chunk` guard at lines 872–873:

```python
is_last_chunk = (ci == num_chunks - 1)
if not is_last_chunk and ttt_epochs > 0:
```

The final chunk is scored but never trained on, preventing future-data leakage. Every token...

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.
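A minimal sketch of the score-first-per-chunk discipline the verdict describes, with illustrative names (the PR's actual function is `eval_val_sliding_ttt`; this is not its implementation):

```python
import torch

def score_first_ttt(model, chunks, optimizer, ttt_epochs=1):
    """Score-first-per-chunk TTT sketch: each chunk is scored under
    no-grad BEFORE any adaptation, and the final chunk is never trained
    on, so no scored token's loss reflects training on itself or on
    future data. Names are illustrative."""
    total_loss, total_tokens = 0.0, 0
    num_chunks = len(chunks)
    for ci, (inputs, targets) in enumerate(chunks):
        # Phase 1: score the chunk with no gradient (s_0 contribution).
        with torch.inference_mode():
            loss = torch.nn.functional.cross_entropy(
                model(inputs).flatten(0, 1), targets.flatten())
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        # Phase 2: adapt only AFTER scoring, and never on the last chunk.
        is_last_chunk = (ci == num_chunks - 1)
        if not is_last_chunk and ttt_epochs > 0:
            for _ in range(ttt_epochs):
                optimizer.zero_grad()
                loss_t = torch.nn.functional.cross_entropy(
                    model(inputs).flatten(0, 1), targets.flatten())
                loss_t.backward()
                optimizer.step()
    return total_loss / total_tokens
```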

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

