
Non-record: AR Self-Generated GPTQ Calibration (val_bpb=1.1461)#1234

Open
ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-g

Conversation

@ibarrajo

@ibarrajo ibarrajo commented Apr 1, 2026

Summary

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | Training-data GPTQ calibration |
| Approach G (self-gen GPTQ) + TTT | 1.1461 | TTT s_0 score |
| Approach G (self-gen GPTQ) base | 1.1559 | Before TTT |

Delta: +0.028 BPB vs. baseline, i.e. self-gen GPTQ loses net.

Analysis: Why Self-Gen GPTQ Loses

The technique requires reserving ~210s for AR generation + Hessian collection, leaving only 390s for training (vs 590s baseline), which costs ~30% of the training steps. While self-generated calibration data better matches the model's inference-time activation distribution, the quantization improvement (~0.002-0.003 BPB) is far smaller than the loss from fewer training steps (~0.03 BPB). The technique would become net positive if:

  1. AR generation were faster (batched generation, shorter sequences)
  2. Training were more sample-efficient (higher LR, better schedule)
  3. The training budget were longer (giving a smaller relative time cost)
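The arithmetic behind this conclusion can be sketched directly from the numbers above. The ~0.0025 BPB quantization gain is the PR's own estimate; attributing the entire residual to lost training steps is an assumption of this sketch:

```python
# Back-of-the-envelope model of the time/quality tradeoff described above.
# Budgets and BPB values come from this PR's results table; the attribution
# of the residual delta to lost training steps is an assumption.

BASELINE_TRAIN_S = 590   # Approach B training time
SELFGEN_TRAIN_S  = 390   # Approach G training time (210s reserved for AR gen)
BASELINE_BPB     = 1.1179
SELFGEN_BPB      = 1.1461

lost_steps_frac = 1 - SELFGEN_TRAIN_S / BASELINE_TRAIN_S   # fraction of steps lost
quant_gain_bpb  = 0.0025          # midpoint of the PR's ~0.002-0.003 estimate
step_loss_bpb   = (SELFGEN_BPB - BASELINE_BPB) + quant_gain_bpb

print(f"training steps lost: {lost_steps_frac:.0%}")        # ~34%
print(f"net delta vs baseline: {SELFGEN_BPB - BASELINE_BPB:+.4f} BPB")
print(f"implied step-loss cost: ~{step_loss_bpb:.4f} BPB")  # ~0.03, as claimed
```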

Key Changes from Approach B

  1. generate_autoregressive_calib() — generates 64 sequences of 2048 tokens at temp=0.8
  2. collect_hessians_from_tokens() — collects H = X^T X from self-generated sequences
  3. Training budget reduced from 590s to 390s to accommodate generation time
  4. Int6 GPTQ with Cholesky error compensation and column reordering
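A minimal sketch of the two new functions, assuming the signatures and sampling settings named in the list above; the bodies are illustrative, not the PR's actual implementation:

```python
import torch

@torch.no_grad()
def generate_autoregressive_calib(model, bos_id, n_seqs=64, seq_len=2048, temp=0.8):
    """Self-generated calibration: the trained model samples its own token
    sequences, so no training data is needed at eval time. Signature and
    sampling settings follow the PR description; the body is an assumed
    minimal implementation."""
    seqs = torch.full((n_seqs, 1), bos_id, dtype=torch.long)
    for _ in range(seq_len - 1):
        logits = model(seqs)[:, -1, :]                  # next-token logits
        probs = torch.softmax(logits / temp, dim=-1)
        seqs = torch.cat([seqs, torch.multinomial(probs, 1)], dim=1)
    return seqs

def collect_hessians_from_tokens(layer_inputs):
    """Accumulate H = X^T X over the self-generated calibration activations.
    layer_inputs: iterable of (n_tokens, d_in) activation matrices for one
    quantized layer."""
    H = None
    for X in layer_inputs:
        H = X.T @ X if H is None else H + X.T @ X
    return H
```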

Architecture

  • 11 layers, dim=512, 8 heads, 8 KV heads
  • BigramHash embedding (6144 x 128), Value embeddings
  • XSA on all layers, SmearGate, U-Net skip connections
  • ReLU^2 MLP (3.5x width)
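For concreteness, the architecture bullets above can be collected into a config sketch (field names are hypothetical, not the PR's actual hyperparameter names):

```python
from dataclasses import dataclass

@dataclass
class ApproachGConfig:
    # Values taken from the architecture list in this PR; field names are
    # illustrative only.
    n_layers: int = 11
    dim: int = 512
    n_heads: int = 8
    n_kv_heads: int = 8
    bigram_hash_vocab: int = 6144   # BigramHash embedding rows
    bigram_hash_dim: int = 128      # BigramHash embedding width
    mlp_mult: float = 3.5           # ReLU^2 MLP width multiplier
    use_value_embeds: bool = True
    use_smear_gate: bool = True
    use_unet_skips: bool = True     # U-Net skip connections, XSA on all layers

    @property
    def mlp_hidden(self) -> int:
        return int(self.dim * self.mlp_mult)
```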

Rule Compliance

  • Training <= 600s on 8xH100 (390s train + 210s AR gen + GPTQ)
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • No training data accessed during quantization (self-generated only)
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified AR generation produces coherent text (not garbage)
  • Confirmed no training/val data accessed during GPTQ calibration
  • Full 8xH100 training run completed within time budget
  • Artifact fits under 16MB

🤖 Generated with Claude Code

Model generates its own GPTQ calibration data (64 seqs x 2048 tokens,
temp=0.8) after training, eliminating the need for training data at eval
time. Built on the Approach B base. Cutting the training budget to 390s
(vs 590s) to reserve time for AR generation loses more from fewer training
steps than it gains from the better-matched calibration distribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: AR Self-Generated GPTQ Calibration (val_bpb=1.1461)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

**PR #1234 — AR Self-Gen GPTQ + Int6 + XSA + TTT**
Head SHA: 8c3f820
File audited: `records/track_10min_16mb/2026-04-01_ApproachG_SelfGenGPTQ/train_gpt.py` (1626 lines)

---

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key)

CLEAR. The BigramHash at lines 494–500 hashes using only input tokens:

```python
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
```

`t` is `tokens` (the input token sequence), not targets. Targets are not involved in the hash at any point. `forward()` (line 502) receives `token_ids` (input), not `target_ids`. This is standard BigramHash — no ILLEGAL pattern.

---

### Check 2: ILLEGAL pre-quant TTT (multi-epoch on val_tokens without score-first)

CLEAR. There is no TTT applied before quantization. The TTT function `eval_val_sliding_ttt` is only called after quantization (line 1609), on the final INT6 model. The quantization step uses AR self-generated calibration tokens only (lines 1024–1042): the model generates its own sequences via `torch.multinomial`; no val_tokens are accessed during GPTQ.

---

### Check 3: LEGAL score-first TTT (PR #1413 pattern, `is_last_chunk` guard)

CONFIRMED PRESENT AND CORRECTLY IMPLEMENTED. `eval_val_sliding_ttt` (line 758) implements the PR #1413 pattern faithfully:

- Phase 1 (line 828): `torch.inference_mode()` scoring block — all tokens in a chunk are scored with no gradient.
- Phase 2 (line 872): Training only begins after scoring completes, gated by the `is_last_chunk` guard at lines 872–873:

```python
is_last_chunk = (ci == num_chunks - 1)
if not is_last_chunk and ttt_epochs > 0:
```

The final chunk is scored but never trained on, preventing future-data leakage. Every token...

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.
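A minimal sketch of the score-first-per-chunk discipline the verdict describes, with illustrative names (the PR's actual function is `eval_val_sliding_ttt`; this is not its implementation):

```python
import torch

def score_first_ttt(model, chunks, optimizer, ttt_epochs=1):
    """Score-first-per-chunk TTT sketch: each chunk is scored under
    no-grad BEFORE any adaptation, and the final chunk is never trained
    on, so no scored token's loss reflects training on itself or on
    future data. Names are illustrative."""
    total_loss, total_tokens = 0.0, 0
    num_chunks = len(chunks)
    for ci, (inputs, targets) in enumerate(chunks):
        # Phase 1: score the chunk with no gradient (s_0 contribution).
        with torch.inference_mode():
            loss = torch.nn.functional.cross_entropy(
                model(inputs).flatten(0, 1), targets.flatten())
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        # Phase 2: adapt only AFTER scoring, and never on the last chunk.
        is_last_chunk = (ci == num_chunks - 1)
        if not is_last_chunk and ttt_epochs > 0:
            for _ in range(ttt_epochs):
                optimizer.zero_grad()
                loss_t = torch.nn.functional.cross_entropy(
                    model(inputs).flatten(0, 1), targets.flatten())
                loss_t.backward()
                optimizer.step()
    return total_loss / total_tokens
```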

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

