
GPTQ Int6 + SGD Test-Time Training — A800 1.1190 bpb#610

Open
ChaosCodes wants to merge 1 commit into openai:main from ChaosCodes:submission/gptq-ttt-1119

Conversation

@ChaosCodes

Summary

  • 11-layer 512d GPT with PR#414's 10-technique stack + LeakyReLU(0.5)² activation
  • GPTQ int6 quantization: Hessian-guided column-wise quantization replacing naive per-row rounding, reducing quantization error by 33.6% (Hessian-weighted MSE)
  • SGD test-time training (TTT): Continues training on validation data in a causal (score-first) manner with cosine LR decay, adapting last 9/11 layers
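The Hessian-guided, column-wise quantization in the second bullet can be sketched as a greedy GPTQ-style loop. This is a minimal unblocked illustration in numpy, not the PR's actual `train_gpt.py` code: the function names, the symmetric per-row int6 grid, and the damping constant are all illustrative assumptions.

```python
import numpy as np

def int6_quantize(x, scale):
    # Symmetric int6 grid: integer levels in [-32, 31].
    return np.clip(np.round(x / scale), -32, 31) * scale

def gptq_int6(W, X, damp=0.01):
    """Greedy column-wise quantization with Hessian-guided error
    compensation (GPTQ-style); one scale per output row.
    W: (out, in) weight matrix, X: (samples, in) calibration inputs."""
    W = W.astype(np.float64).copy()
    H = X.T @ X                                   # Hessian proxy from calibration data
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    Hinv = np.linalg.inv(H)
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0 + 1e-12
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = int6_quantize(W[:, j], scale[:, 0])
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # Fold this column's quantization error into the columns
        # that have not been quantized yet.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q, scale
```

Naive per-row rounding is the same grid without the error-feedback line; the compensation step is what the PR credits with the 33.6% reduction in Hessian-weighted MSE.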

Key Results

| Metric | Value |
| --- | --- |
| A800 bpb (GPTQ + TTT) | 1.1190 |
| A800 bpb (GPTQ only) | 1.1214 |
| A800 bpb (sliding window, no TTT) | 1.1243 |
| Estimated H100 bpb | ~1.122 |
| Artifact size | 15,750,888 bytes (98.4% of 16MB) |
| Training time | 1200s on 8×A800-SXM4-80GB |

Techniques

Architecture (PR#414 stack): XSA4, EMA, U-Net skip, SmearGate, BigramHash, PartialRoPE, LNScale, ValueEmbed, LateQAT, SWA

Novel contributions:

  1. LeakyReLU(0.5)² replacing GELU² — saves 0.0026 bpb by improving gradient flow
  2. GPTQ int6 — data-dependent quantization using 256 calibration samples, block-128 updates
  3. SGD TTT — Simple SGD (lr=0.002, momentum=0.9) with cosine schedule over 900 chunks of 32K tokens, 3 epochs/chunk
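The score-first TTT loop in item 3 can be sketched as follows. The lr=0.002, momentum=0.9, cosine decay over chunks, and 3 epochs/chunk match the numbers above; everything else (the function names, the generic `loss_fn`/`grad_fn` interface, the flat parameter vector) is an illustrative assumption standing in for the 11-layer GPT.

```python
import math
import numpy as np

def cosine_lr(step, total_steps, lr_max=2e-3, lr_min=0.0):
    # Cosine decay from lr_max to lr_min over total_steps chunks.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

def ttt_evaluate(chunks, w, loss_fn, grad_fn, epochs=3, momentum=0.9):
    """Score-first (causal) TTT: each chunk is scored with the current
    weights BEFORE any update, so no chunk is ever scored by a model
    that has already trained on it."""
    v = np.zeros_like(w)
    total = 0.0
    for i, (x, y) in enumerate(chunks):
        total += loss_fn(w, x, y)          # 1) score first (causal)
        lr = cosine_lr(i, len(chunks))
        for _ in range(epochs):            # 2) then adapt on the same chunk
            v = momentum * v - lr * grad_fn(w, x, y)
            w = w + v
    return total / len(chunks), w
```

On the real model, `w` would hold only the parameters of the last 9 of 11 layers, and each of the 900 chunks would be 32K validation tokens.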

Compression

zstd level-21 with long-distance matching (LDM) for model artifact compression.
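For illustration, an equivalent zstd CLI invocation might look like the following (the file names are hypothetical, not taken from the PR):

```shell
# --ultra unlocks compression levels above 19; --long=27 enables
# long-distance matching with a 2^27-byte (128 MiB) window.
zstd --ultra -21 --long=27 model_artifact.bin -o model_artifact.bin.zst
```

Window logs larger than the decompressor's default limit require passing a matching `--long` flag to `zstd -d` as well.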

Files

  • train_gpt.py — Full training + GPTQ + TTT evaluation pipeline
  • eval_gptq.py — Standalone GPTQ evaluation script
  • eval_ttt.py — Standalone TTT evaluation script
  • submission.json — Structured results metadata
  • train.log — Complete training log
  • README.md — Detailed writeup with technique descriptions and ablations

See records/track_10min_16mb/2026-03-24_GPTQ_TTT/README.md for full details.

11-layer 512d GPT with PR#414 10-technique stack + LeakyReLU² activation,
post-training GPTQ int6 quantization, and SGD test-time training with
cosine LR decay. Artifact size: 15.75MB (under 16MB limit).

Techniques: XSA4, EMA, U-Net skip, SmearGate, BigramHash, PartialRoPE,
LNScale, ValueEmbed, LateQAT, SWA, LeakyReLU², GPTQ int6, SGD TTT.
@MatoTeziTanka

Community Review — GPTQ Int6 + SGD Test-Time Training — A800 1.1190 bpb

BPB: 1.1190 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 1a4c3a9882b3, file records/track_10min_16mb/2026-03-24_GPTQ_TTT/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.05s, dim=512, layers=11, vocab=1024, code=67718 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
