
Non-Record: Crawler 3f+2cx2 d=832 + Mixed Int5 GPTQ + TTT — val_bpb 1.0903 (1-hour cluster) #1817

Open
Tonyy1977 wants to merge 2 commits into openai:main from Tonyy1977:crawler-d832-1hr-mixedint5

Conversation

Tonyy1977 commented Apr 25, 2026

Non-Record: Crawler Transformer 3f+2cx2 d=832 — Mixed Int5 GPTQ + Post-Quant TTT — val_bpb 1.0903

val_bpb: 1.0903 | 15.96 MB | 1x RTX 6000 Ada 48GB, 30 hours (1-hour 8xH100 cluster equivalent)

Builds on PR #1579 (10-min track, val_bpb 1.1372). 6x more training compute → -0.047 BPB.

Result Summary

Stage | val_bpb
--- | ---
Pre-quant SWA | 1.0684
int8+SDClip roundtrip | 1.1381
GPTQ mixed-int (int5 flat-attn / int6 rest) roundtrip | 1.1264
Post-quant TTT (freeze=1) on GPTQ artifact | 1.0903

  • Steps: 30,374 (stopped by the 30-hour wallclock cap)
  • Artifact: 15,867,420 bytes, zero pruning
  • Total: 15,959,106 bytes (15.96 MB, under the 16 MB budget)
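For reference, val_bpb is validation bits per byte: the summed token negative log-likelihood, converted from nats to bits, divided by the number of UTF-8 bytes in the validation text. A minimal sketch of the conversion (variable names are illustrative, not taken from the evaluation harness):

```python
import math

def val_bpb(sum_nll_nats: float, total_utf8_bytes: int) -> float:
    """Bits per byte: total NLL in nats -> bits, normalized by the byte count."""
    return (sum_nll_nats / math.log(2)) / total_utf8_bytes

# e.g. 1.0903 bpb means the model spends ~1.09 bits of cross-entropy
# per byte of validation text.
```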

Comparison to PR #1579 (10-min track)

Config | Steps | Pre-quant BPB | TTT BPB | Hardware
--- | --- | --- | --- | ---
d=736 int6 (PR #1579) | 6,042 | 1.1232 | 1.1372 | 10-min cluster
d=832 int5-flat (this) | 30,374 | 1.0684 | 1.0903 | 1-hour cluster

Architecture: Crawler Transformer

3 flat blocks + 2 crawler blocks × 2 loops = 7 effective depth. dim=832, 47.4M params. SP8192 tokenizer; BigramHash, SmearGate, VE, and XSA on all 7 layers.
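To make the depth arithmetic concrete, here is a minimal sketch of the looping structure only: 3 distinct flat blocks applied once, then 2 weight-shared crawler blocks unrolled for 2 loops, giving 3 + 2×2 = 7 block applications. Class and argument names are illustrative assumptions, not the actual Crawler Transformer code:

```python
import torch.nn as nn

class CrawlerStack(nn.Module):
    """Sketch: 3 flat blocks + 2 crawler blocks reused over 2 loops = 7 effective layers."""
    def __init__(self, block_cls, dim=832, n_flat=3, n_crawler=2, n_loops=2):
        super().__init__()
        self.flat_blocks = nn.ModuleList([block_cls(dim) for _ in range(n_flat)])
        self.crawler_blocks = nn.ModuleList([block_cls(dim) for _ in range(n_crawler)])
        self.n_loops = n_loops

    def forward(self, x):
        for blk in self.flat_blocks:          # 3 distinct blocks, applied once
            x = blk(x)
        for _ in range(self.n_loops):         # 2 passes over the same crawler weights
            for blk in self.crawler_blocks:   # 2 shared blocks per pass
                x = blk(x)
        return x                              # 3 + 2*2 = 7 block applications total
```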

Quantization (Mixed Int5/Int6)

  • int5 for flat-block attention only (12 matrices)
  • int6 for everything else (22 matrices: flat MLP + all crawler)
  • int8 for embeddings
  • SDClip + GPTQ + Brotli, zero pruning (fits naturally at 15.96 MB); a bit-width assignment sketch follows below
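A minimal sketch of the mixed bit-width assignment. It uses plain round-to-nearest symmetric fake-quantization rather than the actual GPTQ error-compensation solver, and the parameter-name patterns are made-up placeholders rather than this repo's real module names:

```python
import torch

def bits_for(name: str) -> int:
    """Illustrative policy: int8 embeddings, int5 flat-block attention, int6 everything else."""
    if "embed" in name:
        return 8
    if "flat" in name and "attn" in name:
        return 5
    return 6

def rtn_roundtrip(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-row round-to-nearest quantize/dequantize (a stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

# Usage sketch: roundtrip each 2D weight at its assigned width, then re-measure val_bpb.
# for name, p in model.named_parameters():
#     if p.dim() == 2:
#         p.data.copy_(rtn_roundtrip(p.data, bits_for(name)))
```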

Key Findings

  1. Mixed-int beats pruning: Standard int6 needs 13.5% pruning at d=832 (roundtrip 1.1664), while mixed int5/int6 fits naturally with no pruning (roundtrip 1.1264); rough size arithmetic follows after this list.
  2. Int5 attention is robust, int5 MLP is not: quantizing only the flat-block attention to int5 saves space without significant quality loss.
  3. Pre-quant matters most: 6x more training compute → 0.041 BPB improvement at SWA, carrying through quantization and TTT.
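A rough back-of-envelope for finding 1, assuming the 12 flat-attention matrices are approximately square d×d projections at d=832 (an assumption; the real shapes differ, and the post-Brotli saving will be smaller):

```python
d = 832
n_mats = 12                    # flat-block attention matrices (e.g. Q/K/V/O x 3 layers)
saved_bits = n_mats * d * d    # int6 -> int5 drops 1 bit per parameter
print(f"~{saved_bits / 8 / 1e6:.2f} MB saved pre-compression")  # ~1.04 MB
```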

Credits

Tonyy1977 and others added 2 commits April 25, 2026 11:44
…0903 (1-hour cluster)

30-hour local run (1-hour 8xH100 cluster equivalent):
- Pre-quant SWA: 1.0684
- GPTQ mixed-int (int5 flat-attn / int6 rest, no pruning): 1.1264
- Post-quant TTT freeze=1: 1.0903

Builds on PR openai#1579 (10-min track, 1.1372). 6x more training compute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…0910 (4-hour cluster validation)

120-hour local run (4-hour 8xH100 cluster equivalent), validates the PR openai#1817 recipe at 4x compute:
- Pre-quant: 1.0526
- Pre-quant SWA: 1.0629
- GPTQ mixed-int (int5 flat-attn / int6 rest, no pruning): 1.1238
- Post-quant TTT freeze=1: 1.0910
- Artifact: 14.62 MB (1.25 MB headroom under budget)

Key finding: 4x more training compute lands at essentially the same final TTT BPB
(+0.0007 vs 1-hour cluster). Multi-epoch saturation + late warmdown is the bottleneck.
The 1-hour cluster equivalent (PR openai#1817) is the right operating point for d=832 + this recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tonyy1977 commented May 1, 2026

Follow-up: 4-hour cluster equivalent validation run

Added a second submission folder (records/track_non_record_16mb/2026-04-30_CrawlerTransformer_d832_4hrCluster_MixedInt5_TTT/) that documents a 120-hour local run (4-hour 8xH100 cluster equivalent) using the same recipe as this PR's main submission.

Stage | 1-hr cluster (this PR) | 4-hr cluster (validation)
--- | --- | ---
Pre-quant | 1.0674 | 1.0526
GPTQ roundtrip | 1.1264 | 1.1238
TTT final | 1.0903 | 1.0910
Artifact size | 15.96 MB | 14.62 MB
Steps | 30,374 | 122,832

Key findings:

  • 4x more training compute → +0.0007 BPB regression on final TTT (essentially flat)
  • Pre-quant + GPTQ both improve marginally; the gain is absorbed by less effective TTT recovery
  • Multi-epoch saturation (~6 passes through fineweb10B) plus warmdown_frac=0.6 starting too late is the bottleneck on the longer run (a schedule sketch follows after this list)
  • Brotli compresses the longer-trained model 8% better → 1.25 MB of unused budget headroom, suggesting d=896 with the same scheme would fit on a fresh budget run
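For context on the warmdown bullet above, a minimal sketch of a constant-then-linear-decay schedule. Treating warmdown_frac as the fraction of total steps at which the decay begins is an assumed interpretation, not something stated in this PR:

```python
def lr_at(step: int, total_steps: int, base_lr: float, warmdown_frac: float = 0.6) -> float:
    """Assumed semantics: hold base_lr until warmdown_frac of the run, then decay linearly to 0."""
    start = int(warmdown_frac * total_steps)
    if step < start:
        return base_lr
    remaining = max(total_steps - start, 1)
    return base_lr * max(0.0, (total_steps - step) / remaining)
```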

Confirms that d=832 with this recipe at the 1-hour cluster equivalent is the right operating point; the 1.0903 result above is the primary claim. The 4-hour folder is a scaling/validation study showing where the recipe plateaus.

Thank you to OpenAI for the $500 grant approved on the last day of the challenge. Unfortunately, there were no 8xH100 SXM instances available on RunPod to run the proper 4-hour cluster training, so the 120-hour local equivalent above is the substitute.
