
Non-Record: Crawler 3f+2cx2 d=832 + Mixed Int5 GPTQ + TTT — val_bpb 1.0903 (1-hour cluster) #1817

Open
Tonyy1977 wants to merge 2 commits into openai:main from Tonyy1977:crawler-d832-1hr-mixedint5

Conversation

Tonyy1977 commented Apr 25, 2026

Non-Record: Crawler Transformer 3f+2cx2 d=832 — Mixed Int5 GPTQ + Post-Quant TTT — val_bpb 1.0903

val_bpb: 1.0903 | 15.96 MB | 1x RTX 6000 Ada 48GB, 30 hours (1-hour 8xH100 cluster equivalent)

Builds on PR #1579 (10-min track, val_bpb 1.1372). 6x more training compute → -0.047 BPB.

Result Summary

Stage | val_bpb
--- | ---
Pre-quant SWA | 1.0684
int8+SDClip roundtrip | 1.1381
GPTQ mixed-int (int5 flat-attn / int6 rest) roundtrip | 1.1264
Post-quant TTT (freeze=1) on GPTQ artifact | 1.0903

  • Steps: 30,374 (stopped by the 30-hour wallclock cap)
  • Artifact: 15,867,420 bytes, zero pruning
  • Total: 15,959,106 bytes (15.96 MB, under the 16 MB budget)
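For reference, val_bpb is validation bits per byte: the summed token negative log-likelihood, converted from nats to bits, divided by the number of UTF-8 bytes in the validation text. A minimal sketch of the conversion (variable names are illustrative, not taken from the evaluation harness):

```python
import math

def val_bpb(sum_nll_nats: float, total_utf8_bytes: int) -> float:
    """Bits per byte: total NLL in nats -> bits, normalized by the byte count."""
    return (sum_nll_nats / math.log(2)) / total_utf8_bytes

# e.g. 1.0903 bpb means the model spends ~1.09 bits of cross-entropy
# per byte of validation text.
```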

Comparison to PR #1579 (10-min track)

Config | Steps | Pre-quant BPB | TTT BPB | Hardware
--- | --- | --- | --- | ---
d=736 int6 (PR #1579) | 6,042 | 1.1232 | 1.1372 | 10-min cluster
d=832 int5-flat (this) | 30,374 | 1.0684 | 1.0903 | 1-hour cluster

Architecture: Crawler Transformer

3 flat blocks + 2 crawler blocks × 2 loops = 7 effective depth. dim=832, 47.4M params. SP8192 tokenizer; BigramHash, SmearGate, VE, and XSA on all 7 layers.
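To make the depth arithmetic concrete, here is a minimal sketch of the looping structure only: 3 distinct flat blocks applied once, then 2 weight-shared crawler blocks unrolled for 2 loops, giving 3 + 2×2 = 7 block applications. Class and argument names are illustrative assumptions, not the actual Crawler Transformer code:

```python
import torch.nn as nn

class CrawlerStack(nn.Module):
    """Sketch: 3 flat blocks + 2 crawler blocks reused over 2 loops = 7 effective layers."""
    def __init__(self, block_cls, dim=832, n_flat=3, n_crawler=2, n_loops=2):
        super().__init__()
        self.flat_blocks = nn.ModuleList([block_cls(dim) for _ in range(n_flat)])
        self.crawler_blocks = nn.ModuleList([block_cls(dim) for _ in range(n_crawler)])
        self.n_loops = n_loops

    def forward(self, x):
        for blk in self.flat_blocks:          # 3 distinct blocks, applied once
            x = blk(x)
        for _ in range(self.n_loops):         # 2 passes over the same crawler weights
            for blk in self.crawler_blocks:   # 2 shared blocks per pass
                x = blk(x)
        return x                              # 3 + 2*2 = 7 block applications total
```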

Quantization (Mixed Int5/Int6)

  • int5 for flat-block attention only (12 matrices)
  • int6 for everything else (22 matrices: flat MLP + all crawler)
  • int8 for embeddings
  • SDClip + GPTQ + Brotli, zero pruning (fits naturally at 15.96 MB); a bit-width assignment sketch follows below
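A minimal sketch of the mixed bit-width assignment. It uses plain round-to-nearest symmetric fake-quantization rather than the actual GPTQ error-compensation solver, and the parameter-name patterns are made-up placeholders rather than this repo's real module names:

```python
import torch

def bits_for(name: str) -> int:
    """Illustrative policy: int8 embeddings, int5 flat-block attention, int6 everything else."""
    if "embed" in name:
        return 8
    if "flat" in name and "attn" in name:
        return 5
    return 6

def rtn_roundtrip(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-row round-to-nearest quantize/dequantize (a stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

# Usage sketch: roundtrip each 2D weight at its assigned width, then re-measure val_bpb.
# for name, p in model.named_parameters():
#     if p.dim() == 2:
#         p.data.copy_(rtn_roundtrip(p.data, bits_for(name)))
```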

Key Findings

  1. Mixed-int beats pruning: Standard int6 needs 13.5% pruning at d=832 (roundtrip 1.1664), while mixed int5/int6 fits naturally with no pruning (roundtrip 1.1264); rough size arithmetic follows after this list.
  2. Int5 attention is robust, int5 MLP is not: quantizing only the flat-block attention to int5 saves space without significant quality loss.
  3. Pre-quant matters most: 6x more training compute → 0.041 BPB improvement at SWA, carrying through quantization and TTT.
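A rough back-of-envelope for finding 1, assuming the 12 flat-attention matrices are approximately square d×d projections at d=832 (an assumption; the real shapes differ, and the post-Brotli saving will be smaller):

```python
d = 832
n_mats = 12                    # flat-block attention matrices (e.g. Q/K/V/O x 3 layers)
saved_bits = n_mats * d * d    # int6 -> int5 drops 1 bit per parameter
print(f"~{saved_bits / 8 / 1e6:.2f} MB saved pre-compression")  # ~1.04 MB
```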

Credits

Tonyy1977 and others added 2 commits April 25, 2026 11:44
…0903 (1-hour cluster)

30-hour local run (1-hour 8xH100 cluster equivalent):
- Pre-quant SWA: 1.0684
- GPTQ mixed-int (int5 flat-attn / int6 rest, no pruning): 1.1264
- Post-quant TTT freeze=1: 1.0903

Builds on PR openai#1579 (10-min track, 1.1372). 6x more training compute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…0910 (4-hour cluster validation)

120-hour local run (4-hour 8xH100 cluster equivalent), validates the PR openai#1817 recipe at 4x compute:
- Pre-quant: 1.0526
- Pre-quant SWA: 1.0629
- GPTQ mixed-int (int5 flat-attn / int6 rest, no pruning): 1.1238
- Post-quant TTT freeze=1: 1.0910
- Artifact: 14.62 MB (1.25 MB headroom under budget)

Key finding: 4x more training compute lands at essentially the same final TTT BPB
(+0.0007 vs 1-hour cluster). Multi-epoch saturation + late warmdown is the bottleneck.
The 1-hour cluster equivalent (PR openai#1817) is the right operating point for d=832 + this recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tonyy1977 commented May 1, 2026

Follow-up: 4-hour cluster equivalent validation run

Added a second submission folder (records/track_non_record_16mb/2026-04-30_CrawlerTransformer_d832_4hrCluster_MixedInt5_TTT/) that documents a 120-hour local run (4-hour 8xH100 cluster equivalent) using the same recipe as this PR's main submission.

Stage | 1-hr cluster (this PR) | 4-hr cluster (validation)
--- | --- | ---
Pre-quant | 1.0674 | 1.0526
GPTQ roundtrip | 1.1264 | 1.1238
TTT final | 1.0903 | 1.0910
Artifact size | 15.96 MB | 14.62 MB
Steps | 30,374 | 122,832

Key findings:

  • 4x more training compute → +0.0007 BPB regression on final TTT (essentially flat)
  • Pre-quant + GPTQ both improve marginally; the gain is absorbed by less effective TTT recovery
  • Multi-epoch saturation (~6 passes through fineweb10B) plus warmdown_frac=0.6 starting too late is the bottleneck on the longer run (a schedule sketch follows after this list)
  • Brotli compresses the longer-trained model 8% better → 1.25 MB of unused budget headroom, suggesting d=896 with the same scheme would fit on a fresh budget run
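For context on the warmdown bullet above, a minimal sketch of a constant-then-linear-decay schedule. Treating warmdown_frac as the fraction of total steps at which the decay begins is an assumed interpretation, not something stated in this PR:

```python
def lr_at(step: int, total_steps: int, base_lr: float, warmdown_frac: float = 0.6) -> float:
    """Assumed semantics: hold base_lr until warmdown_frac of the run, then decay linearly to 0."""
    start = int(warmdown_frac * total_steps)
    if step < start:
        return base_lr
    remaining = max(total_steps - start, 1)
    return base_lr * max(0.0, (total_steps - step) / remaining)
```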

Confirms that d=832 with this recipe at the 1-hour cluster equivalent is the right operating point; the 1.0903 result above is the primary claim. The 4-hour folder is a scaling/validation study showing where the recipe plateaus.

Thank you to OpenAI for the $500 grant approved on the last day of the challenge. Unfortunately, there were no 8xH100 SXM instances available on RunPod to run the proper 4-hour cluster training, so the 120-hour local equivalent above is the substitute.
