Non-Record: Crawler 3f+2cx2 d=832 + Mixed Int5 GPTQ + TTT — val_bpb 1.0903 (1-hour cluster) #1817
Tonyy1977 wants to merge 2 commits into openai:main from
Conversation
…0903 (1-hour cluster)

30-hour local run (1-hour 8xH100 cluster equivalent):
- Pre-quant SWA: 1.0684
- GPTQ mixed-int (int5 flat-attn / int6 rest, no pruning): 1.1264
- Post-quant TTT freeze=1: 1.0903

Builds on PR openai#1579 (10-min track, 1.1372). 6x more training compute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…0910 (4-hour cluster validation)

120-hour local run (4-hour 8xH100 cluster equivalent), validating the PR openai#1817 recipe at 4x compute:
- Pre-quant: 1.0526
- Pre-quant SWA: 1.0629
- GPTQ mixed-int (int5 flat-attn / int6 rest, no pruning): 1.1238
- Post-quant TTT freeze=1: 1.0910
- Artifact: 14.62 MB (1.25 MB headroom under budget)

Key finding: 4x more training compute lands at essentially the same final TTT BPB (+0.0007 vs the 1-hour cluster). Multi-epoch saturation + late warmdown is the bottleneck. The 1-hour cluster equivalent (PR openai#1817) is the right operating point for d=832 + this recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
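Both commits end with a post-quant TTT (test-time training) pass with freeze=1. Purely as a hedged sketch — the PR does not show this code, and `model.blocks`, the eval-stream loader, and the reading of freeze=1 as "keep one block frozen" are all assumptions — the step might look roughly like this in PyTorch:

```python
import torch
import torch.nn.functional as F

def post_quant_ttt(model: torch.nn.Module, eval_stream, freeze: int = 1, lr: float = 1e-4):
    """Hypothetical post-quant TTT: fine-tune the quantized model on the held-out stream
    with `freeze` transformer blocks kept frozen (assumed interpretation of freeze=1)."""
    for block in list(model.blocks)[:freeze]:      # `model.blocks` is an assumed attribute
        for p in block.parameters():
            p.requires_grad_(False)

    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    model.train()
    for x, y in eval_stream:                       # (input tokens, next-token targets)
        logits = model(x)                          # [batch, seq, vocab]
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    return model
```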
Follow-up: 4-hour cluster equivalent validation run

Added a second submission folder (
Key findings:
Confirms that d=832 with this recipe at the 1-hour cluster equivalent is the right operating point; the 1.0903 result above is the primary claim. The 4-hour folder is a scaling/validation study showing where the recipe plateaus.

Thank you, OpenAI, for the $500 grant approved on the last day of the challenge. Unfortunately, there were no 8xH100 SXM instances available on RunPod to actually run the proper 4-hour cluster training, so the 120-hour local equivalent above is the substitute.
Non-Record: Crawler Transformer 3f+2cx2 d=832 — Mixed Int5 GPTQ + Post-Quant TTT — val_bpb 1.0903
val_bpb: 1.0903 | 15.96 MB | 1x RTX 6000 Ada 48GB, 30 hours (1-hour 8xH100 cluster equivalent)
Builds on PR #1579 (10-min track, val_bpb 1.1372). 6x more training compute → -0.047 BPB.
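For context on the headline number: val_bpb is bits-per-byte on the validation split. Assuming the usual definition (the PR text does not restate it), it is summed next-token cross-entropy converted from nats to bits and normalized by raw byte count, so the deltas quoted here are plain subtractions (1.1372 − 1.0903 ≈ 0.047):

```python
import math

def bits_per_byte(total_ce_nats: float, total_bytes: int) -> float:
    """Summed cross-entropy over all predicted tokens (in nats) -> bits per raw byte of text."""
    return total_ce_nats / (math.log(2) * total_bytes)
```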
Result Summary
Comparison to PR #1579 (10-min track)
Architecture: Crawler Transformer
3 flat blocks + 2 crawler blocks × 2 loops = 7 effective depth. dim=832, 47.4M params. SP8192 tokenizer; BigramHash, SmearGate, VE, and XSA on all 7 layers.
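A minimal sketch of the block layout only, as I read the description above: 3 flat blocks applied once, then 2 crawler blocks whose weights are re-applied for 2 loops, giving 3 + 2×2 = 7 effective depth. The block internals (BigramHash, SmearGate, VE, XSA) are omitted; `block` below is a placeholder, not the submission's module:

```python
import torch.nn as nn

class CrawlerStack(nn.Module):
    """Skeleton of the 3f+2cx2 layout: crawler weights are shared across loops, not duplicated."""
    def __init__(self, n_flat=3, n_crawler=2, n_loops=2, block=nn.Identity):
        super().__init__()
        self.flat = nn.ModuleList([block() for _ in range(n_flat)])
        self.crawler = nn.ModuleList([block() for _ in range(n_crawler)])
        self.n_loops = n_loops

    def forward(self, x):
        for blk in self.flat:              # 3 flat blocks, applied once each
            x = blk(x)
        for _ in range(self.n_loops):      # 2 crawler blocks re-applied twice
            for blk in self.crawler:
                x = blk(x)
        return x                           # 3 + 2*2 = 7 effective depth
```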
Quantization (Mixed Int5/Int6)
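The commits pin the mixing rule down as int5 for flat-attention weights and int6 for everything else, with no pruning. A hedged sketch of that assignment only — the predicate on parameter names is my guess, and the actual GPTQ quantizer from the submission is not reproduced here:

```python
import torch.nn as nn

def bits_for(param_name: str) -> int:
    """Mixed-int rule from the commit messages: int5 for flat-attention weights, int6 for the rest."""
    return 5 if ("flat" in param_name and "attn" in param_name) else 6

def build_quant_config(model: nn.Module) -> dict[str, int]:
    """Per-tensor GPTQ bit widths; every tensor is kept (no pruning)."""
    return {name: bits_for(name) for name, _ in model.named_parameters()}
```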
Key Findings
Credits