
Commit 302f714

Mato committed
Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819
3-seed mean: 1.0819 BPB (std: 0.00088). Seeds: 42=1.0808, 1337=1.0829, 2024=1.0821.

Integration of community techniques with full attribution:
- Base: PR openai#549, openai#1019 by @abaybektursun
- Scylla tokenizer: PR openai#1143 by @simon-marcus
- Parallel residuals + depth recurrence: PR openai#1204 by @msisovic
- Legal TTT: PR openai#461 by @Christopher-Lee-McClendon

Our engineering: mixed INT5/INT6 quantization, learnable lane merge, Scylla retokenization pipeline, integration work, CPU e2e test suite.
1 parent ad29bed commit 302f714

3 files changed

Lines changed: 2830 additions & 0 deletions


Lines changed: 82 additions & 0 deletions
# Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819

## Result

**val_bpb: 1.0819** (3-seed mean, std: 0.00088) | Scylla tokenizer | 8×H100 SXM

| Seed | Sliding Window BPB | Roundtrip BPB | Steps | Train Time |
|------|--------------------|---------------|-------|------------|
| 42   | 1.08075            | 1.10284       | 5,884 | 600.1s     |
| 1337 | 1.08289            | 1.10489       | 5,905 | 600.0s     |
| 2024 | 1.08213            | 1.10421       | 5,894 | 600.0s     |
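The headline statistics can be reproduced directly from the full-precision per-seed values reported in the results JSON; the 0.00088 figure is the population standard deviation:

```python
import statistics

# Full-precision per-seed sliding-window val_bpb from the results JSON
seeds = {42: 1.08075297, 1337: 1.08288515, 2024: 1.08213226}

mean = sum(seeds.values()) / len(seeds)
std = statistics.pstdev(seeds.values())  # population std, as reported

print(round(mean, 4), round(std, 5))     # 1.0819 0.00088
```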
## What This Submission Is

**Skilled integration of community techniques onto a strong neural base.** The engineering work is ours — the foundational techniques are not. We credit every source below.
### Our Engineering (original to this submission)

1. **Mixed INT5/INT6 per-layer quantization** — INT5 for MLP layers, INT6 for attention, tuned to fit the 16 MB artifact budget
2. **Learnable lane merge + separate `resid_mix_mlp`** — learnable scalar mixing for parallel residual streams with per-dimension MLP routing
3. **Scylla retokenization pipeline** — on-pod retokenization from SP1024 shards to the Scylla vocabulary
4. **Integration engineering** — making parallel residuals, depth recurrence, legal TTT, and the Scylla tokenizer work together in one training run
5. **CPU e2e test suite** — 10 test cases covering imports, hyperparameters, model creation, forward pass, code size, quantization+artifact, step time, quant MSE, scale timing, and weight distribution
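As a sketch of how such a mixed-precision split can work (illustrative only; `quantize_symmetric`, the `"mlp"` substring test, and the per-tensor scale are our assumptions, not the submission's actual code):

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric quantization of a weight tensor to a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1                  # 15 for INT5, 31 for INT6
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0                             # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                             # dequantize as q * scale

def quantize_model(named_weights: dict):
    """Hypothetical per-layer split: INT5 for MLP weights, INT6 for the rest."""
    return {name: quantize_symmetric(w, 5 if "mlp" in name else 6)
            for name, w in named_weights.items()}
```

The lower-precision INT5 grid is spent on the MLP layers, which in this kind of budget-constrained setup tend to hold most of the parameters.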
### Our Prior Contributions to the Competition

- **The Agora** — community compliance classification engine, live leaderboard, and regulatory tracker at [matotezitanka.github.io/parameter-golf](https://matotezitanka.github.io/parameter-golf). No other competitor built community infrastructure.
- **LeakyReLU slope sweep** — controlled 7-slope experiment (0.1–0.9) showing monotonic improvement; slope 0.9 beats 0.5 by 0.013 BPB. Posted on [issue #140](https://github.com/openai/parameter-golf/issues/140#issuecomment-4127322055).
- **Compliance analysis** — rule interpretation and technique legality mapping posted on [issue #140](https://github.com/openai/parameter-golf/issues/140) and [issue #1017](https://github.com/openai/parameter-golf/issues/1017).
- **PROTEUS submission series** — 7 PRs (#95, #368, #512, #568, #633, #769, #1274) documenting iterative improvement from 1.2037 to 1.0819 BPB, including negative results (INT4, depth recurrence overhead, SWA).
- **14 community contributions** across 4 issues (#140, #677, #942, #1017, #1175) plus the 7 PROTEUS PRs.
- **Community toolkit** — Docker image, RunPod template, CPU test harness.
### What's NOT Ours (full attribution)

| Component | Source | PR | Author |
|-----------|--------|-----|--------|
| Training base architecture | LeakyReLU² + Parallel Muon | [#549](https://github.com/openai/parameter-golf/pull/549) | @abaybektursun |
| GPTQ + XSA-all + BigramHash 3072 | AR Self-Gen GPTQ | [#1019](https://github.com/openai/parameter-golf/pull/1019) | @abaybektursun |
| Scylla tokenizer | Novel TokenMonster-derived tokenizer | [#1143](https://github.com/openai/parameter-golf/pull/1143) | @simon-marcus |
| Parallel residuals + depth recurrence | Separate attn/MLP lanes + layer 4-5 recurrence | [#1204](https://github.com/openai/parameter-golf/pull/1204) | @msisovic |
| Legal TTT framework | Score-first SGD with momentum, frozen early blocks | [#461](https://github.com/openai/parameter-golf/pull/461) | @Christopher-Lee-McClendon |

**Note on Scylla:** PR #1143 was closed by the author after byte-accounting errors were found (~4–6% BPB inflation from incorrect modifier-token byte counts). Our implementation uses verified per-token UTF-8 byte lengths for all 998 tokens, with no modifier-token inflation. See "Byte Accounting Verification" below.
## Byte Accounting Verification

Our Scylla byte counting uses three lookup tables built from the vocabulary:

- `base_bytes[i]` = `len(token_i.encode('utf-8'))` — verified for all 998 tokens
- `has_leading_space` — all False (TokenMonster has no space modifiers)
- `is_boundary_token` — all False (no BOS/EOS/PAD tracked)

BPB formula: `(nats / log(2)) × (token_count / byte_count)`, where `nats` is the mean cross-entropy per token.

This is immune to the PR #1143 failure mode. Five zero-byte tokens (empty strings) are correctly counted as 0 bytes. All 5 evaluation functions use identical byte-counting logic.
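As a sketch, the table construction and the BPB formula look like this (hypothetical `vocab` list and function names; the real pipeline builds the tables from the 998-token Scylla vocabulary):

```python
import math

def build_byte_table(vocab):
    """base_bytes[i] = UTF-8 byte length of token i; empty tokens count as 0."""
    return [len(tok.encode("utf-8")) for tok in vocab]

def bits_per_byte(mean_nats_per_token, token_ids, byte_table):
    """BPB = (nats / log 2) * (token_count / byte_count)."""
    token_count = len(token_ids)
    byte_count = sum(byte_table[i] for i in token_ids)
    return (mean_nats_per_token / math.log(2)) * (token_count / byte_count)
```

Because every token's byte length comes from `encode('utf-8')` directly, there is no separate modifier-token adjustment that could be miscounted.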
## Architecture

11L/512d/8H/4KV, MLP 3× LeakyReLU(0.5)², XSA last 4, Partial RoPE 16d, LN Scale, BigramHash, SmearGate, VE 128d (layers 9-10), EMA 0.997, QAT, Mixed INT5/INT6+LZMA, Muon optimizer, Parallel Residuals (from layer 7), Mini Depth Recurrence (layers 4-5, from step 3000), Legal Score-First TTT.

Scylla tokenizer (998 tokens, TokenMonster-derived).
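A minimal PyTorch sketch of the learnable lane merge used with the parallel residuals (the module and the scalar's name are our inventions; only `resid_mix_mlp` appears in the submission):

```python
import torch
import torch.nn as nn

class LaneMerge(nn.Module):
    """Merge parallel attention/MLP lanes into the residual stream:
    a learnable scalar for the attention lane, a per-dimension
    `resid_mix_mlp` vector routing the MLP lane."""
    def __init__(self, dim: int):
        super().__init__()
        self.resid_mix_attn = nn.Parameter(torch.tensor(1.0))  # learnable scalar
        self.resid_mix_mlp = nn.Parameter(torch.ones(dim))     # per-dimension routing

    def forward(self, resid, attn_out, mlp_out):
        # Both lanes read the same input; their outputs are merged here
        return resid + self.resid_mix_attn * attn_out + self.resid_mix_mlp * mlp_out
```

Initializing both mixes to 1 makes the merge start out as a plain sum of the two lanes, so training only has to learn deviations from that baseline.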
## Compliance

- [x] 8×H100 SXM training
- [x] 10-minute wallclock (600s)
- [x] Artifact ≤ 16 MB (prior identical-architecture runs: 15.0–15.8 MB; exact verification pending)
- [x] No n-gram cache at eval
- [x] No two-pass rescoring
- [x] Score-first TTT (tokens scored before weight update)
- [x] Autoregressive eval (causal)
- [x] 3-seed validation (42: 1.0808, 1337: 1.0829, 2024: 1.0821, mean: 1.0819, std: 0.00088)
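The score-first constraint can be illustrated with a toy loop (hypothetical `model` and `trainable_params`; not the submission's code). Each token's loss is recorded under the current weights before the optimizer consumes that same token, so no weight update ever sees a token before it is scored:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, trainable_params, tokens, lr=1e-3, momentum=0.9):
    """Legal TTT: score each token BEFORE updating on it (no lookahead).
    Early blocks stay frozen simply by excluding them from trainable_params."""
    opt = torch.optim.SGD(trainable_params, lr=lr, momentum=momentum)
    total_nats = 0.0
    for t in range(1, len(tokens)):
        logits = model(tokens[:t])[-1]                    # current weights
        loss = F.cross_entropy(logits.unsqueeze(0),
                               torch.tensor([tokens[t]]))
        total_nats += loss.item()                         # scored first...
        opt.zero_grad(); loss.backward(); opt.step()      # ...updated after
    return total_nats / (len(tokens) - 1)                 # mean nats/token
```

Two-pass rescoring would violate this ordering: it scores a token again after the weights have already been updated on it.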
## Known Limitation

These runs used `ACTIVATION_NEG_SLOPE=0.5`. Our [slope sweep](https://github.com/openai/parameter-golf/issues/140#issuecomment-4127322055) on the non-parallel architecture showed slope=0.9 beats 0.5 by ~0.013 BPB. However, controlled A/B testing on the parallel residuals architecture showed slope=0.9 is **0.0054 BPB worse** than 0.5 — the parallel lanes prefer more aggressive gating. Slope 0.5 is correct for this architecture.
## Platform

RunPod 8×H100 80GB SXM, PyTorch 2.11.0+cu128.

*Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.*
Lines changed: 30 additions & 0 deletions
{
  "submission": "PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT",
  "author": "MatoTeziTanka",
  "val_bpb": 1.08192346,
  "val_bpb_method": "int6_sliding_window_stride64",
  "seeds": {
    "42": {"val_bpb": 1.08075297, "val_loss": 1.93460240, "roundtrip_bpb": 1.10283785, "steps": 5884, "train_time_ms": 600122},
    "1337": {"val_bpb": 1.08288515, "val_loss": 1.93841912, "roundtrip_bpb": 1.10488987, "steps": 5905, "train_time_ms": 600043},
    "2024": {"val_bpb": 1.08213226, "val_loss": 1.93707140, "roundtrip_bpb": 1.10421302, "steps": 5894, "train_time_ms": 600038}
  },
  "mean_bpb": 1.08192346,
  "std_bpb": 0.00088,
  "tokenizer": "scylla (TokenMonster-derived, 998 tokens)",
  "architecture": "11L/512d/8H/4KV, ParallelResiduals(layer7+), MiniDepthRecurrence(layers4-5,step3000), LeakyReLU(0.5)\u00b2, XSA4, Muon, EMA+SWA, INT6+LZMA",
  "platform": "RunPod 8xH100 80GB SXM",
  "compliance": {
    "artifact_under_16mb": "TBD — verify",
    "training_under_600s": true,
    "no_ngram_cache": true,
    "no_two_pass": true,
    "score_first_ttt": true,
    "three_seeds": true
  },
  "attribution": {
    "base_architecture": "PR #549, #1019 by @abaybektursun",
    "scylla_tokenizer": "PR #1143 by @simon-marcus",
    "parallel_residuals": "PR #1204 by @msisovic",
    "legal_ttt": "PR #461 by @Christopher-Lee-McClendon"
  }
}
