# Record: SP8192 + BigramHash d=32 + Path A v3 Aggressive Passthrough Quantization — val_bpb 1.07882 (3-seed mean)

**val_bpb = 1.07882** (3-seed mean, std 0.000143) | **mean 15,993,825 B (15.99 MB)** | 8× H100 80GB SXM | Legal Score-First TTT

Beats the merged SOTA ([2026-04-09 SP8192 record by @bigbag](https://github.com/openai/parameter-golf/blob/main/records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/README.md), 3-seed mean 1.08100) by **0.00218 bpb / 0.00564 nats per token** on a 3-seed mean, clearing the 0.005-nat record threshold with one-sided **z = −3.00, p = 0.00136** (p < 0.01 required).

## 3-Seed Results (8× H100 80GB SXM, PyTorch 2.9.1+cu128, Legal Score-First TTT)

### Core (TTT) table

| Seed | Steps | Pre-TTT sliding bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact (B) |
|---:|---:|---:|---:|---:|---:|---:|
| 42 | 4393 | 1.08015 | **1.07887** | −0.00128 | 336.1 s | 15,991,203 |
| 314 | 4393 | 1.08024 | **1.07893** | −0.00131 | 335.5 s | 15,994,170 |
| 999 | 4403 | 1.07998 | **1.07866** | −0.00132 | 333.6 s | 15,996,103 |
| **mean** | | **1.08012** | **1.07882** | **−0.00130** | **335.1 s** | **15,993,825** |
| **std** | | | **0.000143** | | | |

### Diagnostics

| Seed | Post-EMA bpb | Quant roundtrip bpb | Sliding bpb | TTT val_loss (nats) | Code bytes | Total submission (B) | Train ms | Eval ms (q+sl+ttt) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 42 | 1.08584 | 1.09678 | 1.08015 | 2.78662485 | 18,097 | 15,991,203 | 588,110 | 480,408 |
| 314 | 1.08580 | 1.09679 | 1.08024 | 2.78678778 | 18,097 | 15,994,170 | 588,031 | 479,495 |
| 999 | 1.08561 | 1.09662 | 1.07998 | 2.78608265 | 18,097 | 15,996,103 | 588,029 | 477,724 |
| **mean** | **1.08575** | **1.09673** | **1.08012** | **2.78650** | — | **15,993,825** | **588,057** | **479,209** |

## Key Innovation: Path A v3 Aggressive Passthrough Quantization

Two complementary changes on top of the [2026-04-09 SP8192 stack](../2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/README.md):

### 1. `BIGRAM_DIM = 32`

BigramHashEmbedding dimension reduced from the common d=48/64 to **d=32**. The smaller bigram projection regularizes the hashed n-gram signal and frees ~262 KB of raw bigram parameters (only ~3 KB after compression, so the direct size win is modest, but it also shrinks `bigram.proj` for the Path A v3 int8 treatment). Pre-quant post-EMA bpb is preserved at ~1.0858, within noise of the d=48 baseline.
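
For concreteness, a minimal sketch of the hashed-bigram path at d=32; the module layout and the hash function are illustrative assumptions, not the submission's exact code:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Sketch: hash (prev, cur) token pairs into buckets, embed at a small
    dimension, then project up to the model width."""
    def __init__(self, n_buckets=16384, bigram_dim=32, model_dim=512):
        super().__init__()
        self.n_buckets = n_buckets
        self.embed = nn.Embedding(n_buckets, bigram_dim)          # int6 in the artifact
        self.proj = nn.Linear(bigram_dim, model_dim, bias=False)  # 512 x 32; per-row int8 under Path A v3

    def forward(self, idx):                        # idx: (B, T) token ids
        prev = torch.roll(idx, shifts=1, dims=1)
        prev[:, 0] = 0                             # assumed padding for the first position
        buckets = (prev * 1000003 + idx) % self.n_buckets  # illustrative hash, not the submission's
        return self.proj(self.embed(buckets))      # (B, T, model_dim), added to the token-embedding stream
```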

### 2. Path A v3 Aggressive Passthrough Quantization (primary contribution)

The canonical [PR #1394](https://github.com/openai/parameter-golf/pull/1394) / bigbag stack leaves the following tensors as **fp16 passthrough** (1 tensor per transformer block layer plus a few scalars), consuming ~40 KB in the compressed artifact:

- **Control tensors (per-tensor int8)**: `attn_scale`, `mlp_scale`, `resid_mix`, `skip_gates`, `skip_weights`. Each is a small 1-D array with a narrow dynamic range. Quantized to int8 with a single fp32 per-tensor scale — reconstruction error dominated by scale quantization is negligible (< 1e-4 relative on all tensors).
- **Small 2-D matrices (per-row int8)**: `bigram.proj` (512 × 32 = 16 K params), `attn_gate_proj`, `smear_gate.weight`. These are dense but small and excluded from Hessian-aware GPTQ by the `numel() <= 65536` threshold. Quantized to int8 with per-row fp16 scales (see the sketch after this list).
- **`gptq_mixed_quantize`** and **`dequantize_mixed`** in the submitted training script are modified to dispatch these categories before falling back to fp16 passthrough. Everything else (int6 attn/MLP matrices, int8 tok_emb, int6 bigram.embed) is unchanged.
- **LZMA self-extracting wrapper** over a python-minified source: 53,508 raw → 52,775 minified → 14,384 LZMA → **18,097 bytes** wrapped. (Same technique as @bigbag's record.)
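
A minimal sketch of the two new int8 modes; function names and the max-abs scaling rule are illustrative assumptions, and the submitted `gptq_mixed_quantize` / `dequantize_mixed` dispatch may differ in detail:

```python
import torch

def quantize_per_tensor_int8(t: torch.Tensor):
    """Control tensors: one fp32 scale for the whole (small, narrow-range) tensor."""
    scale = (t.abs().max().clamp(min=1e-8) / 127.0).float()
    q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def quantize_per_row_int8(w: torch.Tensor):
    """Small 2-D matrices (e.g. bigram.proj): one fp16 scale per output row."""
    scale = (w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0).half()
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Shared dequant: broadcast the per-tensor or per-row scale over the int8 codes."""
    return q.float() * scale.float()
```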

**Quantization quality cost:** measured to 5 d.p., the quantized roundtrip bpb is **unchanged** between baseline and Path A v3 (1.0968 in both). The Path A v3 modifications are effectively zero-cost in BPB while saving ~40 KB on the artifact.

**Net size effect:** the total submission averages 15,993,825 B across 3 seeds (6,175 B under the 16,000,000 B cap). Prior SP8192 runs without Path A v3 at the same `EMBED_BITS=8` configuration sit at ~16,065,000 B (~65 KB over the cap). Path A v3 is what makes int8 token embeddings legal for this architecture.

## Architecture

11L × 512d × 8H / 4KV, MLP 4×, LeakyReLU(0.5)² activation, Partial RoPE (16 / 64 dims), tied token embeddings, logit softcap = 30. Skip gates (sigmoid-gated U-Net connections). Depth recurrence: encoder `[0,1,2,3,4,5,3,4]`, decoder `[5,3,4,5,6,7,8,9,10]` (loops layers 3–5, activated at step ~1950 = 45% training). Parallel residuals from layer 7. **BigramHashEmbedding with 16,384 buckets × d=32**. AttnOutputGate (width 12, source=proj). SmearGate (width 12). SentencePiece-BPE 8192.
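
The depth-recurrence part can be read as applying the 11 physical blocks in a fixed index order. The sketch below is illustrative only (names are ours; the step-~1950 switch-on, skip gates, and parallel residuals are omitted):

```python
# Illustrative only: 11 physical blocks, reused according to the listed orders.
ENC_ORDER = [0, 1, 2, 3, 4, 5, 3, 4]          # encoder half (blocks 3-4 revisited)
DEC_ORDER = [5, 3, 4, 5, 6, 7, 8, 9, 10]      # decoder half (loops back through 3-5)

def run_recurrent(x, blocks, order):
    for i in order:
        x = blocks[i](x)                      # the same weights are reused when an index repeats
    return x
```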

## Training

MuonEq-R (row-normalized Muon, Newton-Schulz 5 steps) for matrices; AdamW for embeddings/scalars. Warmdown 72% of training; EMA decay 0.9965. QK-Gain init 5.0 (learnable per-head). Weight decay 0.085 / 0.095 (embed / matrix). ~4393–4403 steps in 588 s on 8× H100 SXM (`MAX_WALLCLOCK_SECONDS=600` minus 12 s GPTQ reserve).
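
Two of the listed schedule details, sketched under stated assumptions (warmdown taken as a linear decay over the final 72% of steps; the submission's exact shape may differ):

```python
import torch

def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.72) -> float:
    # Assumed shape: flat LR, then linear decay to 0 over the final 72% of training.
    warmdown_steps = int(total_steps * warmdown_frac)
    flat_steps = total_steps - warmdown_steps
    if step < flat_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)

@torch.no_grad()
def ema_update(ema_params, params, decay=0.9965):
    # Standard exponential moving average of the weights used for the final artifact.
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```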

## Quantization

Full-Hessian GPTQ with SDClip (`clip = k × std(row)`; a clipping sketch follows the list):

- **Matrices** (attn/MLP): int6, `matrix_clip_sigmas = 12.85` (@clarkkev PR #1394)
- **Token embeddings**: int8, `embed_clip_sigmas = 20.0`
- **bigram.embed**: int6 per-row simple scale
- **Path A v3 additions** (this PR): per-tensor int8 for control scalars, per-row int8 for small 2-D matrices (see Key Innovation section)
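
A minimal sketch of the SDClip step alone (per-row clipping to k·std, then symmetric uniform quantization). The real pipeline folds this into Hessian-aware GPTQ; the helper below is an assumption rather than the submitted code:

```python
import torch

def sdclip_quantize(w: torch.Tensor, bits: int, clip_sigmas: float):
    """Clip each row to +/- clip_sigmas * std(row), then quantize symmetrically
    at the given bit width with one scale per row."""
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6, 127 for int8
    clip = clip_sigmas * w.std(dim=1, keepdim=True)
    scale = clip.clamp(min=1e-8) / qmax
    q = torch.round(w.clamp(-clip, clip) / scale).clamp(-qmax, qmax)
    return q.to(torch.int8), scale              # int6 codes stored in an int8 container here

# e.g. sdclip_quantize(attn_weight, bits=6, clip_sigmas=12.85)
#      sdclip_quantize(tok_emb,     bits=8, clip_sigmas=20.0)
```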

Byte-shuffle + Brotli-11 on the quantized state-dict blob. Self-extracting LZMA wrapper on the minified source.
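
A minimal sketch of the self-extracting code wrapper; file names and stub layout are assumptions (the submitted wrapper was measured at 18,097 B after minification + LZMA):

```python
import lzma

# Compress the already-minified training script and emit a tiny stub that
# decompresses and exec()s it when run.
minified = open("train_gpt_minified.py", "rb").read()
payload = lzma.compress(minified, preset=9 | lzma.PRESET_EXTREME)

stub = (
    "import lzma\n"
    f"exec(compile(lzma.decompress({payload!r}), 'train_gpt.py', 'exec'))\n"
)
open("train_gpt.py", "w").write(stub)
```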

## Test-Time Training (Score-First, Legal)

Per [Issue #1017](https://github.com/openai/parameter-golf/issues/1017) / [PR #549](https://github.com/openai/parameter-golf/pull/549) / [PR #461](https://github.com/openai/parameter-golf/pull/461) precedent:

```python
import torch
import torch.nn.functional as F
# (Excerpt: model, optimizer, chunks, and batch/targets come from the surrounding eval script.)

for chunk_idx, chunk_windows in enumerate(chunks):
    # Phase 1: SCORE (under inference_mode, no parameter update)
    with torch.inference_mode():
        logits = model.forward_logits(batch)                                    # (B, T, V)
        nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="none")
        loss_sum += nll.sum()

    # Phase 2: TRAIN (only on the chunk that was just scored)
    if not is_last_chunk:
        for _ in range(ttt_epochs):              # 3 epochs
            for x, y in chunk_seqs:
                optimizer.zero_grad(set_to_none=True)
                loss = model(x, y)
                loss.backward()
                optimizer.step()                 # SGD, lr=0.005, momentum=0.9
```

1,238 chunks × 32,768 tokens × 3 epochs. Strict score-before-update ordering; no token is ever trained on before it is scored. Mean TTT time 335 s per seed (well within 600 s eval budget).

## Rule Compliance

Per [repo README](../../../README.md) and [Issue #1017](https://github.com/openai/parameter-golf/issues/1017):

- **Condition 1 — Causality** ✅ Strictly causal forward pass. Sliding-window eval never references future tokens for current-position scoring.
- **Condition 2 — Normalized distribution** ✅ Standard softmax over full 8,192 vocab. No n-gram cache, no logit biasing, no multi-pass rescoring.
- **Condition 3 — Score before update** ✅ Every TTT chunk is scored under `inference_mode()` before any parameter update. Gradient updates only use already-scored tokens.
- **Condition 4 — Single pass** ✅ Each val token is scored exactly once. No rescoring, no cache lookups.

Additional:
- **No SLOT** (standard or causal) — no eval-time delta optimization
- **No pre-quant TTT** on val data — model is quantized once; TTT adapts the quantized model at eval time only
- **No ETLB** (eval-time logit bias)
- **No n-gram cache** or tilt
- **Seed choice conventional** — matches @bigbag 2026-04-09 exactly (42, 314, 999); no seed brute-forcing
- **Artifact < 16,000,000 bytes** on all 3 seeds (margins: 8,797 / 5,830 / 3,897 B)
- **Training ≤ 600 s** on all 3 seeds (588,029–588,110 ms actual)
- **Eval ≤ 600 s** on all 3 seeds (quantized + sliding + TTT = 477,724–480,408 ms)

## Statistical Evidence

Three independent seeds on a canonical 128-shard sp8192 tokenization of the `willdepueoai/parameter-golf` fineweb export:

```
Seed 42: val_bpb = 1.07886574, val_loss = 2.78662485 nats/token, total_bytes = 15,991,203, train_time_ms = 588,110
Seed 314: val_bpb = 1.07892882, val_loss = 2.78678778 nats/token, total_bytes = 15,994,170, train_time_ms = 588,031
Seed 999: val_bpb = 1.07865582, val_loss = 2.78608265 nats/token, total_bytes = 15,996,103, train_time_ms = 588,029

Mean bpb = 1.07881679
Std bpb = 0.000143 (sample, n=3, n-1=2)
SEM bpb = 0.0000826
Mean val_loss = 2.78649843 nats/token
bpb / val_loss ratio = 0.387159 (fixed conversion between nats/token and bpb for this val set's byte count)

Merged SOTA (bigbag 2026-04-09 3-seed mean) = 1.08100 bpb
Observed delta = 0.00218 bpb = 0.00564 nats/token (> 0.005-nat threshold)
Threshold in bpb at our ratio = 0.001936 bpb
Mean bpb required to clear threshold = 1.079064
Our mean bpb = 1.078817
Margin past threshold = 0.000247 bpb = 0.000637 nats/token

One-sided z (lower tail) = (1.078817 − 1.079064) / 0.0000826 = −2.998
One-sided p-value = 0.00136
Required: p < 0.01 → CLEARED
```
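
The block above can be reproduced from the per-seed numbers with standard-library Python only:

```python
import math

bpbs = [1.07886574, 1.07892882, 1.07865582]
mean = sum(bpbs) / len(bpbs)
std = (sum((b - mean) ** 2 for b in bpbs) / (len(bpbs) - 1)) ** 0.5   # sample std, n-1
sem = std / math.sqrt(len(bpbs))

ratio = 1.07881679 / 2.78649843           # bpb per nat on this validation set
threshold_bpb = 1.08100 - 0.005 * ratio   # mean bpb needed to clear the 0.005-nat threshold

z = (mean - threshold_bpb) / sem
p = 0.5 * math.erfc(-z / math.sqrt(2))    # one-sided (lower-tail) normal p-value
print(f"mean={mean:.6f}  std={std:.6f}  z={z:.3f}  p={p:.5f}")
# -> mean=1.078817  std=0.000143  z=-2.998  p=0.00136
```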

## Environment

```
torch 2.9.1+cu128
CUDA 12.8
NVIDIA driver 575.57.08
brotli 1.2.0
sentencepiece 0.2.1
python-minifier (latest)
NVIDIA H100 80 GB HBM3 SXM × 8 with NVLink (18 links × 26.562 GB/s)
NCCL all-reduce 256 MB: ~424 GB/s bus bandwidth (near-peak NVLink4)
```

## Reproduction

```bash
# 1. Install deps
pip install --break-system-packages brotli python-minifier sentencepiece huggingface_hub

# 2. Clone competition repo + generate canonical sp8192 data
git clone https://github.com/openai/parameter-golf.git repo
cd repo

cat > data/tokenizer_specs_sp8192.json <<'EOF'
{"tokenizers":[{"name":"sp_bpe_8192","dataset_suffix":"sp8192","vocab_size":8192}]}
EOF

python3 data/download_hf_docs_and_tokenize.py \
    --repo-id willdepueoai/parameter-golf \
    --remote-root datasets \
    --output-root ./data \
    --tokenizer-config data/tokenizer_specs_sp8192.json \
    --skip-byte \
    --chunk-tokens 100000000 \
    --tokenizer-train-docs 1000000

# 3. Run 3 seeds
for SEED in 42 314 999; do
    SEED=$SEED DATA_DIR=./data/ RUN_ID=seed${SEED} \
    ITERATIONS=20000 MAX_WALLCLOCK_SECONDS=600 \
    TTT_ENABLED=1 SLIDING_WINDOW_ENABLED=1 VAL_LOSS_EVERY=4000 \
    BIGRAM_VOCAB_SIZE=16384 BIGRAM_DIM=32 \
    GATE_ATTN_OUT=1 GATE_WIDTH=12 GATE_ATTN_SRC=proj \
    SMEAR_GATE=1 SMEAR_GATE_WIDTH=12 \
    EMBED_BITS=8 EMBED_CLIP_SIGMAS=20.0 COMPRESSOR=brotli \
    torchrun --standalone --nproc_per_node=8 train_gpt.py \
        2>&1 | tee logs/train_seed${SEED}.log
done
```

The provided `train_gpt.py` is an 18,097-byte LZMA self-extracting wrapper. The equivalent full source (53,508 B) is `train_gpt_stacked_v2_fixed.py` for review.

## Credits

- **@clarkkev** — PR #1394: SP8192 base stack + GPTQ SDClip + int6 matrices / int8 embeddings + MuonEq-R + SP8192 tokenizer recipe.
- **@bigbag** — 2026-04-09 SP8192 record: 3-layer depth recurrence + parallel residuals + QK-Gain 5.25 + legal TTT on the SP8192 stack. (Direct ancestor of this submission.)
- **@dexhunter** — PR #1331, #1437: 3-layer depth recurrence; PR #1413: legal TTT on SP8192.
- **@Robby955** — PR #1412: parallel residuals on SP8192. **@msisovic** — PR #1204: parallel residuals concept.
- **@Christopher-Lee-McClendon** — PR #461: legal score-first TTT framework. **@abaybektursun** — PR #549: merged precedent for legal TTT.
- **@MarioPaerle** — PR #1667: AttnOutputGate used in this architecture.

## Our contribution

Two modifications on top of the @bigbag / @clarkkev SP8192 lineage:

1. **Path A v3 aggressive passthrough quantization** in `gptq_mixed_quantize` and `dequantize_mixed` — per-tensor int8 for five control-tensor families (`attn_scale`, `mlp_scale`, `resid_mix`, `skip_gates`, `skip_weights`) and per-row int8 for three small 2-D matrices (`bigram.proj`, `attn_gate_proj`, `smear_gate.weight`). Net effect: the full bigbag-style int8 token-embedding + int6 matrix recipe now fits ≤ 16 MB with ~6 KB margin, preserving the full TTT BPB of the baseline.
2. **BigramHashEmbedding `d = 32`** (vs common d=48 / d=64 in the lineage) — modest regularization + complementary size savings that free a few KB for Path A v3 to work with.
{
"author": "himanshudongre",
"github_id": "himanshudongre",
"name": "SP8192 + BigramHash d=32 + AttnOutputGate + SmearGate + Path A v3 Aggressive Passthrough Quantization + Legal Score-First TTT",
"date": "2026-04-18",
"track": "10min_16mb",
"val_bpb": 1.07882,
"val_bpb_std": 0.000143,
"seeds": [
42,
314,
999
],
"seed_results": {
"42": {
"val_bpb": 1.07886574,
"val_loss": 2.78662485,
"sliding_val_bpb": 1.08014601,
"quantized_val_bpb": 1.09678081,
"pre_quant_post_ema_val_bpb": 1.08584188,
"artifact_bytes": 15991203,
"train_time_ms": 588110,
"ttt_time_ms": 336109,
"sliding_time_ms": 120470,
"quantized_time_ms": 23829,
"eval_time_ms_total": 480408
},
"314": {
"val_bpb": 1.07892882,
"val_loss": 2.78678778,
"sliding_val_bpb": 1.08023616,
"quantized_val_bpb": 1.09679203,
"pre_quant_post_ema_val_bpb": 1.08579894,
"artifact_bytes": 15994170,
"train_time_ms": 588031,
"ttt_time_ms": 335468,
"sliding_time_ms": 119913,
"quantized_time_ms": 24114,
"eval_time_ms_total": 479495
},
"999": {
"val_bpb": 1.07865582,
"val_loss": 2.78608265,
"sliding_val_bpb": 1.07998003,
"quantized_val_bpb": 1.09662297,
"pre_quant_post_ema_val_bpb": 1.08561033,
"artifact_bytes": 15996103,
"train_time_ms": 588029,
"ttt_time_ms": 333575,
"sliding_time_ms": 120053,
"quantized_time_ms": 24096,
"eval_time_ms_total": 477724
}
},
"mean_bpb": 1.07881679,
"std_bpb": 0.00014293,
"sliding_bpb_mean": 1.08012073,
"quantized_bpb_mean": 1.09673194,
"pre_quant_post_ema_bpb_mean": 1.08575038,
"mean_val_loss_nats": 2.78649843,
"vs_merged_sota": {
"merged_sota_bpb": 1.081,
"delta_bpb": 0.002183,
"delta_nats_per_token": 0.005639,
"threshold_nats": 0.005,
"threshold_bpb_at_our_ratio": 0.001936,
"mean_bpb_required_to_clear_threshold": 1.079064,
"one_sided_z_statistic": -2.9982,
"one_sided_p_value": 0.001358,
"cleared_threshold_at_p_lt_0_01": true,
"merged_sota_source": "records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/ (by @bigbag)"
},
"tokenizer": "SentencePiece BPE 8192 (trained from 1M canonical fineweb docs)",
"architecture": "11L/512d/8H/4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE 16d, Depth-recurrence (loop layers 3-5 from frac=0.35), Parallel residuals (layer 7+), QK-Gain 5.0, Skip gates, AttnOutputGate (width 12), SmearGate (width 12), BigramHashEmbedding (16384 buckets x d=32), tied embeddings, logit softcap 30",
"platform": "RunPod 8xH100 80GB SXM, PyTorch 2.9.1+cu128, CUDA 12.8",
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP8192 + BigramHash d=32 + AttnOutputGate + SmearGate + Depth recurrence (L3-5) + Parallel residuals (L7+) + QK-Gain 5.0 + MuonEq-R + SDClip GPTQ + Path A v3 aggressive int8 passthrough (control tensors + small matrices) + Legal Score-First TTT + LZMA code pack + Brotli",
"compliance": {
"artifact_under_16mb": true,
"training_under_600s": true,
"eval_under_600s": true,
"no_slot": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": true,
"score_first_ttt": true,
"three_seeds": true
},
"attribution": {
"sp8192_gptq_embeds_sdclip_muoneqr_depth_recur": "@clarkkev (PR #1394)",
"three_layer_depth_recurrence": "@dexhunter (PR #1331, #1437)",
"parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
"qk_gain": "@clarkkev (PR #1394 default 4.0; raised to 5.0 as a tune on top)",
"legal_ttt_framework": "@Christopher-Lee-McClendon (PR #461), @abaybektursun (PR #549), @dexhunter (PR #1413)",
"attn_output_gate": "@MarioPaerle (PR #1667)",
"bigbag_base_sp8192_record": "@bigbag, records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/ (3-seed mean 1.08100)"
},
"our_contribution": "Two changes on top of @bigbag 2026-04-09's SP8192 stack: (1) BigramHashEmbedding dimension d=32 (vs 48); (2) Path A v3 aggressive passthrough quantization (int8 per-tensor for control scalars + int8 per-row for small 2-D matrices that were fp16 passthrough) + LZMA self-extracting code wrapper. Net artifact savings ~40 KB + 35 KB, enabling int8 token embeddings to fit under 16 MB with ~6 KB margin while preserving the full BPB of the baseline recipe. 3-seed mean 1.07882 (std 0.000143) clears the 0.005-nat threshold at p=0.0018."
}
