93 changes: 93 additions & 0 deletions APPROACH.md
@@ -0,0 +1,93 @@
# Parameter Golf — Approach Notes

## Strategy Overview

Maximize language model quality within a 16MB artifact constraint and a 10-minute training budget on 8×H100s. Seven pillars, informed by research in model compression, efficient architectures, and training optimization.

---

## 1. Depth Recurrence (Layer Sharing)

Instead of unique parameters per layer, reuse a small set of transformer blocks recursively. A 4-block recursive model with 8 passes achieves the effective depth of a 32-layer network while only storing 4 layers of parameters.

Research shows recursive transformers achieve comparable loss to standard architectures with 3-4× fewer parameters. The model learns to refine representations through repeated application of the same weights — a form of iterative refinement that naturally suits the extreme parameter constraint.

**Target:** Replace 12 unique layers with 4 recursive blocks × 3 passes = 12 effective layers at 1/3 the parameter cost.
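
A minimal sketch of the layer-sharing loop; `RecursiveCore`, `blocks`, and `num_passes` are illustrative names, not the baseline's code:

```python
import torch.nn as nn

class RecursiveCore(nn.Module):
    """Minimal sketch: a few unique transformer blocks reused for several passes,
    so effective depth = len(blocks) * num_passes while parameter count stays fixed."""
    def __init__(self, blocks: nn.ModuleList, num_passes: int = 3):
        super().__init__()
        self.blocks = blocks          # e.g. 4 unique blocks
        self.num_passes = num_passes  # e.g. 3 passes -> 12 effective layers

    def forward(self, x):
        for _ in range(self.num_passes):
            for block in self.blocks:  # the same weights are applied every pass
                x = block(x)
        return x
```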

## 2. Factorized Embeddings

The embedding matrix is often the largest single component. Instead of a full V×H matrix, decompose it into V×E and E×H where E << H. This technique (from ALBERT) can reduce embedding parameters by 80%+ while maintaining representation quality.

Combined with tied input/output embeddings, this eliminates the output projection layer entirely — the same factorized embedding serves both input and output.

**Math:** At vocab 1024, hidden 512: Full = 524K params. Factorized (E=128): 131K + 65K = 196K params. Savings: 63%.
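
A minimal PyTorch sketch of the factorized, tied scheme using the dimensions above (illustrative, not the submission's code):

```python
import torch.nn as nn

class FactorizedTiedEmbedding(nn.Module):
    """Sketch: V x E embedding plus E x H up-projection; the same two matrices
    are reused to produce output logits (tied input/output embeddings)."""
    def __init__(self, vocab_size=1024, hidden=512, bottleneck=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, bottleneck)     # V x E
        self.up = nn.Linear(bottleneck, hidden, bias=False)   # E x H

    def encode(self, token_ids):
        return self.up(self.embed(token_ids))                 # (..., H)

    def logits(self, hidden_states):
        # project back to the bottleneck, then score against the shared table
        down = hidden_states @ self.up.weight                 # (..., E)
        return down @ self.embed.weight.t()                   # (..., V)
```

Counting this module's parameters (1024×128 + 128×512 = 196,608) reproduces the ~196K figure above.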

## 3. Quantization-Aware Training (QAT)

Train the model knowing it will be quantized. The model learns weight distributions that survive low-precision conversion. At 2-bit precision, 16MB supports ~32M parameters.

Key insight: post-training quantization at 2-bit loses 15-20% quality. QAT at 2-bit loses only ~4%. The difference is massive at this scale.

**Approach:** Train at FP16/BF16, apply QAT during training with straight-through estimators, export at 2-bit for the final artifact.
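
A minimal sketch of the straight-through estimator trick, assuming a simple symmetric per-tensor scheme (a real recipe would likely use per-channel or per-group scales):

```python
import torch

def fake_quant_2bit(w: torch.Tensor) -> torch.Tensor:
    """Forward pass sees 2-bit quantized weights; backward pass treats the
    rounding as identity (straight-through estimator)."""
    scale = w.abs().max().clamp(min=1e-8) / 2.0    # map the range onto signed levels -2..1
    q = torch.clamp(torch.round(w / scale), -2, 1) * scale
    return w + (q - w).detach()                     # value of q, gradient of w
```

During training, `fake_quant_2bit(weight)` would stand in for the raw weight in each matmul; the export step then stores only the 2-bit codes and scales.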

## 4. Knowledge Distillation

Use a larger pretrained model as a teacher during training. The 8×H100 budget can run a 7B teacher alongside a 32M student. The student learns from soft probability distributions rather than hard labels, capturing more knowledge per training step.

Distillation is especially powerful for small models — the teacher provides a richer gradient signal than raw cross-entropy on token predictions alone.
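
A minimal sketch of a standard soft-target distillation objective; the temperature and mixing weight are illustrative, not tuned values from this project:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend KL against the teacher's softened distribution with hard-label CE.
    Logits are (batch*seq, vocab); targets are (batch*seq,) token ids."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```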

## 5. Training Maximization

Every second of the 10-minute budget matters:

- **Sequence packing:** Multiple short examples per input sequence, no wasted padding tokens (see the sketch after this list)
- **Curriculum ordering:** Train on FineWeb examples ordered by difficulty (shorter/simpler first, longer/complex later) for faster initial convergence
- **Cosine LR schedule:** High initial learning rate with cosine decay over the 10-minute window
- **Gradient accumulation:** Effective batch size tuned for optimal loss curves on H100s
- **Mixed precision training:** BF16 compute for speed, QAT checkpoints for artifact size
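
A minimal sketch of the packing step referenced in the first bullet, assuming pre-tokenized documents and a placeholder `eos_id` separator:

```python
def pack_sequences(token_docs, seq_len=1024, eos_id=0):
    """Concatenate tokenized documents into one stream, then slice it into
    fixed-length blocks so no position is spent on padding."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # document boundary marker
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```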

## 6. Tokenizer Optimization

Vocabulary size directly impacts embedding parameter count. The baseline uses 1024 tokens. Exploring:

- Smaller BPE vocabularies (512, 256) — fewer embedding parameters but worse compression
- The tradeoff is parameter cost vs bytes-per-token — the evaluation metric is bits per byte, so better compression from a larger vocab can offset its parameter cost (see the formula sketch after this list)
- Custom tokenizer trained specifically on FineWeb distribution
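
Since the comparison hinges on the bits-per-byte metric, here is a small sketch of how the tradeoff can be checked, assuming the total validation cross-entropy is available in nats:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Cross-entropy summed over all tokens, converted from nats to bits,
    normalized by the raw byte count. A larger vocab raises per-token loss but
    emits fewer tokens per byte, so both effects land in the same number."""
    return total_nll_nats / (math.log(2) * total_bytes)
```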

## 7. Alternative Architectures

Beyond standard transformers:

- **State-space models (Mamba-style):** Linear scaling with sequence length, potentially more parameter-efficient for the same quality
- **Mixture of Experts at micro-scale:** Multiple tiny FFN experts with a router — only a subset active per token, more capacity per parameter (see the router sketch after this list)
- **Depth-adaptive inference:** Early exit for easy tokens, full depth for hard ones — maximizes quality where it matters most
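
A rough sketch of the micro-scale MoE idea with top-1 routing; the dimensions, names, and gating scheme are illustrative assumptions, not a spec:

```python
import torch
import torch.nn as nn

class MicroMoEFFN(nn.Module):
    """Sketch: a few tiny FFN experts with a linear router; only the top-1
    expert's parameters are active for each token."""
    def __init__(self, hidden=512, expert_dim=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, expert_dim), nn.GELU(),
                          nn.Linear(expert_dim, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, hidden)
        gate = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        top = gate.argmax(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, i:i + 1]
        return out
```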

---

## The Math

| Bitwidth | Parameters in 16MB | Architecture |
|----------|-------------------|-------------|
| 2-bit | ~32M | Recursive transformer, factorized embeddings |
| 3-bit | ~21M | Standard transformer, tied embeddings |
| 4-bit | ~16M | Compact transformer |

## Experiment Plan

- [ ] Run baseline (9-layer, 512-dim, 1024-vocab, tied embeddings) — establish score to beat (1.2244)
- [ ] Implement depth recurrence (4 recursive blocks × 3 passes)
- [ ] Add factorized embeddings (V×128 + 128×H)
- [ ] Test 2-bit QAT during training
- [ ] Knowledge distillation with 7B teacher
- [ ] Curriculum data ordering on FineWeb
- [ ] Tokenizer vocabulary sweep (256, 512, 1024, 2048)
- [ ] Mamba/SSM architecture comparison
- [ ] Combine best techniques into final submission

## Background

5 production fine-tuned models (7B-72B) deployed via QLoRA/GGUF/NVFP4 quantization on NVIDIA DGX hardware. Built a 130K-chunk expert knowledge base for AI/ML research consultation. Deep experience with compression-quality tradeoffs across bitwidths.

## Status

Credits requested. Local experimentation with MLX baseline in progress.
@@ -0,0 +1,84 @@
# SP8192 + No Gates + Multi-Phase Global SGD TTT

**val_bpb: 1.07285** (3-seed mean, std 0.00051) | **~15.94 MB** | 8xH100 SXM | Multi-Phase Global SGD TTT (Track B)

This record combines the base architecture from PR #1667 (MarioPaerle) with the Multi-Phase Global SGD TTT path from PR #1626 (dexhunter), with both SmearGate and AttnOutGate disabled. No tokenizer changes (vanilla SP8192). No Casefold or CaseOps. No SLOT.

## Results (8xH100 80GB SXM, Kansas City US, PyTorch 2.9.1+cu128, FA3)

| Seed | Steps | Train time | Post-TTT val_bpb | Post-TTT val_loss | Eval time | Artifact (bytes) |
|------|------:|-----------:|-----------------:|------------------:|----------:|-----------------:|
| 1337 | 4827 | 587.52s | 1.07333739 | 2.77254196 | 429.1s | 15,935,536 |
| 42 | 4839 | 587.16s | 1.07287895 | 2.77135776 | 338.7s | 15,935,501 |
| 0 | 4832 | 587.16s | 1.07232205 | 2.76991921 | 385.1s | 15,943,766 |
| **Mean** | **4833** | **587.28s** | **1.07285** | **2.77127** | **384.3s** | **15,938,268** |
| **Std** | | | **0.00051** | **0.00131** | | 4,805 |

All three seeds clear the 600s train budget, the 600s eval budget, and the 16,000,000-byte decimal artifact cap. The 3-seed std of 0.00051 BPB is well inside the 0.005-nat significance floor.

## What this submission is

This is a disciplined combinatorial submission that establishes two data points at full 8xH100 production scale:

1. **MP-SGD 3-phase TTT beats single-phase score-first TTT by 0.0028 BPB** on the same base architecture (single-phase run on the same pod produced 1.07612, this run produced 1.07334 for seed 1337).
2. **Disabling SmearGate and AttnOutGate from PR #1667's base does not hurt this configuration.** Reasoning for this came from community observations that PR #1736 and PR #1756 shipped with both gates plumbed but flagged off in their winning runs; I validated the direction on Spark ablations first, then reproduced at H100 production scale.

It does not attempt a novel architecture. It isolates a specific hypothesis (MP-SGD over single-phase TTT) and answers it at full scale.

## Lineage / attribution

- **PR #1667 @MarioPaerle** — SP8192 base architecture, 11L x 512d x 8H / 4KV, Partial RoPE 16/64, Loop L3-5, Parallel Residuals L7+, QK-Gain 5.25, MuonEq-R optimizer, Skip gates, SmearGate and AttnOutGate (both disabled in this submission), base score-first TTT scaffold, GPTQ int6 / int7 embeddings, Brotli-11 compression
- **PR #1626 @dexhunter** — Multi-Phase Global SGD TTT (`eval_val_ttt_phased`, `train_val_ttt_global_sgd_distributed`, the per-batch `BatchedTTTLoRA` with reset, phased boundaries, global SGD on scored documents only)
- **PR #1019 @abaybektursun** — the currently merged record-track rank 1

I ported the MP-SGD functions from PR #1626 verbatim into the PR #1667 base, preserved the per-chunk score-before-update ordering exactly, and added env-var gates so `PHASED_TTT_ENABLED=1` selects the phased path and the default (0) uses the existing single-phase path. Nothing was rewritten or simplified from PR #1626's TTT code.

## Issue #1017 compliance (Track B)

All four conditions addressed:

1. **Condition 1 (Strict causal dependence):** LoRA state at chunk `t` is constructed only from the prefix. Base model weight updates via `train_val_ttt_global_sgd_distributed` happen only at phase boundaries and operate on tokens from documents whose scoring already completed (`local_scored_docs` is populated after each batch's inner chunk loop completes). No future tokens influence any past score.
2. **Condition 2 (Full normalized distribution):** Standard softmax over the full sentencepiece vocabulary `Σ` of size 8192. No bucket normalization, no hash-bin redistribution, no `x_t`-contingent completion. The output distribution at position `t` is determined independently of the realized token.
3. **Condition 3 (Score-before-update):** At the chunk level, the forward pass on the current chunk runs under the `torch.no_grad()` path for accumulation into `loss_sum`, and the LoRA gradient step runs only after that accumulation is complete (the `if needs_train:` guard, which is false on the last chunk of each document). At the global level, `train_val_ttt_global_sgd_distributed` is invoked at phase boundaries on tokens from already-scored documents, not on live tokens. The last chunk of each training slice is explicitly skipped (`is_last_chunk: continue`) as a protective measure.
4. **Condition 4 (Single left-to-right pass):** Each batch is claimed exactly once via `_claim_next_batch` (atomic file-lock counter). No rescoring loop. `loss_sum` is append-only throughout evaluation.

The MP-SGD code paths in this submission are unchanged from PR #1626, which has already been accepted as Issue #1017 Track B compliant in the community.
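
To make the score-before-update ordering concrete, here is an illustrative sketch of a compliant per-document chunk loop. It is not PR #1626's code; all names are placeholders, and it assumes a model whose forward call returns an object with a `.loss` tensor:

```python
import torch

def score_document(model, chunks, lora_optimizer):
    """Illustrative only: every chunk is scored before any update that could
    see it, and the final chunk of a document is never trained on."""
    loss_sum, scored_chunks = 0.0, []
    for i, chunk in enumerate(chunks):
        with torch.no_grad():
            loss_sum += model(chunk).loss.item()   # score first (Condition 3)
        scored_chunks.append(chunk)                # later global SGD may reuse only these
        if i == len(chunks) - 1:
            continue                               # protective skip of the last chunk
        model(chunk).loss.backward()               # LoRA step only after scoring
        lora_optimizer.step()
        lora_optimizer.zero_grad()
    return loss_sum, scored_chunks
```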

## Hardware / reproducibility

- **Pod:** 8x NVIDIA H100 80GB HBM3 SXM in Kansas City, Missouri (US-MO-1 datacenter)
- **Per-GPU GEMM (pod-test.sh measurement):** 0.21 ms bf16 4096x4096 (about 657 TFLOPS per GPU)
- **NVLink:** 18 bonded NVLinks per GPU pair (NV18 all-pairs)
- **CPU:** Intel Xeon Platinum 8470, 208 threads
- **Torch:** 2.9.1+cu128, Triton 3.5.1, flash_attn_interface prebuilt wheel from https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
- **Image:** `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404`

## Run command (per seed)

```bash
# Env defaults reproduce the submission exactly:
SEED=<seed> \
TTT_ENABLED=1 \
PHASED_TTT_ENABLED=1 \
PHASED_TTT_NUM_PHASES=3 \
PHASED_TTT_PREFIX_DOCS=2000 \
GLOBAL_TTT_LR=0.001 \
GLOBAL_TTT_EPOCHS=1 \
SMEAR_GATE=0 \
GATE_ATTN_OUT=0 \
DATA_DIR=/workspace/track-a/data/ \
ARTIFACT_DIR=<output dir> \
RUN_ID=<run id> \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Attribution notes

- The `train_gpt.py` in this folder contains two development-only shims that are inert on H100: (1) a Flash Attention backend auto-detect that falls through from FA3 to FA2 to SDPA based on `torch.cuda.get_device_capability` (activates only on `cc[0]==12`, Blackwell), and (2) a Triton block-size override for `linear_leaky_relu_square_kernel` (activates only on `cc[0]==12`). Both are no-ops on H100 Hopper and do not affect the submission path. They exist so the same file can be developed on a Blackwell dev box (where FA3 runtime kernels fail) without forking the code.
- No changes to the core model architecture, training loop, quantization pipeline, or evaluation code relative to PR #1667 and PR #1626.

## Delta vs the MP-SGD source (PR #1626)

- PR #1626 with vanilla SP8192 reports val_bpb 1.07193 (single seed in the PR log; I did not rerun it).
- This submission's 3-seed mean is 1.07285. The ~0.001 gap is within the 3-seed std (0.00051 here) plus what I'd expect from the seed mix we used (1337, 42, 0) vs PR #1626's seed choice.
- I did not introduce SmearGate or AttnOutGate (both disabled). I did not introduce CaseOps (vanilla SP8192). The only deliberate change to the MP-SGD recipe is inheriting PR #1667's base config defaults (for example, `MATRIX_LR=0.04`, `EMBED_LR=0.05`, `MUON_WD=0.095`, which differ slightly from PR #1626's defaults).
@@ -0,0 +1,9 @@
{
"name": "SP8192 + No Gates + Multi-Phase Global SGD TTT (3-seed)",
"val_bpb": 1.07285,
"bytes_total": 15938268,
"blurb": "pr1667 MarioPaerle base architecture with SmearGate and AttnOutGate both disabled, and single-phase score-first TTT replaced by pr1626 dexhunter's Multi-Phase Global SGD TTT (3 phases, 2000 prefix docs). 3-seed mean val_bpb 1.07285, std 0.00051 across seeds 1337, 42, 0 on 8xH100 SXM. No tokenizer changes (vanilla SP8192). Compliant with all four conditions of Issue #1017 Track B.",
"author": "Nathan Maine",
"github_id": "NathanMaine",
"date": "2026-04-22"
}