93 changes: 93 additions & 0 deletions APPROACH.md
@@ -0,0 +1,93 @@
# Parameter Golf — Approach Notes

## Strategy Overview

Maximize language model quality within a 16MB artifact constraint and a 10-minute training budget on 8×H100s. Seven pillars, informed by research in model compression, efficient architectures, and training optimization.

---

## 1. Depth Recurrence (Layer Sharing)

Instead of unique parameters per layer, reuse a small set of transformer blocks recursively. A 4-block recursive model with 8 passes achieves the effective depth of a 32-layer network while only storing 4 layers of parameters.

Research shows recursive transformers achieve comparable loss to standard architectures with 3-4× fewer parameters. The model learns to refine representations through repeated application of the same weights — a form of iterative refinement that naturally suits the extreme parameter constraint.

**Target:** Replace 12 unique layers with 4 recursive blocks × 3 passes = 12 effective layers at 1/3 the parameter cost.
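
A minimal sketch of the layer-sharing loop; `RecursiveCore`, `blocks`, and `num_passes` are illustrative names, not the baseline's code:

```python
import torch.nn as nn

class RecursiveCore(nn.Module):
    """Minimal sketch: a few unique transformer blocks reused for several passes,
    so effective depth = len(blocks) * num_passes while parameter count stays fixed."""
    def __init__(self, blocks: nn.ModuleList, num_passes: int = 3):
        super().__init__()
        self.blocks = blocks          # e.g. 4 unique blocks
        self.num_passes = num_passes  # e.g. 3 passes -> 12 effective layers

    def forward(self, x):
        for _ in range(self.num_passes):
            for block in self.blocks:  # the same weights are applied every pass
                x = block(x)
        return x
```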

## 2. Factorized Embeddings

The embedding matrix is often the largest single component. Instead of a full V×H matrix, decompose it into V×E and E×H where E << H. This technique (from ALBERT) can reduce embedding parameters by 80%+ while maintaining representation quality.

Combined with tied input/output embeddings, this eliminates the output projection layer entirely — the same factorized embedding serves both input and output.

**Math:** At vocab 1024, hidden 512: Full = 524K params. Factorized (E=128): 131K + 65K = 196K params. Savings: 63%.
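
A minimal PyTorch sketch of the factorized, tied scheme using the dimensions above (illustrative, not the submission's code):

```python
import torch.nn as nn

class FactorizedTiedEmbedding(nn.Module):
    """Sketch: V x E embedding plus E x H up-projection; the same two matrices
    are reused to produce output logits (tied input/output embeddings)."""
    def __init__(self, vocab_size=1024, hidden=512, bottleneck=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, bottleneck)     # V x E
        self.up = nn.Linear(bottleneck, hidden, bias=False)   # E x H

    def encode(self, token_ids):
        return self.up(self.embed(token_ids))                 # (..., H)

    def logits(self, hidden_states):
        # project back to the bottleneck, then score against the shared table
        down = hidden_states @ self.up.weight                 # (..., E)
        return down @ self.embed.weight.t()                   # (..., V)
```

Counting this module's parameters (1024×128 + 128×512 = 196,608) reproduces the ~196K figure above.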

## 3. Quantization-Aware Training (QAT)

Train the model knowing it will be quantized. The model learns weight distributions that survive low-precision conversion. At 2-bit precision, 16MB supports ~32M parameters.

Key insight: post-training quantization at 2-bit loses 15-20% quality. QAT at 2-bit loses only ~4%. The difference is massive at this scale.

**Approach:** Train at FP16/BF16, apply QAT during training with straight-through estimators, export at 2-bit for the final artifact.
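
A minimal sketch of the straight-through estimator trick, assuming a simple symmetric per-tensor scheme (a real recipe would likely use per-channel or per-group scales):

```python
import torch

def fake_quant_2bit(w: torch.Tensor) -> torch.Tensor:
    """Forward pass sees 2-bit quantized weights; backward pass treats the
    rounding as identity (straight-through estimator)."""
    scale = w.abs().max().clamp(min=1e-8) / 2.0    # map the range onto signed levels -2..1
    q = torch.clamp(torch.round(w / scale), -2, 1) * scale
    return w + (q - w).detach()                     # value of q, gradient of w
```

During training, `fake_quant_2bit(weight)` would stand in for the raw weight in each matmul; the export step then stores only the 2-bit codes and scales.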

## 4. Knowledge Distillation

Use a larger pretrained model as a teacher during training. The 8×H100 budget can run a 7B teacher alongside a 32M student. The student learns from soft probability distributions rather than hard labels, capturing more knowledge per training step.

Distillation is especially powerful for small models — the teacher provides a richer gradient signal than raw cross-entropy on token predictions alone.
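
A minimal sketch of a standard soft-target distillation objective; the temperature and mixing weight are illustrative, not tuned values from this project:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend KL against the teacher's softened distribution with hard-label CE.
    Logits are (batch*seq, vocab); targets are (batch*seq,) token ids."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```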

## 5. Training Maximization

Every second of the 10-minute budget matters:

- **Sequence packing:** Multiple short examples per input sequence, no wasted padding tokens (see the sketch after this list)
- **Curriculum ordering:** Train on FineWeb examples ordered by difficulty (shorter/simpler first, longer/complex later) for faster initial convergence
- **Cosine LR schedule:** High initial learning rate with cosine decay over the 10-minute window
- **Gradient accumulation:** Effective batch size tuned for optimal loss curves on H100s
- **Mixed precision training:** BF16 compute for speed, QAT checkpoints for artifact size
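
A minimal sketch of the packing step referenced in the first bullet, assuming pre-tokenized documents and a placeholder `eos_id` separator:

```python
def pack_sequences(token_docs, seq_len=1024, eos_id=0):
    """Concatenate tokenized documents into one stream, then slice it into
    fixed-length blocks so no position is spent on padding."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # document boundary marker
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```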

## 6. Tokenizer Optimization

Vocabulary size directly impacts embedding parameter count. The baseline uses 1024 tokens. Exploring:

- Smaller BPE vocabularies (512, 256) — fewer embedding parameters but worse compression
- The tradeoff is parameter cost vs bytes-per-token — the evaluation metric is bits per byte, so better compression from a larger vocab can offset its parameter cost (see the formula sketch after this list)
- Custom tokenizer trained specifically on FineWeb distribution
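
Since the comparison hinges on the bits-per-byte metric, here is a small sketch of how the tradeoff can be checked, assuming the total validation cross-entropy is available in nats:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Cross-entropy summed over all tokens, converted from nats to bits,
    normalized by the raw byte count. A larger vocab raises per-token loss but
    emits fewer tokens per byte, so both effects land in the same number."""
    return total_nll_nats / (math.log(2) * total_bytes)
```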

## 7. Alternative Architectures

Beyond standard transformers:

- **State-space models (Mamba-style):** Linear scaling with sequence length, potentially more parameter-efficient for the same quality
- **Mixture of Experts at micro-scale:** Multiple tiny FFN experts with a router — only a subset active per token, more capacity per parameter (see the router sketch after this list)
- **Depth-adaptive inference:** Early exit for easy tokens, full depth for hard ones — maximizes quality where it matters most
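
A rough sketch of the micro-scale MoE idea with top-1 routing; the dimensions, names, and gating scheme are illustrative assumptions, not a spec:

```python
import torch
import torch.nn as nn

class MicroMoEFFN(nn.Module):
    """Sketch: a few tiny FFN experts with a linear router; only the top-1
    expert's parameters are active for each token."""
    def __init__(self, hidden=512, expert_dim=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, expert_dim), nn.GELU(),
                          nn.Linear(expert_dim, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, hidden)
        gate = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        top = gate.argmax(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, i:i + 1]
        return out
```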

---

## The Math

| Bitwidth | Parameters in 16MB | Architecture |
|----------|-------------------|-------------|
| 2-bit | ~32M | Recursive transformer, factorized embeddings |
| 3-bit | ~21M | Standard transformer, tied embeddings |
| 4-bit | ~16M | Compact transformer |

## Experiment Plan

- [ ] Run baseline (9-layer, 512-dim, 1024-vocab, tied embeddings) — establish score to beat (1.2244)
- [ ] Implement depth recurrence (4 recursive blocks × 3 passes)
- [ ] Add factorized embeddings (V×128 + 128×H)
- [ ] Test 2-bit QAT during training
- [ ] Knowledge distillation with 7B teacher
- [ ] Curriculum data ordering on FineWeb
- [ ] Tokenizer vocabulary sweep (256, 512, 1024, 2048)
- [ ] Mamba/SSM architecture comparison
- [ ] Combine best techniques into final submission

## Background

5 production fine-tuned models (7B-72B) deployed via QLoRA/GGUF/NVFP4 quantization on NVIDIA DGX hardware. Built a 130K-chunk expert knowledge base for AI/ML research consultation. Deep experience with compression-quality tradeoffs across bitwidths.

## Status

Credits requested. Local experimentation with MLX baseline in progress.
@@ -0,0 +1,84 @@
# SP8192 + No Gates + Multi-Phase Global SGD TTT

**val_bpb: 1.07285** (3-seed mean, std 0.00051) | **~15.94 MB** | 8xH100 SXM | Multi-Phase Global SGD TTT (Track B)

This record combines the base architecture from PR #1667 (MarioPaerle) with the Multi-Phase Global SGD TTT path from PR #1626 (dexhunter), with both SmearGate and AttnOutGate disabled. No tokenizer changes (vanilla SP8192). No Casefold or CaseOps. No SLOT.

## Results (8xH100 80GB SXM, Kansas City US, PyTorch 2.9.1+cu128, FA3)

| Seed | Steps | Train time | Post-TTT val_bpb | Post-TTT val_loss | Eval time | Artifact (bytes) |
|------|------:|-----------:|-----------------:|------------------:|----------:|-----------------:|
| 1337 | 4827 | 587.52s | 1.07333739 | 2.77254196 | 429.1s | 15,935,536 |
| 42 | 4839 | 587.16s | 1.07287895 | 2.77135776 | 338.7s | 15,935,501 |
| 0 | 4832 | 587.16s | 1.07232205 | 2.76991921 | 385.1s | 15,943,766 |
| **Mean** | **4833** | **587.28s** | **1.07285** | **2.77127** | **384.3s** | **15,938,268** |
| **Std** | | | **0.00051** | **0.00131** | | 4,805 |

All three seeds clear the 600s train budget, the 600s eval budget, and the 16,000,000-byte decimal artifact cap. The 3-seed std of 0.00051 BPB is well inside the 0.005-nat significance floor.

## What this submission is

This is a disciplined combinatorial submission that establishes two data points at full 8xH100 production scale:

1. **MP-SGD 3-phase TTT beats single-phase score-first TTT by 0.0028 BPB** on the same base architecture (single-phase run on the same pod produced 1.07612, this run produced 1.07334 for seed 1337).
2. **Disabling SmearGate and AttnOutGate from PR #1667's base does not hurt this configuration.** Reasoning for this came from community observations that PR #1736 and PR #1756 shipped with both gates plumbed but flagged off in their winning runs; I validated the direction on Spark ablations first, then reproduced at H100 production scale.

It does not attempt a novel architecture. It isolates a specific hypothesis (MP-SGD over single-phase TTT) and answers it at full scale.

## Lineage / attribution

- **PR #1667 @MarioPaerle** — SP8192 base architecture, 11L x 512d x 8H / 4KV, Partial RoPE 16/64, Loop L3-5, Parallel Residuals L7+, QK-Gain 5.25, MuonEq-R optimizer, Skip gates, SmearGate and AttnOutGate (both disabled in this submission), base score-first TTT scaffold, GPTQ int6 / int7 embeddings, Brotli-11 compression
- **PR #1626 @dexhunter** — Multi-Phase Global SGD TTT (`eval_val_ttt_phased`, `train_val_ttt_global_sgd_distributed`, the per-batch `BatchedTTTLoRA` with reset, phased boundaries, global SGD on scored documents only)
- **PR #1019 @abaybektursun** — the currently merged record-track rank 1

I ported the MP-SGD functions from PR #1626 verbatim into the PR #1667 base, preserved the per-chunk score-before-update ordering exactly, and added env-var gates so `PHASED_TTT_ENABLED=1` selects the phased path and the default (0) uses the existing single-phase path. Nothing was rewritten or simplified from PR #1626's TTT code.

## Issue #1017 compliance (Track B)

All four conditions addressed:

1. **Condition 1 (Strict causal dependence):** LoRA state at chunk `t` is constructed only from the prefix. Base model weight updates via `train_val_ttt_global_sgd_distributed` happen only at phase boundaries and operate on tokens from documents whose scoring already completed (`local_scored_docs` is populated after each batch's inner chunk loop completes). No future tokens influence any past score.
2. **Condition 2 (Full normalized distribution):** Standard softmax over the full sentencepiece vocabulary `Σ` of size 8192. No bucket normalization, no hash-bin redistribution, no `x_t`-contingent completion. The output distribution at position `t` is determined independently of the realized token.
3. **Condition 3 (Score-before-update):** At the chunk level, the forward pass on the current chunk runs under the `torch.no_grad()` path for accumulation into `loss_sum`, and the LoRA gradient step runs only after that accumulation is complete (the `if needs_train:` guard, which is false on the last chunk of each document). At the global level, `train_val_ttt_global_sgd_distributed` is invoked at phase boundaries on tokens from already-scored documents, not on live tokens. The last chunk of each training slice is explicitly skipped (`is_last_chunk: continue`) as a protective measure.
4. **Condition 4 (Single left-to-right pass):** Each batch is claimed exactly once via `_claim_next_batch` (atomic file-lock counter). No rescoring loop. `loss_sum` is append-only throughout evaluation.

The MP-SGD code paths in this submission are unchanged from PR #1626, which has already been accepted as Issue #1017 Track B compliant in the community.
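
To make the score-before-update ordering concrete, here is an illustrative sketch of a compliant per-document chunk loop. It is not PR #1626's code; all names are placeholders, and it assumes a model whose forward call returns an object with a `.loss` tensor:

```python
import torch

def score_document(model, chunks, lora_optimizer):
    """Illustrative only: every chunk is scored before any update that could
    see it, and the final chunk of a document is never trained on."""
    loss_sum, scored_chunks = 0.0, []
    for i, chunk in enumerate(chunks):
        with torch.no_grad():
            loss_sum += model(chunk).loss.item()   # score first (Condition 3)
        scored_chunks.append(chunk)                # later global SGD may reuse only these
        if i == len(chunks) - 1:
            continue                               # protective skip of the last chunk
        model(chunk).loss.backward()               # LoRA step only after scoring
        lora_optimizer.step()
        lora_optimizer.zero_grad()
    return loss_sum, scored_chunks
```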

## Hardware / reproducibility

- **Pod:** 8x NVIDIA H100 80GB HBM3 SXM in Kansas City, Missouri (US-MO-1 datacenter)
- **Per-GPU GEMM (pod-test.sh measurement):** 0.21 ms bf16 4096x4096 (about 657 TFLOPS per GPU)
- **NVLink:** 18 bonded NVLinks per GPU pair (NV18 all-pairs)
- **CPU:** Intel Xeon Platinum 8470, 208 threads
- **Torch:** 2.9.1+cu128, Triton 3.5.1, flash_attn_interface prebuilt wheel from https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
- **Image:** `runpod/pytorch:1.0.3-cu1281-torch291-ubuntu2404`

## Run command (per seed)

```bash
# Env defaults reproduce the submission exactly:
SEED=<seed> \
TTT_ENABLED=1 \
PHASED_TTT_ENABLED=1 \
PHASED_TTT_NUM_PHASES=3 \
PHASED_TTT_PREFIX_DOCS=2000 \
GLOBAL_TTT_LR=0.001 \
GLOBAL_TTT_EPOCHS=1 \
SMEAR_GATE=0 \
GATE_ATTN_OUT=0 \
DATA_DIR=/workspace/track-a/data/ \
ARTIFACT_DIR=<output dir> \
RUN_ID=<run id> \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Attribution notes

- The `train_gpt.py` in this folder contains two development-only shims that are inert on H100: (1) a Flash Attention backend auto-detect that falls through from FA3 to FA2 to SDPA based on `torch.cuda.get_device_capability` (activates only on `cc[0]==12`, Blackwell), and (2) a Triton block-size override for `linear_leaky_relu_square_kernel` (activates only on `cc[0]==12`). Both are no-ops on H100 Hopper and do not affect the submission path. They exist so the same file can be developed on a Blackwell dev box (where FA3 runtime kernels fail) without forking the code.
- No changes to the core model architecture, training loop, quantization pipeline, or evaluation code relative to PR #1667 and PR #1626.

## Delta vs the MP-SGD source (PR #1626)

- PR #1626 with vanilla SP8192 reports val_bpb 1.07193 (single seed in the PR log; I did not rerun it).
- This submission's 3-seed mean is 1.07285. The ~0.001 gap is within the 3-seed std (0.00051 here) plus what I'd expect from the seed mix we used (1337, 42, 0) vs PR #1626's seed choice.
- I did not introduce SmearGate or AttnOutGate (both disabled). I did not introduce CaseOps (vanilla SP8192). The only deliberate change to the MP-SGD recipe is inheriting PR #1667's base config defaults (for example, `MATRIX_LR=0.04`, `EMBED_LR=0.05`, `MUON_WD=0.095`, which differ slightly from PR #1626's defaults).
@@ -0,0 +1,9 @@
{
"name": "SP8192 + No Gates + Multi-Phase Global SGD TTT (3-seed)",
"val_bpb": 1.07285,
"bytes_total": 15938268,
"blurb": "pr1667 MarioPaerle base architecture with SmearGate and AttnOutGate both disabled, and single-phase score-first TTT replaced by pr1626 dexhunter's Multi-Phase Global SGD TTT (3 phases, 2000 prefix docs). 3-seed mean val_bpb 1.07285, std 0.00051 across seeds 1337, 42, 0 on 8xH100 SXM. No tokenizer changes (vanilla SP8192). Compliant with all four conditions of Issue #1017 Track B.",
"author": "Nathan Maine",
"github_id": "NathanMaine",
"date": "2026-04-22"
}