# Experiment Log: 12L + Low-Rank Q + QAT + FTLE + Stride-OGD

**Hardware: 1xH100 80GB HBM3 (development/testing)**
**Target: 8xH100 SXM for final submission**

## Plan

Combine 4 novel techniques that nobody has combined in the competition yet:
1. **Low-Rank Q + 12 layers**: Low-Rank Q factorization (rank=128) gives ~8% faster steps per layer, funding 12 layers instead of 10
2. **QAT with STE**: Quantization-Aware Training with straight-through estimator reduces quant gap from ~0.016 to ~0.005 BPB
3. **FTLE-guided per-row precision**: Instead of blanket int6 for middle layers, use accumulated gradient sensitivity (FTLE) to allocate precision per-row. Hot rows get int6-7, cold rows get int4-5
4. **Stride-OGD at eval**: Online gradient descent on a 1024-dim vocab bias during stride-64 sliding window eval — free BPB improvement

## Size Budget Analysis

- 12L with Low-Rank Q (r=128): ~19.4M params
- Need mixed precision to fit in 16MB
- Target: avg ~int5.5 effective bits → ~15MB compressed (back-of-envelope check after this list)
- fp16 embedding (tied) stays at 1.0MB
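
A rough arithmetic check of the budget above, using the plan's own estimates rather than measurements (it ignores the overlap between the fp16 embedding and the total param count, and per-row scale overhead):

```python
# Back-of-envelope size check; all numbers are the plan's estimates, not measurements.
params = 19.4e6                            # 12L with Low-Rank Q (r=128)
avg_bits = 5.5                             # target average effective precision
weights_mb = params * avg_bits / 8 / 1e6   # ≈ 13.3 MB of quantized weights
embed_mb = 1.0                             # fp16 tied embedding
print(f"{weights_mb:.1f} MB weights + {embed_mb:.1f} MB embedding ≈ {weights_mb + embed_mb:.1f} MB")
# per-row scales, metadata, and code size push the artifact toward the ~15 MB target
```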

---

## Log

### 2026-03-21 03:20 UTC — Project kickoff (1xH100)
- Analyzed current SOTA: 1.1748 bpb (10L, sliding window, fp16 embed, Muon WD, overtone init)
- Analyzed int6 mixed precision record: 1.2147 bpb (10L, int8/int6 mixed)
- Designed combined approach targeting 12L + all 4 techniques
- Created record directory, beginning implementation
- Data download started (10 shards for dev testing)

### 2026-03-21 03:25 UTC — Implementation start (1xH100)
- Writing train_gpt.py based on current SOTA script
- Adding Low-Rank Q, QAT, FTLE gradient tracking, per-row precision quantization, Stride-OGD eval
- Script: 1382 lines (under 1500 limit)

### 2026-03-21 03:28 UTC — Smoke test v1 FAILED (1xH100)
- `enable_gqa` kwarg not supported in PyTorch 2.4.1
- Fix: manual KV head repetition for GQA compatibility
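
A minimal sketch of the workaround, assuming the usual GQA layout (more query heads than KV heads); variable names are illustrative, not the script's:

```python
import torch.nn.functional as F

def gqa_attention(q, k, v):
    # q: [B, n_head, T, head_dim]; k, v: [B, n_kv_head, T, head_dim]
    # PyTorch 2.4.1 has no enable_gqa kwarg, so expand the KV heads manually.
    rep = q.size(1) // k.size(1)
    k = k.repeat_interleave(rep, dim=1)
    v = v.repeat_interleave(rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```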

### 2026-03-21 03:30 UTC — Smoke test v2 (1xH100, 143 steps)
- **Model: 20,999,264 params** (12L, Low-Rank Q r=128)
- **Memory: 17,438 MiB** (fits well on 80GB H100)
- **Step time: ~840ms on 1xH100** → est. ~105ms/step on 8xH100
- **val_bpb: 3.0655** at step 143 (very early, loss still dropping fast)
- **Artifact: 6.6MB at int6** — way under 16MB! Bug found: quant search was starting at int6, not int8

### 2026-03-21 03:35 UTC — Fixes applied (1xH100)
- Fixed quant bit search to go int8→int7→...→int5 (high to low; see the sketch after this list)
- Increased QAT default bits from 6 to 7 (matches likely export precision)
- Fixed QAT activation bug: now works with both wallclock and iteration-based triggers
- Started 2000-step training run for meaningful metrics
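
A sketch of the corrected search order, with a simplified per-tensor quantizer standing in for the script's per-row scheme; the decimal-MB interpretation of the 16MB cap is an assumption based on the "0.9MB over" arithmetic later in this log:

```python
import zlib
import numpy as np

CAP_BYTES = 16_000_000  # assumption: cap counted in decimal MB

def quantize_and_compress(tensors, bits):
    # simplified per-tensor symmetric quantization; the real script quantizes per row
    qmax = 2 ** (bits - 1) - 1
    chunks = []
    for w in tensors:
        scale = max(np.abs(w).max(), 1e-8) / qmax
        chunks.append(np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8).tobytes())
    return zlib.compress(b"".join(chunks), 9)

def pick_bits(tensors, candidates=(8, 7, 6, 5)):
    # search high to low, so precision is only given up when the artifact would not fit
    for bits in candidates:
        blob = quantize_and_compress(tensors, bits)
        if len(blob) <= CAP_BYTES:
            return bits, blob
    raise ValueError("even the lowest candidate width does not fit the cap")
```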

### 2026-03-21 03:55 UTC — 2000-step test results (1xH100, no QAT, no OGD)
- **val_bpb: 1.2720** (pre-quant, standard eval) at step 2000
- **val_bpb: 1.2517** (post-quant, sliding window stride=64): a free -0.02 from the sliding window (eval sketch after this list)
- Quant gap: 0.0203 bpb (FTLE-guided at avg 6.5 bits)
- Step time: **609.6ms on 1xH100** → est. **~76ms/step on 8xH100** → ~7900 steps in 10min
- Memory: 17,310 MiB
- Artifact: **15,213,080 bytes** (under 16MB cap!)
- Compression results: int8→17.6MB, int7.5→17.0MB, int7→16.3MB, **int6.5→15.2MB**
- GPU: 74-94% SM util, 544-570W, ~10% MFU (expected for 512-dim bandwidth-bound model)
- Note: QAT did NOT activate (bug with wallclock=0, now fixed)
- Note: OGD was disabled for this test
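
A sketch of the stride-64 sliding-window evaluation referenced above. It assumes `tokens` is a 1D LongTensor, `model(x)` returns `[1, T, vocab]` logits, and one token corresponds to one byte (the actual tokenizer ratio would scale the result):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=1024, stride=64):
    # Each window re-reads up to `window` tokens of context, but only tokens not yet
    # scored (the last `stride`, after the first window) contribute to the loss, so
    # every scored token sees close to a full window of context.
    nll, scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens) - 1, stride):
        end = min(begin + window, len(tokens) - 1)
        x = tokens[begin:end].unsqueeze(0)
        y = tokens[begin + 1:end + 1].clone()
        y[: -(end - prev_end)] = -100            # mask targets already scored by earlier windows
        logits = model(x)
        nll += F.cross_entropy(logits[0], y, ignore_index=-100, reduction="sum").item()
        scored += end - prev_end
        prev_end = end
        if end == len(tokens) - 1:
            break
    return nll / scored / math.log(2)            # nats -> bits, assuming one byte per token
```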

### Key observations from 2000-step test:
- 1.2517 bpb at only 2000 steps is already within 0.03 of the baseline (1.2244)
- FTLE tracked 98 tensors over 20 gradient samples
- 12L is learning well even at reduced step count
- Quant gap of 0.0203 is large — QAT should reduce this significantly

### 2026-03-21 04:10 UTC — Full 7900-step run started (1xH100, QAT + OGD)
- Simulating 8xH100 10min (7900 steps at est. ~76ms/step on 8xH100)
- QAT enabled at step 790 (10% of training), int7 fake quantization
- OGD eval enabled (stride=64, lr=0.1)
- WARMDOWN_ITERS=2000

### 2026-03-21 05:30 UTC — 7900-step training COMPLETE, eval killed (1xH100)
- **Pre-quant val_bpb: 1.2035** at step 7900!

Training curve:
| Step | val_bpb | Note |
|------|---------|------|
| 1000 | 1.3799 | QAT just enabled at 790 |
| 2000 | 1.3285 | |
| 3000 | 1.3106 | |
| 4000 | 1.2980 | |
| 5000 | 1.2935 | |
| 6000 | 1.2852 | Warmdown started at 5900 |
| 7000 | 1.2447 | |
| 7900 | **1.2035** | Final |

- Step time: ~616ms/step (est. ~77ms on 8xH100 → ~7800 steps in 10min)
- QAT overhead: ~6% step time increase (615→654ms at activation)
- FTLE: 98 tensors over 79 gradient samples
- Compression: int6.0 avg bits → **15.5MB** (under 16MB cap)
- Quant gap TBD — OGD eval was killed due to extreme slowness

### Issue: OGD eval too slow
- OGD requires gradient tracking through [256, 1024, 1024] logits tensor
- Memory jumps from 17GB to 28GB during OGD eval
- Estimated 30-60 min for full eval — unacceptable
- Need to either disable OGD or make it batch-efficient

### 2026-03-21 06:00 UTC — FTLE ablation: FTLE does NOT help (1xH100)

Ran A/B comparison on saved 7900-step model weights: uniform per-row quantization
vs FTLE-guided per-row quantization at matched bit widths.

| Bits | Method | Compressed size (bytes) | RMSE | Fits 16MB? |
|------|--------|-----------------|------|------------|
| 8 | Uniform int8 | 18,847,383 | 0.002144 | NO |
| 8 | FTLE avg 8 | 18,708,158 | 0.002795 | NO |
| 7 | **Uniform int7** | **16,909,971** | **0.004326** | NO (by 0.9MB) |
| 7 | FTLE avg 7 | 17,254,281 | 0.005466 | NO |
| 6 | **Uniform int6** | **15,178,239** | **0.008781** | YES |
| 6 | FTLE avg 6 | 15,436,864 | 0.010927 | YES |
| 5 | Uniform int5 | 12,696,812 | 0.018136 | YES |
| 5 | FTLE avg 5 | 13,086,431 | 0.020907 | YES |

**Conclusion: Uniform beats FTLE on BOTH size and RMSE at every bit width.**

FTLE-guided mixed precision (int4–int8 per row) produces:
- Higher RMSE: mixing int4 "cold" rows with int8 "hot" rows is worse than uniform int6
- Larger compressed size: mixed bit values have higher entropy, zlib compresses them worse

**Recommendation:** Drop FTLE entirely. Use uniform int6 (15.2MB, fits) or try to
squeeze uniform int7 (16.9MB, 0.9MB over — could work with code size reduction or
slightly smaller model).
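
For reference, a sketch of the kind of per-row uniform quantization and size/RMSE measurement behind the A/B table above; the helper names are illustrative, and the real pipeline's container format and compressor settings may differ:

```python
import zlib
import numpy as np

def quantize_rows(w, bits):
    # uniform symmetric per-row quantization: one fp scale per row, integer codes per entry
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def ablation_point(weights, bits):
    # weights: list of 2D float32 arrays; returns (compressed byte count, RMSE vs original)
    payload, sq_err, count = [], 0.0, 0
    for w in weights:
        q, scale = quantize_rows(w, bits)
        payload.append(q.tobytes())
        payload.append(scale.astype(np.float16).tobytes())
        deq = q.astype(np.float32) * scale
        sq_err += float(((w - deq) ** 2).sum())
        count += w.size
    size = len(zlib.compress(b"".join(payload), 9))
    return size, (sq_err / count) ** 0.5
```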

### Projected final bpb (without FTLE):
- Pre-quant: 1.2035
- Uniform int6 + sliding window (no OGD): est. **~1.19 bpb**
- Uniform int7 + sliding window (if fits): est. **~1.17-1.18 bpb**
- Current SOTA for reference: **1.1748 bpb**

---

## Summary of Technique Effectiveness (as of 2026-03-21)

| Technique | Status | Verdict |
|-----------|--------|---------|
| Low-Rank Q (r=128) + 12 layers | Working | Promising — 12L at 1.2035 pre-quant |
| QAT with STE (int7) | Working | ~6% step overhead, quant gap TBD |
| FTLE per-row precision | Tested | **Not helpful — uniform is strictly better** |
| Stride-OGD eval | Implemented | Too slow as-is, needs optimization or removal |
| Sliding window eval (stride=64) | Working | Free ~0.03 bpb improvement |
| Muon weight decay | Inherited from SOTA | Working |
| Overtone spectral init | Inherited from SOTA | Working |
| Phase-transition resid_mix | Inherited from SOTA | Working |

---

# 12-Layer Low-Rank Q + QAT: A Cross-Disciplinary Research Pipeline

**Non-record submission** — developed on 1xH100, awaiting 8xH100 for official scoring.

## Results

| Metric | Value |
|--------|-------|
| Pre-quant val_bpb (1xH100, 7900 steps) | **1.2035** |
| Projected post-quant (int6 + sliding window s64) | **~1.19** |
| Architecture | 12L, 512d, MLP 3x, Low-Rank Q (r=128) |
| Params | ~20.9M |
| Artifact | ~15.2MB (uniform int6 + zstd-22) |
| Projected 8xH100 step time | ~77ms → ~7800 steps in 10min |

## Approach

We started from the current SOTA techniques (int6 quantization, MLP 3x, SmearGate, BigramHash, Muon WD, overtone spectral init, sliding window eval) and asked: **what novel contributions can push the frontier?**

Our approach was grounded in cross-disciplinary ideas from dynamical systems, fluid mechanics, and information theory — prototyped cheaply on Apple Silicon, validated on A100, and refined on H100.

### What we built

**1. Low-Rank Q factorization (r=128) → 12 layers**

Inspired by PR #215's finding that Q matrices have extreme condition numbers (>100M), we factor Q as `dim→128→dim` per layer. This saves ~50% of Q params and makes each step ~8% faster. The speed savings fund a **12th transformer layer** — nobody else has gone to 12L yet.

The intuition from linear algebra: Q's effective rank is ~100-120 out of 512. The remaining singular values are noise that quantization destroys anyway. By factoring Q, we remove that noise at training time rather than losing it at quantization time.
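
A minimal sketch of the factorization, assuming a bias-free query projection at model width 512 (module and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class LowRankQ(nn.Module):
    # Replace the full dim x dim query projection with dim -> rank -> dim.
    # At dim=512, rank=128 this is 2 * 512 * 128 = 131,072 params vs 262,144: a 50% saving.
    def __init__(self, dim: int = 512, rank: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))
```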

**2. QAT with Straight-Through Estimator (int7)**

During training, we simulate int7 quantization in the forward pass using fake-quantize, with a straight-through estimator (STE) for gradients. QAT activates at 10% of training, early enough that the model co-adapts to quantization noise for most of the run, at a ~6% step-time overhead.

The motivation: the quantization gap (pre-quant vs post-quant BPB) is one of the largest remaining sources of loss. QAT directly trains the model to be robust to it.
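
A sketch of the fake-quantize forward pass with a straight-through gradient, assuming symmetric per-tensor quantization (the actual script may scale per row):

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int = 7) -> torch.Tensor:
    # Forward: weights rounded to a signed `bits`-wide grid. Backward: the rounding is
    # bypassed (straight-through), so gradients flow as if w were used directly.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax().clamp_min(1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()
```

Once QAT activates, a layer would use the quantized view in its forward pass, e.g. `y = F.linear(x, fake_quant_ste(self.weight))`.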

**3. FTLE-guided per-row precision (tested, negative result)**

We tracked per-row gradient sensitivity during training (a proxy for the Finite-Time Lyapunov Exponent from dynamical systems theory) and used it to allocate quantization precision per row — more bits for "hot" (sensitive) rows, fewer for "cold" rows.

**Result: uniform quantization is strictly better than FTLE-guided at every bit width.** Mixing int4 cold rows with int8 hot rows produces higher RMSE AND larger compressed size (mixed values have higher entropy → worse zstd compression). This is a clean negative result that saves future researchers from this path.
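
For concreteness, a sketch of the kind of per-row sensitivity accumulation and rank-based bit allocation that was tested (and rejected); the accumulation rule and the rank-to-bits mapping are assumptions about the idea, not the script's exact code:

```python
import torch

class RowSensitivity:
    def __init__(self, model: torch.nn.Module):
        # one accumulator per row of every 2D weight matrix
        self.acc = {n: torch.zeros(p.shape[0])
                    for n, p in model.named_parameters() if p.ndim == 2}

    @torch.no_grad()
    def sample(self, model: torch.nn.Module):
        # called periodically after backward(): accumulate squared per-row gradient norms
        for n, p in model.named_parameters():
            if n in self.acc and p.grad is not None:
                self.acc[n] += p.grad.float().pow(2).sum(dim=1).cpu()

    def bits_per_row(self, name: str, lo: int = 4, hi: int = 8) -> torch.Tensor:
        # rank-based allocation: the hottest rows get `hi` bits, the coldest get `lo`
        s = self.acc[name]
        ranks = s.argsort().argsort().float() / max(len(s) - 1, 1)
        return torch.round(lo + ranks * (hi - lo)).long()
```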

**4. Stride-OGD at eval (implemented, too slow)**

Online gradient descent on a 1024-dim vocabulary bias during sliding window evaluation. Zero artifact cost — the bias is computed from the eval text itself. The idea is sound (PR #241) but our implementation requires gradient tracking through [batch, 1024, 1024] logits tensors, which is prohibitively slow (~30-60 min for full eval).
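
A sketch of the idea, plus one way to sidestep the autograd cost: for cross-entropy, the gradient with respect to a logit bias is simply softmax(logits + bias) minus the one-hot target, so the online update can be computed in closed form from logits the eval already produces. This is a proposed fix, not what the current implementation does:

```python
import torch

@torch.no_grad()
def ogd_bias_step(logits: torch.Tensor, targets: torch.Tensor,
                  bias: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    # logits: [T, vocab] for tokens that were just scored; targets: [T]; bias: [vocab]
    probs = torch.softmax(logits + bias, dim=-1)
    grad = probs
    grad[torch.arange(targets.numel(), device=logits.device), targets] -= 1.0  # softmax - one_hot
    bias -= lr * grad.mean(dim=0)          # in-place online update of the vocab bias
    return bias
```

During the sliding-window pass, `bias` would be added to the logits of the next window before scoring it, and updated only from tokens that have already been scored, so no information from unscored text leaks into the adaptation.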

## Research Pipeline

This submission is the output of a 3-stage research pipeline:

### Stage 1: Apple Silicon prototyping (18GB Mac)
- Created `make_mini_shards.py` for sub-1MB data subsets
- Tested layer sharing (depth recurrence): 3 shared blocks unrolled 3x give 9-layer depth at 1/3 the params (sketch after this list)
- Found optimal tiny config: 2 shared, 256d, MLP 3x, 1.45M params → 1.783 BPB locally
- Validated DEQ convergence theory: trained shared blocks become contractive (Lyapunov δ decreasing)
- Built FTLE sensitivity tracking infrastructure
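
A sketch of the layer-sharing (depth-recurrence) configuration referenced above: a few unique blocks unrolled several times, so 3 shared blocks applied 3x stand in for 9 layers at a third of the parameters (the block constructor is a placeholder):

```python
import torch
import torch.nn as nn

class SharedDepthStack(nn.Module):
    def __init__(self, make_block, n_unique: int = 3, n_loops: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(make_block() for _ in range(n_unique))
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 3 unique blocks x 3 loops = 9 effective layers with 1/3 of the parameters
        for _ in range(self.n_loops):
            for block in self.blocks:
                x = block(x)
        return x
```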

### Stage 2: A100 validation (TACC Lonestar6, 1xA100 40GB)
- **Layer sharing abandoned at 512d** — costs 0.09 BPB vs unique layers (the 16MB budget already fits enough unique params)
- Integrated BigramHash + SmearGate → 0.094 BPB improvement
- Best A100 result: **1.3260 BPB** (9L, zstd-22, sliding window s1024)
- Identified 6 high-confidence improvements from competition PRs

### Stage 3: H100 refinement (1xH100 80GB)
- Implemented Low-Rank Q + 12 layers + QAT + FTLE + Stride-OGD
- Pre-quant val_bpb: **1.2035** at 7900 steps
- Clean negative result on FTLE per-row precision
- Stride-OGD needs optimization (too slow as-is)

## What We'd Do With a $500 RunPod Dev Grant

**Phase 1: 8xH100 validation ($25, ~2 hours)**
- Run our 12L + Low-Rank Q + QAT config on 8xH100 for proper scoring
- Expected: ~7800 steps in 10min at ~77ms/step
- A/B test: uniform int6 vs int7 (int7 = 16.9MB, need to trim 0.9MB via smaller code or BigramHash table)
- Target: sub-1.17 BPB post-quant with sliding window

**Phase 2: Hyperparameter sweep ($50, ~4 hours)**
- WD sweep: 0.03, 0.04, 0.05 (competition found 0.04 optimal)
- LR sweep: matrix_lr 0.020-0.035 (PR #198 uses 0.025)
- Muon momentum: 0.95, 0.97, 0.99 (PR #198 uses 0.99 with warmup from 0.92)
- SWA cadence: every 25, 50, 100 steps during warmdown (our continuous SWA hurt, periodic might help)
- Each A/B test: ~$3 per 10-min run

**Phase 3: Novel combinations ($100, ~8 hours)**
- 12L + int5-MLP/int6-attn mixed quantization (PR #180's technique + our 12L)
- QAT specifically for int5 MLP weights (nobody has combined QAT with int5/int6 mixed)
- Fix Stride-OGD eval speed (batch the gradient computation)
- Try 13 layers if Low-Rank Q speed savings allow
- Content-dependent pre-rotation (PR #215's promising-but-failed idea — we'd try a Triton kernel)

**Phase 4: Multi-seed validation + submission ($25)**
- 3-seed runs of the best config
- Statistical significance test (p < 0.01)
- Package records/ folder and submit PR

**Phase 5: Moonshot experiments ($300, remaining budget)**
- Stride-OGD + Two-Pass eval combined with full stack
- NTK-RoPE 4096 at eval (4x context without retraining)
- Adaptive Low-Rank Q (different rank per layer based on spectral analysis)
- BitNet b1.58 exploration (ternary weights for 5x more params in same space)

**Total: ~$200 for competitive submission, ~$500 to explore the frontier.**

## Files

- `train_gpt.py` — Full training script (1388 lines). Based on SOTA WarmdownQuantization record with Low-Rank Q, QAT, FTLE tracking, and Stride-OGD added.
- `EXPERIMENT_LOG.md` — Detailed H100 experiment log with training curves and ablations.

---

{
"author": "SkywardSyntax",
"github_id": "SkywardSyntax",
"name": "12L Low-Rank Q + QAT Research Pipeline (1xH100 development)",
"blurb": "12-layer transformer with Low-Rank Q factorization (r=128) and QAT, developed through a 3-stage pipeline: Apple Silicon prototyping → A100 validation → H100 refinement. Pre-quant val_bpb=1.2035 on 1xH100 (7900 steps). Includes clean negative results on FTLE-guided per-row precision (uniform quantization is strictly better) and Stride-OGD eval (too slow as-is). Awaiting 8xH100 compute for official scoring.",
"date": "2026-03-21T06:00:00Z",
"val_loss": null,
"val_bpb": null,
"bytes_total": null,
"bytes_code": null,
"track": "track_non_record_16mb",
"status": "awaiting_8xH100_compute",
"hardware_used": "1xH100 80GB (development)",
"pre_quant_val_bpb": 1.2035,
"projected_post_quant_bpb": 1.19
}