`records/track_non_record_16mb/2026-03-28_Distillation/README.md`
# Knowledge Distillation: A Negative Result (val_bpb: 1.1529, no improvement over baseline)

**val_bpb: 1.1529** (best distillation config, seed 42, 8xH100 SXM) | baseline: **1.1401** | **15.6 MB artifact**

## The Question

Can a bigger teacher model help train a smaller student when every training step counts? I trained a 105M-parameter teacher, cached its logits, and ran a "systematic ablation" across two distillation approaches, two teachers, and multiple mixing ratios.

To be clear, this is a non-record submission on the Unlimited Compute track. The goal was to see whether distillation is even feasible under the constraints of this competition. I'm conveniently ignoring the fact that you'd also have to figure out how to train the teacher model within the time budget (spoiler: 6 hours of H200 wall-clock doesn't fit in 10 minutes).

Short answer: no. At this budget, whatever benefit distillation provides and the step penalty it incurs roughly cancel each other out. Which makes me think the next experiment isn't a better neural teacher but a cheaper one. That's a different submission, though.

## Teacher Model

| Metric | Value |
|---------------|--------------------------------|
| Architecture | 16L, 768-dim, 12 heads, MLP 4x |
| Parameters | 105.5M |
| Training | 50K steps, 5.6 hours on 4xH200 |
| Float BPB | 1.099 |
| Training cost | ~5.6 h wall-clock on 4xH200 (~22 GPU-hours) |

0.05 BPB better than the student on its own. Legit teacher.

## What I Tried

Two distillation approaches, two teachers, multiple mixing ratios. All compared against the same student trained on hard labels.

### Hard Distillation (teacher's top-1 prediction as label)

Swap some fraction of the ground truth labels with whatever the teacher is most confident about. Like asking a friend who took the test last year for the answers instead of studying the textbook.
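
A minimal sketch of the mixing scheme, assuming a per-position Bernoulli mask decides which labels get swapped (the function and names are mine, not the repo's):

```python
import torch

def mix_hard_labels(targets, teacher_top1, alpha):
    """Sketch (my names): replace a random fraction `alpha` of the ground-truth
    labels with the teacher's argmax prediction at the same position."""
    swap = torch.rand_like(targets, dtype=torch.float) < alpha  # Bernoulli(alpha) mask
    return torch.where(swap, teacher_top1, targets)
```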

| Teacher | Alpha | Sliding BPB | vs Baseline |
|----------------------|-------|-------------|-------------|
| none (baseline) | 0.0 | 1.152 | — |
| self (27M, 1.15 BPB) | 0.1 | 1.242 | +0.090 |
| self (27M, 1.15 BPB) | 0.5 | 1.559 | +0.407 |
| big (105M, 1.10 BPB) | 0.1 | 1.242 | +0.090 |

Catastrophic at every setting. Turns out your friend wasn't as solid on the material as they thought. The teacher's top-1 prediction is just a noisy version of the ground truth, wrong often enough to actively hurt training. Teacher quality doesn't even matter here. The 105M teacher and the 27M self-teacher produce nearly identical results at alpha=0.1 (1.2417 vs 1.2423).

### Soft Distillation (KL divergence against teacher's distribution)

Instead of the teacher's best guess, use the probability distribution as a soft target. The student learns "the teacher thinks it's probably X, maybe Y, definitely not Z." Standard Hinton et al. approach.
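
For reference, a sketch of the Hinton-style loss this implements (names are mine). The `reduction="batchmean"` matters: without per-sample normalization the KL term can be thousands of times larger than the cross-entropy and drown the hard-label signal, which is exactly what one failed run in the log below hit.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, alpha=0.1, T=2.0):
    """Hinton-style soft distillation, sketched with my own names.
    Logits are (N, vocab), targets are (N,)."""
    ce = F.cross_entropy(student_logits, targets)  # hard-label term
    # KL(teacher || student) at temperature T; the T^2 factor keeps gradient
    # scale comparable to the CE term (Hinton et al., 2015)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * ce + alpha * kl
```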

I cached the teacher's top-32 logits per position to avoid running the teacher during training. The output vocabulary is 1,024 tokens (BigramHash is input-side only and doesn't affect the prediction head), so the cache covers 3.1% of the vocabulary. A necessary compromise to keep I/O overhead from killing the run. The lookup added ~11ms per step (159ms vs 148ms baseline).
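
Roughly what the one-time caching pass looks like; a sketch under assumed names and storage format (the real cache layout isn't shown here):

```python
import torch

@torch.no_grad()
def cache_teacher_topk(teacher, loader, k=32, path="workspace/teacher_topk.pt"):
    """One-time pass (sketch, assumed format): store the teacher's top-k logits
    and token ids per position so training never runs the teacher. Top-32 of a
    1024-token output vocab is a 32x reduction in stored values."""
    vals, idxs = [], []
    for batch in loader:
        logits = teacher(batch)           # (B, T, 1024)
        v, i = logits.topk(k, dim=-1)     # keep only the head of the distribution
        vals.append(v.half().cpu())       # fp16 logit values
        idxs.append(i.to(torch.int16).cpu())  # ids < 1024 fit in int16
    torch.save({"values": torch.cat(vals), "indices": torch.cat(idxs)}, path)
```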

| Teacher | Alpha | Temp | Sliding BPB | vs Baseline |
|------------|-------|------|-------------|-------------|
| none | 0.0 | — | 1.152 | — |
| big (105M) | 0.1 | 2.0 | 1.155 | +0.003 |
| big (105M) | 0.3 | 2.0 | 1.156 | +0.003 |
| big (105M) | 0.5 | 2.0 | 1.156 | +0.004 |

Way better than hard distillation. But still can't beat the baseline. The gap is roughly 0.003 BPB regardless of alpha. Flat across the sweep.

## Why It Doesn't Work (given our time constraints)

The ~11ms step overhead from cached logit lookup costs ~280 training steps (3773 vs 4051 in 600s). The distillation benefit has to overcome that step penalty. It doesn't. Knowledge transfer and step cost roughly cancel out.
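
The step math, from the measured step times:

```python
# Step-budget arithmetic for the 600s run, using measured step times.
budget_s = 600
baseline_steps = budget_s / 0.148   # ~4054 (observed: 4051)
distill_steps = budget_s / 0.159    # ~3774 (observed: 3773)
print(baseline_steps - distill_steps)  # ~280 steps lost to the ~11ms lookup
```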

It's like reading CliffsNotes instead of the actual book. Sure, CliffsNotes are a shortcut. But if you only have 10 minutes to study, it doesn't really matter which version you pick; you still only studied for 10 minutes. The shortcut has to be dramatically more efficient per second to make up for that.

### The Online Distillation Problem

I also tried running the teacher live during training (no caching). Teacher forward pass pushed step time to 761ms. This resulted in only 789 steps instead of 4051. The student barely trained. 1.564 BPB. Cached logits are mandatory.
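
For contrast, the online variant; a sketch with assumed names (`distill_loss` is the function sketched earlier) showing where the time goes:

```python
import torch

# Online distillation (no cache): the teacher runs on every training step.
for x, y in train_loader:
    with torch.no_grad():
        t_logits = teacher(x)  # full 105M forward pass every step; this alone
                               # is what pushed step time from 148ms to 761ms
    s_logits = student(x)      # (B, T, vocab)
    loss = distill_loss(s_logits.flatten(0, 1), t_logits.flatten(0, 1),
                        y.flatten(), alpha=0.5, T=2.0)
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```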

## Extended Training: Does Distillation Ever Cross Baseline?

The 10-minute results left an open question: maybe distillation just needs more time. So I ran both configs for 2 hours and tracked val_bpb along the way.

| Step | Baseline | Distillation | Delta |
|-------|----------|--------------|--------|
| 1K | 1.306 | 1.308 | +0.002 |
| 5K | 1.232 | 1.233 | +0.001 |
| 10K | 1.207 | 1.210 | +0.003 |
| 20K | 1.195 | 1.195 | 0.000 |
| 30K | 1.191 | 1.190 | -0.001 |
| 40K | 1.188 | 1.189 | +0.001 |
| 45K | 1.121 | 1.134 | +0.013 |
| final | 1.106 | 1.107 | +0.001 |

The curves track almost exactly. No crossover. At step 30K the distillation model briefly pulls ahead by 0.001, then falls back.

The step 45K jump is warmdown kicking in. Both models drop sharply, but baseline benefits more (1.188 → 1.121, a −0.067 drop) than distillation (1.189 → 1.134, only −0.055). That +0.013 delta is the largest gap in the entire extended run. They converge again by the final checkpoint. I don't have a clean explanation for why distillation gets less out of warmdown, but the gap doesn't stick around so I'm not reading too much into it.

One thing that changed at the longer timescale: per-step overhead dropped from ~11ms to ~7.7ms (back-calculated from 46356 vs 48777 total steps in 7200s). Probably the OS page cache warming up for the logit lookups. So the overhead argument actually gets weaker over time... but the distillation benefit doesn't grow to fill that gap. The curves just stay locked together.
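
The back-calculation, made explicit:

```python
# Long-run per-step overhead from total step counts over 7200s.
overhead_s = 7200 / 46356 - 7200 / 48777   # 155.3ms - 147.6ms per step
print(overhead_s)                          # ~0.0077 -> the ~7.7ms figure above
```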

## The Top-32 Caveat

I only cached the teacher’s top-32 logits per position out of a 1,024-token output vocabulary. That’s 96.9% of the distribution I’m throwing away. Hinton’s whole "dark knowledge" argument is that the tail carries useful signal ("the teacher thinks this is definitely not Z"), and I’m cutting that tail off entirely.

This was a deliberate tradeoff, not an oversight. The top-32 tokens capture the vast majority of the teacher’s probability mass, and caching the full 1,024 distribution would have increased cache size and lookup time. At a 10-minute budget, I/O is the enemy.

So this experiment really tests whether the plausible alternatives (the top of the teacher’s distribution) can provide enough signal to pay for their own overhead. They couldn’t.
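
One plausible way to compute the soft loss from a truncated cache, renormalizing the teacher's top-32 over its own support (a sketch; I'm not claiming this is byte-for-byte what the run did):

```python
import torch.nn.functional as F

def topk_kl(student_logits, cached_vals, cached_idx, T=2.0):
    """Soft loss from a truncated cache (sketch, my names). The teacher's
    top-32 logits are softmaxed over their own support, so everything the
    teacher knew about the other 992 tokens is simply gone."""
    t = F.softmax(cached_vals.float() / T, dim=-1)     # (N, 32) renormalized teacher
    log_s = F.log_softmax(student_logits / T, dim=-1)  # (N, 1024) full student
    log_s = log_s.gather(-1, cached_idx.long())        # student log-probs at cached ids
    # Cross-entropy part of KL(t || s); the teacher-entropy term is constant
    # w.r.t. the student, so it doesn't affect gradients.
    return -(t * log_s).sum(-1).mean() * (T * T)
```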

## What Actually Happened

Hard distillation is catastrophic. The teacher's most confident prediction is a noisy label that hurts every time, and teacher quality doesn't change this. A 105M teacher does the same damage as a 27M self-teacher.

Soft KL distillation nearly matches baseline but can't beat it. The full distribution (well, top-32 of it) is useful signal, but the step overhead offsets it. Net result: ~0.003 BPB worse at all alpha values.

More training time doesn't fix it. Extended 2-hour runs show the curves tracking identically across ~49K baseline steps (~46K for distillation, since it loses steps to overhead). The step overhead drops at longer timescales, but distillation benefit doesn't grow to fill the gap.

The teacher's soft predictions aren't worth the cost. At this model scale, on this data distribution, whatever the teacher knows about inter-token correlations doesn't translate into faster student learning. That doesn't mean the dark knowledge isn't real (it probably is, and caching only top-32 logits limits what I can say here). It means whatever information is there isn't enough to overcome even modest overhead per step.

Running the teacher live during training is a non-starter. The forward pass overhead guts the training budget. Cached logits are the only viable path, and even those aren't enough.

## 3-Seed Validation (8xH100 SXM, 600s)

Self-distillation (student as its own teacher, same architecture):

| Seed | Baseline BPB | Distillation BPB | Delta | Notes |
|------|-------------|-----------------|-------|-------|
| 42 | 1.140 | 1.153 | +0.013 | |
| 1337 | 1.139 | 1.140 | +0.001 | baseline artifact slightly over 16MB* |
| 2024 | 1.140 | 1.141 | +0.001 | distill artifact slightly over 16MB* |

Distillation never beats baseline across any seed. The average gap (+0.005) is consistent with the H200 findings. Seed 42 shows a larger gap, likely due to random variation in warmdown timing.

*Seeds 1337 baseline (16.15 MB) and 2024 distillation (16.17 MB) are slightly over the 16,000,000-byte artifact limit due to seed-dependent quantization variance. Included as auxiliary evidence for the negative result, not as compliant runs. The remaining 4 of 6 runs are within limits and independently confirm the finding.

## Architecture

Student model (same as baseline):

| Component | Detail |
|---------------|------------------|
| Layers | 11 |
| Dim | 512 |
| Heads | 8 (4 KV, GQA) |
| MLP | 3x, relu-squared |
| XSA | Last 4 layers |
| EMA | 0.997 |
| Stored params | ~25M |

## Run Commands

```bash
# Cache teacher logits (one-time, ~5 min)
CACHE_TEACHER_LOGITS=1 TEACHER_PATH=workspace/teacher_model.pt \
torchrun --standalone --nproc_per_node=4 train_gpt.py

# Train student with cached soft KL distillation
DISTILL=1 DISTILL_CACHED=1 DISTILL_SOFT=1 DISTILL_ALPHA=0.1 \
DISTILL_TEMP=2.0 VE_ENABLED=1 WARMDOWN_ITERS=1600 \
NUM_LAYERS=11 XSA_LAST_N=4 EMA_ENABLED=1 LATE_QAT=1 \
BIGRAM_VOCAB_SIZE=6144 \
torchrun --standalone --nproc_per_node=4 train_gpt.py

# Baseline (no distillation)
VE_ENABLED=1 WARMDOWN_ITERS=1600 NUM_LAYERS=11 XSA_LAST_N=4 \
EMA_ENABLED=1 LATE_QAT=1 BIGRAM_VOCAB_SIZE=6144 \
torchrun --standalone --nproc_per_node=4 train_gpt.py
```

## Credits

First distillation experiment in Parameter Golf. Inspired by Hinton et al. (2015), "Distilling the Knowledge in a Neural Network."
---
## Distillation Teacher: COMPLETE
- Steps: 50000 @ 401ms (5.6 hours)
- val_bpb (float): 1.0986
- val_bpb (int6, sliding s64): **1.0246**
- Artifact: 52.25 MB (teacher, no size constraint)
- Checkpoint: workspace/ema_e15ddc5e-29bd-4b08-95b4-352d324ac4dd.pt
- Notes: Strong teacher. Ready for distillation experiments.

## Experiment DIST-2: Distillation alpha=0.5, temp=2.0
- Config: DISTILL=1 DISTILL_ALPHA=0.5 DISTILL_TEMP=2.0 + best student config
- Steps: 789 @ 761ms (only 789 steps due to ~5x overhead from the teacher forward!)
- val_bpb (float): 1.3399 (at step 789)
- val_bpb (int6, sliding): 1.5640
- Delta vs no-distill: +0.41 (MUCH WORSE)
- Status: DROP
- Notes: Distillation pushes step time to ~5x (761ms vs 148ms), giving only 789 steps vs 4051 normally. The student can't train enough in 600s with the teacher overhead. Distillation would need >5x sample efficiency to be worthwhile, which it doesn't achieve at this early stage.

**CONCLUSION: Distillation is not viable under the 600s training constraint** on 4xH200. The teacher forward pass overhead is too expensive. Would need either:
- Cached teacher logits (precompute and save to disk)
- Or much longer training budget (unlimited compute track)

## SELFDIST-2: Cached self-distillation alpha=0.5 temp=2.0
- Steps: 1231 @ 488ms
- val_bpb: 3.88 (TERRIBLE, not learning)
- Status: DROP
- Notes: Two problems: (1) step overhead still 3x from extra forward_logits call (488ms vs 148ms), (2) KL div loss magnitude (~14000) drowns hard label signal. The student can't learn from either source. Need to either remove the extra forward pass or reduce alpha dramatically.

**KEY INSIGHT:** The extra `forward_logits()` call for student logits is the main bottleneck, not the teacher. Need to modify model.forward() to return both loss AND logits, or compute distillation loss inside the model's forward pass.
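
A sketch of that fix, with the model internals elided and assumed:

```python
import torch.nn as nn
import torch.nn.functional as F

class Student(nn.Module):
    """Stand-in for the real model; backbone details elided/assumed."""
    def __init__(self, vocab=1024, dim=512):
        super().__init__()
        self.backbone = nn.Embedding(vocab, dim)  # placeholder for the 11-layer stack
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    def forward(self, idx, targets, return_logits=False):
        logits = self.lm_head(self.backbone(idx))  # one forward pass
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        # Handing logits back with the loss lets the training loop add the
        # distillation term without a second forward_logits() call.
        return (loss, logits) if return_logits else loss
```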

## SELFDIST-3: Hard distillation (teacher top-1), alpha=0.5
- Steps: 3969 @ 151ms (near-normal speed!)
- val_bpb (sliding): 1.559 (WORSE than no-distill 1.15)
- Status: DROP
- Notes: Step time is good (151ms, only 3ms overhead) but alpha=0.5 is too much teacher influence. The self-teacher's top-1 predictions are often wrong, hurting learning.

## SELFDIST-5a: Hard distillation alpha=0.1
- Steps: 3947 @ 152ms
- val_bpb (sliding): 1.2423 (still worse than no-distill 1.15)
- Status: DROP

## Self-Distillation Summary:
| Alpha | Sliding BPB | vs Baseline |
|-------|------------|-------------|
| 0.0 (baseline) | 1.1521 | — |
| 0.1 | 1.2423 | +0.09 (worse) |
| 0.5 | 1.5590 | +0.41 (worse) |

**CONCLUSION: Self-distillation from same-size teacher hurts at all alpha values.**

---

## SELFDIST-6: Big teacher (105M) hard distillation alpha=0.1
- Steps: 3972 @ 151ms
- val_bpb (sliding): 1.2417 (same as self-teacher, worse than baseline 1.15)
- Status: DROP

## Distillation Summary (all hard distillation, cached top-1):
| Teacher | Alpha | Sliding BPB | vs Baseline |
|---------|-------|------------|-------------|
| None | 0.0 | 1.1521 | — |
| Self (27M) | 0.1 | 1.2423 | +0.09 |
| Self (27M) | 0.5 | 1.5590 | +0.41 |
| Big (105M) | 0.1 | 1.2417 | +0.09 |

---

## Sanity Check: Modified script with all features OFF
- Steps: 4051 @ 148ms
- val_bpb (sliding): 1.1522 (matches exp 26's 1.1521 exactly!)
- Artifact: 15.8 MB (within limit)
- Status: PASS, modifications don't leak

## SELFDIST-4: Soft KL div, big teacher, alpha=0.5, T=2.0
- Steps: 3751 @ 160ms (~12ms overhead)
- val_bpb (sliding): **1.1558** (+0.0036 vs baseline 1.1522)
- Status: CLOSE but still worse
- Notes: Soft KL is MUCH better than hard top-1 (1.156 vs 1.242). Only +0.004 vs baseline. The overhead is minimal (160ms vs 148ms). Try lower alpha to see if we can break even.

## Updated Distillation Table:
| Teacher | Alpha | Loss Type | Sliding BPB | Delta |
|---------|-------|-----------|-------------|-------|
| none | 0.0 | hard labels | 1.1522 | — |
| self (1.15) | 0.5 | hard top-1 | 1.559 | +0.407 |
| self (1.15) | 0.1 | hard top-1 | 1.242 | +0.090 |
| big (1.10) | 0.1 | hard top-1 | 1.242 | +0.090 |

---

## SELFDIST-5b: Soft KL, big teacher, alpha=0.3
- Steps: 3765 @ 159ms
- val_bpb (sliding): 1.1555 (+0.0033 vs baseline)
- Notes: Slightly better than alpha=0.5. Still +0.003 worse than baseline.

| Alpha | Sliding BPB | Delta vs baseline |
|-------|------------|-------------------|
| 0.0 (baseline) | 1.1522 | — |
| 0.3 | 1.1555 | +0.003 |
| 0.5 | 1.1558 | +0.004 |

## SELFDIST-5c: Soft KL, big teacher, alpha=0.1
- Steps: 3773 @ 159ms
- val_bpb (sliding): 1.1553 (+0.0031 vs baseline)

## Final Soft KL Alpha Sweep:
| Alpha | Sliding BPB | Delta |
|-------|------------|-------|
| 0.0 | 1.1522 | — |
| 0.1 | 1.1553 | +0.003 |
| 0.3 | 1.1555 | +0.003 |
| 0.5 | 1.1558 | +0.004 |

**CONCLUSION: Soft KL distillation is uniformly ~0.003 worse than baseline at all alpha values.** The distillation adds ~11ms overhead (159ms vs 148ms = ~7%) which costs ~280 steps. That step penalty roughly equals the distillation benefit, netting out to zero or slightly negative.
---

```json
{
"name": "Fielding Johnston",
"github": "fielding",
"val_bpb": 1.1529,
"seeds": {
"42": {"baseline_bpb": 1.1401, "distill_bpb": 1.1529, "baseline_bytes": 15695954, "distill_bytes": 15641113},
"1337": {"baseline_bpb": 1.1386, "distill_bpb": 1.1403, "baseline_bytes": 16149583, "distill_bytes": 15387900, "baseline_over_limit": true},
"2024": {"baseline_bpb": 1.1403, "distill_bpb": 1.1413, "distill_bytes": 16168277, "baseline_bytes": 15632705, "distill_over_limit": true}
},
"val_bpb_note": "Negative result: distillation never beats baseline across 3 seeds. Seeds 1337 baseline and 2024 distillation are slightly over the 16MB artifact limit due to seed-dependent quantization variance; included as auxiliary evidence, not as compliant runs.",
"artifact_bytes": 15641113,
"hardware": "8xH100 SXM",
"training_time_seconds": 600,
"key_techniques": [
"knowledge_distillation",
"self_distillation",
"soft_kl_divergence",
"hard_distillation_top1",
"negative_result"
]
}
```