# Approach Study: Higher-Rank Output Heads on a Frontier 11L Baseline

This folder documents a non-record study of higher-rank output heads on top of a fixed frontier-aligned 11-layer baseline.

Research question:
- On a strong fixed 11L control, can higher-rank output heads outperform the standard tied softmax head under the 10-minute training budget?

## Summary

- Control: the standard tied head reached `1.1734 val_bpb` in `600s`.
- Result: every tested higher-rank head variant underperformed the control, often by a large margin.
- Artifact impact: mixture heads increased artifact size, while the simplex head reduced artifact size substantially but collapsed the validation score.
- Main finding: on this frontier-aligned small-budget regime, the standard tied head remained the strongest option. Extra output-head structure behaved as an optimization burden rather than a compression win.

## Fixed Baseline

All runs used the same training and evaluation stack:
- 11 layers, `d_model=512`, `8` query heads, `4` KV heads
- `MLP_MULT=3`
- EMA from init (`alpha=0.997`)
- XSA on the last `4` layers
- SmearGate enabled
- BigramHash enabled (`2048` buckets, `128` dim)
- partial RoPE (`16` rotary dims) with NTK-aware scaling
- LN Scale enabled
- VE128 enabled on layers `9,10`
- Late QAT enabled at `lr_scale < 0.15`
- `seq2048`, `786432` train tokens/step
- sliding evaluation (`stride=64`)
- `8xH100`, `600s` wallclock cap
- Hopper FA3, compiled training, and the real quantization/artifact path

Only one family parameter changed across the study:
- output-head type and its local bottleneck settings
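Concretely, the per-variant overrides can be read directly off the records in `family_heads.jsonl`; note that the factorized runs keep `HEAD_TYPE=standard` and vary `RANK_DIM`, while the mixture and simplex runs switch `HEAD_TYPE`:

```python
# Per-variant env overrides, as recorded in family_heads.jsonl.
# All other env vars are held fixed at the baseline values above.
HEAD_OVERRIDES = {
    "H0": {"HEAD_TYPE": "standard", "RANK_DIM": "0"},
    "H1": {"HEAD_TYPE": "standard", "RANK_DIM": "64"},
    "H2": {"HEAD_TYPE": "standard", "RANK_DIM": "128"},
    "H3": {"HEAD_TYPE": "mixture_softmax", "MIXTURE_SOFTMAX_K": "2", "MIXTURE_RANK_DIM": "64"},
    "H4": {"HEAD_TYPE": "mixture_softmax", "MIXTURE_SOFTMAX_K": "4", "MIXTURE_RANK_DIM": "64"},
    "H5": {"HEAD_TYPE": "mixture_softmax", "MIXTURE_SOFTMAX_K": "4", "MIXTURE_RANK_DIM": "128"},
    "H6": {"HEAD_TYPE": "simplex", "SIMPLEX_DIM": "128"},
}
```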

## Variants

Tested head family:
- `H0`: standard tied head
- `H1`: factorized head, rank `64`
- `H2`: factorized head, rank `128`
- `H3`: mixture-softmax, `K=2`, rank `64`
- `H4`: mixture-softmax, `K=4`, rank `64`
- `H5`: mixture-softmax, `K=4`, rank `128`
- `H6`: simplex head, bottleneck `128`

## Results

| Run | Head Variant | `val_bpb` | Δ vs `H0` | Steps | Artifact bytes | Notes |
|-----|--------------|----------:|----------:|------:|---------------:|-------|
| `H0` | standard tied head | `1.1734` | `0.0000` | `4415` | `16826913` | control |
| `H1` | factorized `r=64` | `2.4396` | `+1.2662` | `4451` | `16729834` | severe degradation |
| `H2` | factorized `r=128` | `1.9227` | `+0.7494` | `4425` | `16918260` | still far worse than control |
| `H3` | MoS `K=2`, `r=64` | `2.6167` | `+1.4434` | `4428` | `16565348` | severe degradation |
| `H4` | MoS `K=4`, `r=64` | `2.7112` | `+1.5379` | `4149` | `17172588` | worst mixture result |
| `H5` | MoS `K=4`, `r=128` | `2.0898` | `+0.9165` | `4160` | `17943057` | worse score and larger artifact |
| `H6` | simplex `128` | `4.1069` | `+2.9336` | `4241` | `10950817` | smallest artifact, unusable score |

The result is unambiguous: none of the tested higher-rank heads improved the frontier-aligned control, and several failed catastrophically.
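The Δ column above can be recomputed from the full-precision `val_bpb` values recorded in `family_heads.jsonl`:

```python
# Full-precision val_bpb per run, copied from family_heads.jsonl.
val_bpb = {
    "H0": 1.1733618, "H1": 2.43959467, "H2": 1.92273838,
    "H3": 2.61671791, "H4": 2.7112436, "H5": 2.08984766,
    "H6": 4.10691873,
}
control = val_bpb["H0"]
# Delta vs the tied-head control, rounded to the table's 4 decimals.
deltas = {k: round(v - control, 4) for k, v in val_bpb.items()}
```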

## Interpretation

This study does not show that higher-rank output heads are useless in general. It shows something narrower and still useful:
- on this specific frontier-aligned 11L budgeted regime,
- with a strong tied-head baseline already in place,
- extra output-head structure was harder to optimize than the standard head,
- and the added expressivity did not translate into better compression.

The negative result is still useful for future work:
- if this family is revisited, it likely needs a different training regime rather than a direct swap on top of a tuned small-budget control
- the simplex head is notable as an artifact-size reduction idea, but not as a quality-preserving one in this form
- the mixture-head variants were the clearest failure mode: more parameters in the output head did not buy better compression here

## Why There Is No Separate Confirmatory Matrix

Unlike the semantic-tube study, this family sweep was already run on the intended fast path:
- compiled training
- Hopper FA3
- full `80` training shards
- sliding evaluation
- real quantization and artifact generation

So the family sweep itself already serves as the authoritative result set for this study.

## Included Files

Included here:
- `family_heads.jsonl`: raw study results
- `family_heads_review.md`: compact study summary
- `train_gpt.py`: self-contained study-local training script
- `install_flash_attn_hopper.sh`: Hopper-only FA3 installer used by the study runner
- `run_higher_rank_heads_study.sh`: self-contained family runner
- `REPRODUCE.md`: reproduction commands
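For quick inspection of the raw results, a minimal sketch (field names taken from the records in `family_heads.jsonl`; the file path is relative to this folder):

```python
import json

def summarize(jsonl_lines):
    """Return (variation, val_bpb, artifact_size_bytes) for each JSONL record."""
    rows = []
    for line in jsonl_lines:
        rec = json.loads(line)
        rows.append((rec["variation"], rec["val_bpb"], rec["artifact_size_bytes"]))
    return rows

# Typical use:
# with open("family_heads.jsonl") as f:
#     for variation, bpb, size in summarize(f):
#         print(f"{variation:20s} {bpb:.4f} {size}")
```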
# Reproduction

The study runner is self-contained:
- it uses the `train_gpt.py` included in this folder
- it ensures the full `fineweb10B_sp1024` dataset is present (`80` training shards)
- it installs or reuses a Hopper-only FA3 wheel before training
- it runs the full 7-variant family on the intended fast path
- it copies the fresh per-run console logs for the rerun into `logs/` inside this folder

## Run The Full Family

```bash
bash records/track_non_record_16mb/2026-03-26_HigherRankHeads_11L_Study/run_higher_rank_heads_study.sh
```

This reruns the full family:
- `H0`: standard tied head
- `H1`: factorized `r=64`
- `H2`: factorized `r=128`
- `H3`: mixture-softmax `K=2`, `r=64`
- `H4`: mixture-softmax `K=4`, `r=64`
- `H5`: mixture-softmax `K=4`, `r=128`
- `H6`: simplex `128`

Expected budget:
- `7` runs
- about `70` minutes total on `8xH100`
{"id": 1, "timestamp": "2026-03-26T18:28:50Z", "approach": "head_family", "variation": "control_standard", "subvariation": "seed42", "commit": "76ceb8e", "env_vars": {"NUM_LAYERS": "11", "MODEL_DIM": "512", "NUM_HEADS": "8", "NUM_KV_HEADS": "4", "MLP_MULT": "3", "MATRIX_LR": "0.025", "SCALAR_LR": "0.025", "MUON_MOMENTUM": "0.99", "SKIP_QUANT": "0", "MAX_WALLCLOCK_SECONDS": "600", "LAMBDA_SPECTRAL": "0.0", "LAMBDA_TUBE": "0.0", "RANK_DIM": "0", "ATTENTION_TYPE": "standard", "EVAL_STRIDE": "64", "WARMDOWN_ITERS": "3500", "COMPRESS_ALGO": "zstd", "QUANT_BITS_MIDDLE": "6", "NUM_GPUS": "8", "USE_QAT": "0", "QAT_BITS": "6", "TRAIN_SEQ_LEN": "2048", "SMEAR_GATE": "1", "BIGRAM_HASH": "1", "BIGRAM_VOCAB_SIZE": "2048", "BIGRAM_DIM": "128", "MUON_WEIGHT_DECAY": "0.04", "ADAM_WD": "0.04", "INIT_TYPE": "ortho", "USE_EMA": "1", "EMA_ALPHA": "0.997", "USE_NTK_ROPE": "1", "USE_FA3": "1", "FLASH_ATTN_BACKEND": "fa3", "FLASH_ATTN_STRICT": "1", "FLASH_ATTN_ARCH_LIST": "9.0", "QUANT_BITS_MLP": "6", "TIED_EMBED_LR": "0.035", "MUON_MOMENTUM_WARMUP_START": "0.92", "MUON_MOMENTUM_WARMUP_STEPS": "1500", "GRAD_CLIP_NORM": "0.3", "USE_TTT": "0", "XSA_LAST_N": "4", "ROPE_DIMS": "16", "LN_SCALE": "1", "LATE_QAT": "1", "QAT_THRESHOLD": "0.15", "TRAIN_BATCH_TOKENS": "786432", "EMA_FROM_INIT": "1", "GRAD_GUIDED_QUANT": "0", "BACKOUT_CONNECTION": "0", "SPECTRAL_WARMDOWN": "0", "VALUE_EMBED": "1", "VE_DIM": "128", "VE_LAYERS": "9,10", "TOKEN_SALIENCY": "0", "QUANT_ERROR_DIFFUSION": "0", "QUANT_HADAMARD": "0", "LAMBDA_VICREG": "0.0", "HYPERGRAPH_LIFT": "0", "HYPERGRAPH_SCALES": "2,4,8", "HEAD_TYPE": "standard"}, "fn_injection": null, "val_bpb": 1.1733618, "val_loss": 1.97221607, "peak_memory_mb": 26709, "training_time_ms": 600059, "num_params": 26993766, "num_steps": 4415, "artifact_size_bytes": 16826913, "artifact_est_bytes": 0, "status": "pending", "reward": 0.5, "description": "H0: Control baseline (standard head)", "tags": []}
{"id": 2, "timestamp": "2026-03-26T18:41:35Z", "approach": "head_family", "variation": "factorized_r64", "subvariation": "seed42", "commit": "76ceb8e", "env_vars": {"NUM_LAYERS": "11", "MODEL_DIM": "512", "NUM_HEADS": "8", "NUM_KV_HEADS": "4", "MLP_MULT": "3", "MATRIX_LR": "0.025", "SCALAR_LR": "0.025", "MUON_MOMENTUM": "0.99", "SKIP_QUANT": "0", "MAX_WALLCLOCK_SECONDS": "600", "LAMBDA_SPECTRAL": "0.0", "LAMBDA_TUBE": "0.0", "RANK_DIM": "64", "ATTENTION_TYPE": "standard", "EVAL_STRIDE": "64", "WARMDOWN_ITERS": "3500", "COMPRESS_ALGO": "zstd", "QUANT_BITS_MIDDLE": "6", "NUM_GPUS": "8", "USE_QAT": "0", "QAT_BITS": "6", "TRAIN_SEQ_LEN": "2048", "SMEAR_GATE": "1", "BIGRAM_HASH": "1", "BIGRAM_VOCAB_SIZE": "2048", "BIGRAM_DIM": "128", "MUON_WEIGHT_DECAY": "0.04", "ADAM_WD": "0.04", "INIT_TYPE": "ortho", "USE_EMA": "1", "EMA_ALPHA": "0.997", "USE_NTK_ROPE": "1", "USE_FA3": "1", "FLASH_ATTN_BACKEND": "fa3", "FLASH_ATTN_STRICT": "1", "FLASH_ATTN_ARCH_LIST": "9.0", "QUANT_BITS_MLP": "6", "TIED_EMBED_LR": "0.035", "MUON_MOMENTUM_WARMUP_START": "0.92", "MUON_MOMENTUM_WARMUP_STEPS": "1500", "GRAD_CLIP_NORM": "0.3", "USE_TTT": "0", "XSA_LAST_N": "4", "ROPE_DIMS": "16", "LN_SCALE": "1", "LATE_QAT": "1", "QAT_THRESHOLD": "0.15", "TRAIN_BATCH_TOKENS": "786432", "EMA_FROM_INIT": "1", "GRAD_GUIDED_QUANT": "0", "BACKOUT_CONNECTION": "0", "SPECTRAL_WARMDOWN": "0", "VALUE_EMBED": "1", "VE_DIM": "128", "VE_LAYERS": "9,10", "TOKEN_SALIENCY": "0", "QUANT_ERROR_DIFFUSION": "0", "QUANT_HADAMARD": "0", "LAMBDA_VICREG": "0.0", "HYPERGRAPH_LIFT": "0", "HYPERGRAPH_SCALES": "2,4,8", "HEAD_TYPE": "standard"}, "fn_injection": null, "val_bpb": 2.43959467, "val_loss": 4.10053219, "peak_memory_mb": 26725, "training_time_ms": 604650, "num_params": 27092070, "num_steps": 4451, "artifact_size_bytes": 16729834, "artifact_est_bytes": 0, "status": "pending", "reward": -0.5, "description": "H1: Factorized head rank=64", "tags": []}
{"id": 3, "timestamp": "2026-03-26T18:54:01Z", "approach": "head_family", "variation": "factorized_r128", "subvariation": "seed42", "commit": "76ceb8e", "env_vars": {"NUM_LAYERS": "11", "MODEL_DIM": "512", "NUM_HEADS": "8", "NUM_KV_HEADS": "4", "MLP_MULT": "3", "MATRIX_LR": "0.025", "SCALAR_LR": "0.025", "MUON_MOMENTUM": "0.99", "SKIP_QUANT": "0", "MAX_WALLCLOCK_SECONDS": "600", "LAMBDA_SPECTRAL": "0.0", "LAMBDA_TUBE": "0.0", "RANK_DIM": "128", "ATTENTION_TYPE": "standard", "EVAL_STRIDE": "64", "WARMDOWN_ITERS": "3500", "COMPRESS_ALGO": "zstd", "QUANT_BITS_MIDDLE": "6", "NUM_GPUS": "8", "USE_QAT": "0", "QAT_BITS": "6", "TRAIN_SEQ_LEN": "2048", "SMEAR_GATE": "1", "BIGRAM_HASH": "1", "BIGRAM_VOCAB_SIZE": "2048", "BIGRAM_DIM": "128", "MUON_WEIGHT_DECAY": "0.04", "ADAM_WD": "0.04", "INIT_TYPE": "ortho", "USE_EMA": "1", "EMA_ALPHA": "0.997", "USE_NTK_ROPE": "1", "USE_FA3": "1", "FLASH_ATTN_BACKEND": "fa3", "FLASH_ATTN_STRICT": "1", "FLASH_ATTN_ARCH_LIST": "9.0", "QUANT_BITS_MLP": "6", "TIED_EMBED_LR": "0.035", "MUON_MOMENTUM_WARMUP_START": "0.92", "MUON_MOMENTUM_WARMUP_STEPS": "1500", "GRAD_CLIP_NORM": "0.3", "USE_TTT": "0", "XSA_LAST_N": "4", "ROPE_DIMS": "16", "LN_SCALE": "1", "LATE_QAT": "1", "QAT_THRESHOLD": "0.15", "TRAIN_BATCH_TOKENS": "786432", "EMA_FROM_INIT": "1", "GRAD_GUIDED_QUANT": "0", "BACKOUT_CONNECTION": "0", "SPECTRAL_WARMDOWN": "0", "VALUE_EMBED": "1", "VE_DIM": "128", "VE_LAYERS": "9,10", "TOKEN_SALIENCY": "0", "QUANT_ERROR_DIFFUSION": "0", "QUANT_HADAMARD": "0", "LAMBDA_VICREG": "0.0", "HYPERGRAPH_LIFT": "0", "HYPERGRAPH_SCALES": "2,4,8", "HEAD_TYPE": "standard"}, "fn_injection": null, "val_bpb": 1.92273838, "val_loss": 3.23178711, "peak_memory_mb": 26736, "training_time_ms": 599968, "num_params": 27190374, "num_steps": 4425, "artifact_size_bytes": 16918260, "artifact_est_bytes": 0, "status": "pending", "reward": -0.5, "description": "H2: Factorized head rank=128", "tags": []}
{"id": 4, "timestamp": "2026-03-26T19:06:33Z", "approach": "head_family", "variation": "mos_k2_r64", "subvariation": "seed42", "commit": "76ceb8e", "env_vars": {"NUM_LAYERS": "11", "MODEL_DIM": "512", "NUM_HEADS": "8", "NUM_KV_HEADS": "4", "MLP_MULT": "3", "MATRIX_LR": "0.025", "SCALAR_LR": "0.025", "MUON_MOMENTUM": "0.99", "SKIP_QUANT": "0", "MAX_WALLCLOCK_SECONDS": "600", "LAMBDA_SPECTRAL": "0.0", "LAMBDA_TUBE": "0.0", "RANK_DIM": "0", "ATTENTION_TYPE": "standard", "EVAL_STRIDE": "64", "WARMDOWN_ITERS": "3500", "COMPRESS_ALGO": "zstd", "QUANT_BITS_MIDDLE": "6", "NUM_GPUS": "8", "USE_QAT": "0", "QAT_BITS": "6", "TRAIN_SEQ_LEN": "2048", "SMEAR_GATE": "1", "BIGRAM_HASH": "1", "BIGRAM_VOCAB_SIZE": "2048", "BIGRAM_DIM": "128", "MUON_WEIGHT_DECAY": "0.04", "ADAM_WD": "0.04", "INIT_TYPE": "ortho", "USE_EMA": "1", "EMA_ALPHA": "0.997", "USE_NTK_ROPE": "1", "USE_FA3": "1", "FLASH_ATTN_BACKEND": "fa3", "FLASH_ATTN_STRICT": "1", "FLASH_ATTN_ARCH_LIST": "9.0", "QUANT_BITS_MLP": "6", "TIED_EMBED_LR": "0.035", "MUON_MOMENTUM_WARMUP_START": "0.92", "MUON_MOMENTUM_WARMUP_STEPS": "1500", "GRAD_CLIP_NORM": "0.3", "USE_TTT": "0", "XSA_LAST_N": "4", "ROPE_DIMS": "16", "LN_SCALE": "1", "LATE_QAT": "1", "QAT_THRESHOLD": "0.15", "TRAIN_BATCH_TOKENS": "786432", "EMA_FROM_INIT": "1", "GRAD_GUIDED_QUANT": "0", "BACKOUT_CONNECTION": "0", "SPECTRAL_WARMDOWN": "0", "VALUE_EMBED": "1", "VE_DIM": "128", "VE_LAYERS": "9,10", "TOKEN_SALIENCY": "0", "QUANT_ERROR_DIFFUSION": "0", "QUANT_HADAMARD": "0", "LAMBDA_VICREG": "0.0", "HYPERGRAPH_LIFT": "0", "HYPERGRAPH_SCALES": "2,4,8", "HEAD_TYPE": "mixture_softmax", "MIXTURE_SOFTMAX_K": "2", "MIXTURE_RANK_DIM": "64"}, "fn_injection": null, "val_bpb": 2.61671791, "val_loss": 4.39824539, "peak_memory_mb": 27889, "training_time_ms": 600007, "num_params": 27191398, "num_steps": 4428, "artifact_size_bytes": 16565348, "artifact_est_bytes": 0, "status": "pending", "reward": -0.5, "description": "H3: Mixture softmax K=2 rank=64", "tags": []}
{"id": 5, "timestamp": "2026-03-26T19:19:17Z", "approach": "head_family", "variation": "mos_k4_r64", "subvariation": "seed42", "commit": "76ceb8e", "env_vars": {"NUM_LAYERS": "11", "MODEL_DIM": "512", "NUM_HEADS": "8", "NUM_KV_HEADS": "4", "MLP_MULT": "3", "MATRIX_LR": "0.025", "SCALAR_LR": "0.025", "MUON_MOMENTUM": "0.99", "SKIP_QUANT": "0", "MAX_WALLCLOCK_SECONDS": "600", "LAMBDA_SPECTRAL": "0.0", "LAMBDA_TUBE": "0.0", "RANK_DIM": "0", "ATTENTION_TYPE": "standard", "EVAL_STRIDE": "64", "WARMDOWN_ITERS": "3500", "COMPRESS_ALGO": "zstd", "QUANT_BITS_MIDDLE": "6", "NUM_GPUS": "8", "USE_QAT": "0", "QAT_BITS": "6", "TRAIN_SEQ_LEN": "2048", "SMEAR_GATE": "1", "BIGRAM_HASH": "1", "BIGRAM_VOCAB_SIZE": "2048", "BIGRAM_DIM": "128", "MUON_WEIGHT_DECAY": "0.04", "ADAM_WD": "0.04", "INIT_TYPE": "ortho", "USE_EMA": "1", "EMA_ALPHA": "0.997", "USE_NTK_ROPE": "1", "USE_FA3": "1", "FLASH_ATTN_BACKEND": "fa3", "FLASH_ATTN_STRICT": "1", "FLASH_ATTN_ARCH_LIST": "9.0", "QUANT_BITS_MLP": "6", "TIED_EMBED_LR": "0.035", "MUON_MOMENTUM_WARMUP_START": "0.92", "MUON_MOMENTUM_WARMUP_STEPS": "1500", "GRAD_CLIP_NORM": "0.3", "USE_TTT": "0", "XSA_LAST_N": "4", "ROPE_DIMS": "16", "LN_SCALE": "1", "LATE_QAT": "1", "QAT_THRESHOLD": "0.15", "TRAIN_BATCH_TOKENS": "786432", "EMA_FROM_INIT": "1", "GRAD_GUIDED_QUANT": "0", "BACKOUT_CONNECTION": "0", "SPECTRAL_WARMDOWN": "0", "VALUE_EMBED": "1", "VE_DIM": "128", "VE_LAYERS": "9,10", "TOKEN_SALIENCY": "0", "QUANT_ERROR_DIFFUSION": "0", "QUANT_HADAMARD": "0", "LAMBDA_VICREG": "0.0", "HYPERGRAPH_LIFT": "0", "HYPERGRAPH_SCALES": "2,4,8", "HEAD_TYPE": "mixture_softmax", "MIXTURE_SOFTMAX_K": "4", "MIXTURE_RANK_DIM": "64"}, "fn_injection": null, "val_bpb": 2.7112436, "val_loss": 4.55712656, "peak_memory_mb": 28686, "training_time_ms": 599989, "num_params": 27389030, "num_steps": 4149, "artifact_size_bytes": 17172588, "artifact_est_bytes": 0, "status": "pending", "reward": -0.5, "description": "H4: Mixture softmax K=4 rank=64", "tags": []}
{"id": 6, "timestamp": "2026-03-26T19:31:47Z", "approach": "head_family", "variation": "mos_k4_r128", "subvariation": "seed42", "commit": "76ceb8e", "env_vars": {"NUM_LAYERS": "11", "MODEL_DIM": "512", "NUM_HEADS": "8", "NUM_KV_HEADS": "4", "MLP_MULT": "3", "MATRIX_LR": "0.025", "SCALAR_LR": "0.025", "MUON_MOMENTUM": "0.99", "SKIP_QUANT": "0", "MAX_WALLCLOCK_SECONDS": "600", "LAMBDA_SPECTRAL": "0.0", "LAMBDA_TUBE": "0.0", "RANK_DIM": "0", "ATTENTION_TYPE": "standard", "EVAL_STRIDE": "64", "WARMDOWN_ITERS": "3500", "COMPRESS_ALGO": "zstd", "QUANT_BITS_MIDDLE": "6", "NUM_GPUS": "8", "USE_QAT": "0", "QAT_BITS": "6", "TRAIN_SEQ_LEN": "2048", "SMEAR_GATE": "1", "BIGRAM_HASH": "1", "BIGRAM_VOCAB_SIZE": "2048", "BIGRAM_DIM": "128", "MUON_WEIGHT_DECAY": "0.04", "ADAM_WD": "0.04", "INIT_TYPE": "ortho", "USE_EMA": "1", "EMA_ALPHA": "0.997", "USE_NTK_ROPE": "1", "USE_FA3": "1", "FLASH_ATTN_BACKEND": "fa3", "FLASH_ATTN_STRICT": "1", "FLASH_ATTN_ARCH_LIST": "9.0", "QUANT_BITS_MLP": "6", "TIED_EMBED_LR": "0.035", "MUON_MOMENTUM_WARMUP_START": "0.92", "MUON_MOMENTUM_WARMUP_STEPS": "1500", "GRAD_CLIP_NORM": "0.3", "USE_TTT": "0", "XSA_LAST_N": "4", "ROPE_DIMS": "16", "LN_SCALE": "1", "LATE_QAT": "1", "QAT_THRESHOLD": "0.15", "TRAIN_BATCH_TOKENS": "786432", "EMA_FROM_INIT": "1", "GRAD_GUIDED_QUANT": "0", "BACKOUT_CONNECTION": "0", "SPECTRAL_WARMDOWN": "0", "VALUE_EMBED": "1", "VE_DIM": "128", "VE_LAYERS": "9,10", "TOKEN_SALIENCY": "0", "QUANT_ERROR_DIFFUSION": "0", "QUANT_HADAMARD": "0", "LAMBDA_VICREG": "0.0", "HYPERGRAPH_LIFT": "0", "HYPERGRAPH_SCALES": "2,4,8", "HEAD_TYPE": "mixture_softmax", "MIXTURE_SOFTMAX_K": "4", "MIXTURE_RANK_DIM": "128"}, "fn_injection": null, "val_bpb": 2.08984766, "val_loss": 3.51266861, "peak_memory_mb": 28738, "training_time_ms": 600000, "num_params": 27782246, "num_steps": 4160, "artifact_size_bytes": 17943057, "artifact_est_bytes": 0, "status": "pending", "reward": -0.5, "description": "H5: Mixture softmax K=4 rank=128", "tags": []}
{"id": 7, "timestamp": "2026-03-26T19:44:13Z", "approach": "head_family", "variation": "simplex_128", "subvariation": "seed42", "commit": "76ceb8e", "env_vars": {"NUM_LAYERS": "11", "MODEL_DIM": "512", "NUM_HEADS": "8", "NUM_KV_HEADS": "4", "MLP_MULT": "3", "MATRIX_LR": "0.025", "SCALAR_LR": "0.025", "MUON_MOMENTUM": "0.99", "SKIP_QUANT": "0", "MAX_WALLCLOCK_SECONDS": "600", "LAMBDA_SPECTRAL": "0.0", "LAMBDA_TUBE": "0.0", "RANK_DIM": "0", "ATTENTION_TYPE": "standard", "EVAL_STRIDE": "64", "WARMDOWN_ITERS": "3500", "COMPRESS_ALGO": "zstd", "QUANT_BITS_MIDDLE": "6", "NUM_GPUS": "8", "USE_QAT": "0", "QAT_BITS": "6", "TRAIN_SEQ_LEN": "2048", "SMEAR_GATE": "1", "BIGRAM_HASH": "1", "BIGRAM_VOCAB_SIZE": "2048", "BIGRAM_DIM": "128", "MUON_WEIGHT_DECAY": "0.04", "ADAM_WD": "0.04", "INIT_TYPE": "ortho", "USE_EMA": "1", "EMA_ALPHA": "0.997", "USE_NTK_ROPE": "1", "USE_FA3": "1", "FLASH_ATTN_BACKEND": "fa3", "FLASH_ATTN_STRICT": "1", "FLASH_ATTN_ARCH_LIST": "9.0", "QUANT_BITS_MLP": "6", "TIED_EMBED_LR": "0.035", "MUON_MOMENTUM_WARMUP_START": "0.92", "MUON_MOMENTUM_WARMUP_STEPS": "1500", "GRAD_CLIP_NORM": "0.3", "USE_TTT": "0", "XSA_LAST_N": "4", "ROPE_DIMS": "16", "LN_SCALE": "1", "LATE_QAT": "1", "QAT_THRESHOLD": "0.15", "TRAIN_BATCH_TOKENS": "786432", "EMA_FROM_INIT": "1", "GRAD_GUIDED_QUANT": "0", "BACKOUT_CONNECTION": "0", "SPECTRAL_WARMDOWN": "0", "VALUE_EMBED": "1", "VE_DIM": "128", "VE_LAYERS": "9,10", "TOKEN_SALIENCY": "0", "QUANT_ERROR_DIFFUSION": "0", "QUANT_HADAMARD": "0", "LAMBDA_VICREG": "0.0", "HYPERGRAPH_LIFT": "0", "HYPERGRAPH_SCALES": "2,4,8", "HEAD_TYPE": "simplex", "SIMPLEX_DIM": "128"}, "fn_injection": null, "val_bpb": 4.10691873, "val_loss": 6.90301248, "peak_memory_mb": 26760, "training_time_ms": 599973, "num_params": 27190374, "num_steps": 4241, "artifact_size_bytes": 10950817, "artifact_est_bytes": 0, "status": "pending", "reward": -0.5, "description": "H6: Simplex head dim=128", "tags": []}
# Higher-Rank Output Heads Family Study

- Family: `higher_rank_heads`
- Source JSONL: `family_heads.jsonl`
- Runs: `7`
- Best result: `control_standard` at `1.1734 val_bpb`

## Fixed Baseline

11L/512d fixed backbone with EMA, XSA4, SmearGate, BigramHash, partial RoPE, LN Scale, VE128 on late layers, Late QAT, `seq2048`, Hopper FA3, compiled training, sliding evaluation, and the real quantization/artifact path.

## Results

| ID | Variation | `val_bpb` | Steps | Time (s) | Artifact bytes |
|---:|---|---:|---:|---:|---:|
| 1 | control_standard | 1.1734 | 4415 | 600.1 | 16826913 |
| 2 | factorized_r64 | 2.4396 | 4451 | 604.6 | 16729834 |
| 3 | factorized_r128 | 1.9227 | 4425 | 600.0 | 16918260 |
| 4 | mos_k2_r64 | 2.6167 | 4428 | 600.0 | 16565348 |
| 5 | mos_k4_r64 | 2.7112 | 4149 | 600.0 | 17172588 |
| 6 | mos_k4_r128 | 2.0898 | 4160 | 600.0 | 17943057 |
| 7 | simplex_128 | 4.1069 | 4241 | 600.0 | 10950817 |

## Main Finding

The standard tied head outperformed every tested higher-rank alternative on this frontier-aligned 11L baseline. The simplex head reduced artifact size substantially but at an unusable quality cost. The mixture-softmax variants were both worse in score and, for the larger mixtures, larger in artifact size.