# Legality audit

## Track constraints

- Training is capped at 600 seconds on 8xH100. The source artifacts stopped at `599.546s`, `599.583s`, and `599.657s` (a minimal cap-check sketch follows this list).
- Evaluation is capped at 600 seconds. The final PPM evals took `510.410s`, `500.300s`, and `497.643s`.
- The artifact cap is decimal `16,000,000` bytes. The largest quantized artifact is `15,946,930` bytes; with the current checked-in compressed code wrapper and no local minifier, the largest measured total is `15,995,881` bytes.
- The submitted score uses `TTT_ENABLED=0`; no validation-set gradient update is part of the score.
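
The `stopping_early: wallclock_cap` lines in the attached logs are consistent with a simple monotonic-clock check each step; a minimal sketch, assuming the loop shape (`max_wallclock_seconds` and the 0.5s reserve mirror names in the hyperparameter dump at the end of this PR):

```python
import time

def train_capped(step_fn, max_wallclock_seconds=600.0, reserve_seconds=0.5, iterations=20000):
    start = time.monotonic()
    for step in range(1, iterations + 1):
        step_fn(step)
        train_time = time.monotonic() - start
        # Stop just under the cap, leaving headroom for quantization/serialization.
        if train_time > max_wallclock_seconds - reserve_seconds:
            print(f"stopping_early: wallclock_cap train_time: {int(train_time * 1000)}ms "
                  f"step: {step}/{iterations}")
            break
```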

## PPM causality

The PPM mixer scores each byte from prefix counts and updates the count after scoring the current byte. The gate is computed from already-observed context statistics before incorporating the current target byte.
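
A minimal sketch of that score-then-update ordering, with a plain smoothed count table standing in for the real order-5 mixer (names are hypothetical):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # context -> next-byte counts
totals = defaultdict(int)                       # context -> total observations

def score_and_update(context: bytes, target: int, alpha: float = 1.0) -> float:
    # 1) Score the target byte using only statistics from already-seen bytes.
    p = (counts[context][target] + alpha) / (totals[context] + 256 * alpha)
    # 2) Only after scoring, fold the target byte into the counts.
    counts[context][target] += 1
    totals[context] += 1
    return p
```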

The byte sidecar is used for BPB accounting and byte-stream scoring alignment. It is not a learned table of validation answers and it is not updated from future bytes.
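
Under that accounting, `val_bpb` reduces to total negative log2-probability over total sidecar bytes; a minimal sketch, assuming a list of per-byte probabilities such as the ones produced above:

```python
import math

def bits_per_byte(byte_probs: list[float]) -> float:
    # byte_probs[i] = P(byte i | causal prefix); no future byte is consulted.
    return sum(-math.log2(p) for p in byte_probs) / len(byte_probs)
```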

## Packed document leakage

SmearGate's one-position forward mixing (each position receives a gated copy of the previous position's activation) is masked at BOS positions, so activations never smear across a packed-document boundary:

```python
# Positions whose own token is BOS must not receive activations smeared
# in from the previous (different) document in the packed batch.
not_bos = (input_ids[:, 1:] != BOS_ID).to(x.dtype).unsqueeze(-1)
# Mix a gated copy of position t-1 into each position t >= 1,
# zeroed wherever position t starts a new document.
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1] * not_bos], dim=1)
```

The same mask is present in both the normal forward path and the TTT forward path.

## Compression and dependencies

The artifact uses per-group `lrzip` compression for grouped int6 tensors and Brotli for the remainder/code wrapper. `lrzip` must be installed in the runtime image before training. The script shells out to an already-installed binary; it does not download packages during evaluation.
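
A minimal sketch of that shell-out pattern (helper names and paths are hypothetical; `-q`, `-f`, `-o`, and `-d` are standard `lrzip` flags):

```python
import shutil
import subprocess

def lrzip_compress(src_path: str, dst_path: str) -> None:
    # Fail fast if the binary is absent; nothing is downloaded at eval time.
    if shutil.which("lrzip") is None:
        raise RuntimeError("lrzip must be preinstalled in the runtime image")
    subprocess.run(["lrzip", "-q", "-f", "-o", dst_path, src_path], check=True)

def lrzip_decompress(src_path: str, dst_path: str) -> None:
    subprocess.run(["lrzip", "-d", "-q", "-f", "-o", dst_path, src_path], check=True)
```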

## Known review surface

This submission inherits the same review surface as the public SP8192 + byte PPM lane:

- custom SP8192 CaseOps tokenizer/data preparation
- per-token byte sidecar used for exact BPB accounting
- causal PPM eval-time adaptation

The v13-specific final change is only the PPM gate retune to `H=0.999`, `L=0.18`, `T=0.80`.
# SP8192 CaseOps v13 PPM tuned gate

This submission consolidates our strongest v13 lane: the SP8192 CaseOps transformer stack with SmearGate BOS masking, per-group `lrzip` compression, and a causal sidecar-aware byte PPM evaluator.

The final score is backed by three fresh end-to-end v13 reruns with the submitted defaults:

```text
PPM_ORDER=5
PPM_H=0.999
PPM_L=0.18
PPM_T=0.80
TTT_ENABLED=0
```

Thanks to Claude for the late-stage experiment design help and to Codex for implementation, audit, run coordination, and packaging. This stack also builds on public Parameter Golf work by @clarkkev, @bigbag, @codemath3000, @OE-GOD, @remg1997, @joshuaswanson, @MarioPaerle, @classiclarryd, @simonbissonnette, @dexhunter, @romeerp, @samacqua, @renqianluo, @jorge-asenjo, @Omrigotlieb, @AnirudhRahul, and @ndokutovich. See `REFERENCES.md` for the component lineage and PR numbers.

## Score

| Seed | Final `ppm_sliding val_bpb` | Artifact bytes | Training stop | Eval time |
|---:|---:|---:|---:|---:|
| 42 | `0.94182660` | `15,987,305` | `4773` steps / `599.686s` | `507.652s` |
| 314 | `0.94146034` | `15,983,753` | `4770` steps / `599.628s` | `516.897s` |
| 999 | `0.94197117` | `15,988,348` | `4772` steps / `599.644s` | `519.029s` |

Three-seed mean:

```text
0.94175270
```

Sample standard deviation:

```text
0.00026331
```
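
Both summary statistics follow directly from the three per-seed scores (`statistics.stdev` is the sample standard deviation, with n-1 in the denominator):

```python
import statistics

scores = [0.94182660, 0.94146034, 0.94197117]  # seeds 42, 314, 999
print(f"{statistics.mean(scores):.8f}")   # 0.94175270
print(f"{statistics.stdev(scores):.8f}")  # 0.00026331
```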

All three fresh artifacts remain under the strict decimal `16,000,000` byte cap. The largest fresh measured artifact plus compressed code wrapper is `15,988,348` bytes.

## What changed

Relative to the previous SP8192 + byte-PPM tuned-gate line, v13 combines:

- CaseOps SP8192 tokenization and byte sidecar accounting for correct `val_bpb` normalization.
- SmearGate with the BOS cross-document leak mask applied in both normal forward and TTT forward paths.
- Per-group `lrzip` compression for banked int6 tensors, with Brotli for the remainder/code wrapper.
- PPM order 5 with the final gate retune `H=0.999`, `L=0.18`, `T=0.80` (a hypothetical gate sketch follows this list).
- TTT disabled for the submitted score, so the validation pass is a single causal PPM scoring pass over the quantized artifact.
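
The gate semantics are not spelled out in this README; purely as a hypothetical illustration of one common adaptive-lambda shape (reading `H` as a high mixing weight, `L` as a low one, and `T` as a confidence threshold on the PPM context), the retuned constants would act like this:

```python
def ppm_mix_weight(ppm_confidence: float,
                   H: float = 0.999, L: float = 0.18, T: float = 0.80) -> float:
    # HYPOTHETICAL reading: weight on the PPM distribution vs. the neural model.
    return H if ppm_confidence >= T else L

def mixed_prob(p_ppm: float, p_model: float, ppm_confidence: float) -> float:
    lam = ppm_mix_weight(ppm_confidence)
    return lam * p_ppm + (1.0 - lam) * p_model
```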

## Lineage and attribution

This is not a from-scratch model. The code is a consolidation of several public Parameter Golf ideas:

- SP8192 tokenizer, recurrence, QK gain, and compact GPT training lineage from PR #1394, PR #1493, and PR #1855.
- Causal byte-PPM mixer lineage from PR #1795, PR #1959, and PR #1991.
- SmearGate / attention output gate lineage from modded-nanogpt @classiclarryd and PR #1667, plus the BOS cross-document leak fix discussed in PR #2014 / the PR #1797 base audit.
- Per-group `lrzip` compression lineage from PR #1586 through PR #1667 / PR #1729-style grouped serialization work.
- LQER/AWQ/asymmetric-rescale and related quantization/optimization pieces from PR #1530, PR #1797, PR #1886, PR #1923, and PR #1855.
- Online n-gram tilt / scoring overlay ideas from PR #1145 and PR #1967, though the submitted score uses the PPM path rather than TTT.

Our specific contribution in this PR is the v13 consolidation, the CaseOps sidecar-aware evaluation packaging, and the final PPM gate retune to `H=0.999`, `L=0.18`, `T=0.80` over the same seed set.

The checked-in script sets the final PPM gate as defaults, so a fresh run follows the same configuration without external environment overrides.
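
As a sketch of what "defaults in the script" means here, using hyperparameter names that appear verbatim in the attached run log (the actual argparse wiring in `train_gpt.py` is assumed):

```python
import argparse

parser = argparse.ArgumentParser()
# Final v13 gate shipped as defaults, so a bare run reproduces the scored path.
parser.add_argument("--ppm_order", type=int, default=5)
parser.add_argument("--ppm_h", type=float, default=0.999)
parser.add_argument("--ppm_l", type=float, default=0.18)
parser.add_argument("--ppm_t", type=float, default=0.80)
parser.add_argument("--ttt_enabled", action="store_true", default=False)
```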

## Evidence notes

The included `fresh_seed*_v13_submit.log` files are full fresh end-to-end runs with the submitted PPM defaults in `train_gpt.py`. The older `train_seed*.log` and paired `eval_seed*_v13_ppm.log` files are retained as lineage/eval-retune evidence, but the headline score in the table above uses the cleaner fresh rerun set.

```text
seed 42:
stopping_early: wallclock_cap train_time: 599686ms step: 4773/20000
Total submission size quantized+pergroup: 15987305 bytes
diagnostic quantized val_loss:2.35586432 val_bpb:1.07646816 eval_time:10407ms
ppm_mixer val_bpb:0.94182660 eval_time:462353ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36677335 val_bpb:0.94182660 eval_time:507652ms

seed 314:
stopping_early: wallclock_cap train_time: 599628ms step: 4770/20000
Total submission size quantized+pergroup: 15983753 bytes
diagnostic quantized val_loss:2.35632034 val_bpb:1.07667653 eval_time:9243ms
ppm_mixer val_bpb:0.94146034 eval_time:471320ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36627199 val_bpb:0.94146034 eval_time:516897ms

seed 999:
stopping_early: wallclock_cap train_time: 599644ms step: 4772/20000
Total submission size quantized+pergroup: 15988348 bytes
diagnostic quantized val_loss:2.35838976 val_bpb:1.07762211 eval_time:8788ms
ppm_mixer val_bpb:0.94197117 eval_time:473888ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36682950 val_bpb:0.94197117 eval_time:519029ms
```

The earlier eval-only three-seed mean was `0.94174862`; the fresh end-to-end mean is `0.94175270`. The difference is only `0.00000408` bpb, and the fresh set is the cleaner evidence for review.

## Exact final lines

Seed 42:

```text
ppm_mixer val_bpb:0.94151072 eval_time:464892ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36642906 val_bpb:0.94151072 eval_time:510410ms
fresh ppm_mixer val_bpb:0.94182660 eval_time:462353ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
fresh ppm_sliding val_loss:2.36677335 val_bpb:0.94182660 eval_time:507652ms
```

Seed 314:

```text
ppm_mixer val_bpb:0.94180705 eval_time:454770ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36687117 val_bpb:0.94180705 eval_time:500300ms
fresh ppm_mixer val_bpb:0.94146034 eval_time:471320ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
fresh ppm_sliding val_loss:2.36627199 val_bpb:0.94146034 eval_time:516897ms
```

Seed 999:

```text
ppm_mixer val_bpb:0.94192810 eval_time:452193ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36740764 val_bpb:0.94192810 eval_time:497643ms
fresh ppm_mixer val_bpb:0.94197117 eval_time:473888ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
fresh ppm_sliding val_loss:2.36682950 val_bpb:0.94197117 eval_time:519029ms
```

## Included files

- `train_gpt.py` - exact submitted script, with v13 PPM defaults set to `0.999/0.18/0.80`
- `train_seed42.log`, `train_seed314.log`, `train_seed999.log` - source training logs for the three artifacts
- `eval_seed42_v13_ppm.log`, `eval_seed314_v13_ppm.log`, `eval_seed999_v13_ppm.log` - exact v13 PPM score logs
- `fresh_seed42_v13_submit.log` - fresh end-to-end v13 seed-42 rerun with the submitted defaults
- `fresh_seed314_v13_submit.log` - fresh end-to-end v13 seed-314 rerun with the submitted defaults
- `fresh_seed999_v13_submit.log` - fresh end-to-end v13 seed-999 rerun with the submitted defaults
- `submission.json` - leaderboard metadata
- `LEGALITY_AUDIT.md` - compliance audit
- `REFERENCES.md` - public PR and component lineage notes
- `requirements.txt` - Python package/runtime notes
# References and lineage

This submission builds on public Parameter Golf ideas rather than claiming a new standalone architecture. The list below is intentionally explicit so reviewers can separate inherited code/ideas from our final v13 changes.

## Base model and tokenizer lineage

- PR #1394 by @clarkkev: SP8192, GPTQ embeddings, depth recurrence, MuonEq-R, and related compact GPT training lineage.
- PR #1493 by @bigbag: SP8192 plus 3-layer recurrence, parallel residuals, QK gain 5.25, and the stronger recurrent base used by later SP8192 submissions.
- PR #1855 by @codemath3000: SP8192 plus LQER, sparse attention gate, BOS-fixed SmearGate, and the greedy hyperparameter stack that many late submissions build from.

## PPM / eval-time scoring lineage

- PR #1795 by @OE-GOD: strict-legal causal byte-level PPM adaptive-lambda mixer.
- PR #1959 by @remg1997: SP8192 plus byte-PPM mixer, bridging the PPM idea onto the later SP8192 neural stack.
- PR #1991 by @joshuaswanson: SP8192 + byte-PPM tuned order/gate, `0.94290` three-seed mean. v13 keeps the same core PPM direction and retunes the final gate to `H=0.999`, `L=0.18`, `T=0.80`.
- PR #1145 by @AnirudhRahul and PR #1967 by @ndokutovich: online n-gram tilt / scoring overlay ideas present in the code path, although the submitted score is from the PPM evaluator with TTT disabled.

## SmearGate and leakage fix lineage

- modded-nanogpt @classiclarryd: SmearGate idea referenced in the code comments.
- PR #1667 by @MarioPaerle: SmearGate + attention output gate integration into Parameter Golf.
- PR #1797 by @dexhunter: base audited for the packed-document SmearGate cross-boundary issue.
- PR #2014 by @simonbissonnette: public write-up of the BOS masking fix. v13 includes the BOS mask in both normal forward and TTT forward paths.

## Compression and quantization lineage

- PR #1586 by @dexhunter: per-layer adaptive GPTQ clip / int7 embeddings / MLR direction, referenced by the per-group compression lineage in this code.
- PR #1667 by @MarioPaerle and PR #1729 by @romeerp: per-group `lrzip` / grouped serialization lineage used for the submitted under-cap artifacts.
- PR #1530 by @samacqua: varlen attention, fused MLP, doc-independent TTT, and LQER-related lineage.
- PR #1886 by @renqianluo: fused softcap CE and WD stability notes reflected in comments/hyperparameters.
- PR #1923 by @jorge-asenjo: asymmetric logit rescale and AWQ-lite lineage.
- PR #1344 by @Omrigotlieb: Polar Express Newton-Schulz coefficients used in the optimizer path.

## Our changes

The main contribution here is the v13 consolidation, the sidecar-aware CaseOps evaluation packaging, and the final PPM gate retune:

```text
PPM_ORDER=5
PPM_H=0.999
PPM_L=0.18
PPM_T=0.80
TTT_ENABLED=0
```

Claude helped with late-stage experiment selection and write-up review. Codex handled implementation, audit, run coordination, packaging, and PR preparation.

Full seed-314 eval log (`eval_seed314_v13_ppm.log`):

W0501 00:00:12.171000 675063 torch/distributed/run.py:803]
W0501 00:00:12.171000 675063 torch/distributed/run.py:803] *****************************************
W0501 00:00:12.171000 675063 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0501 00:00:12.171000 675063 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
agree_add_boost: 0.5
artifact_dir: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011
attn_clip_sigmas: 13.0
attn_out_gate_enabled: False
attn_out_gate_src: proj
awq_lite_bits: 8
awq_lite_enabled: True
awq_lite_group_size: 64
awq_lite_group_top_k: 1
beta1: 0.9
beta2: 0.99
caseops_enabled: True
compressor: pergroup
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
distributed: True
ema_decay: 0.9965
embed_bits: 7
embed_clip_sigmas: 14.0
embed_lr: 0.6
embed_wd: 0.085
enable_looping_at: 0.35
eval_seq_len: 2048
eval_stride: 512
fused_ce_enabled: True
gate_window: 12
gated_attn_enabled: False
gated_attn_init_std: 0.01
gated_attn_quant_gate: True
global_ttt_batch_seqs: 32
global_ttt_chunk_tokens: 32768
global_ttt_epochs: 1
global_ttt_grad_clip: 1.0
global_ttt_lr: 0.001
global_ttt_momentum: 0.9
global_ttt_respect_doc_boundaries: True
global_ttt_warmup_chunks: 0
global_ttt_warmup_start_lr: 0.0
gptq_calibration_batches: 16
gptq_reserve_seconds: 0.5
grad_accum_steps: 1
grad_clip_norm: 0.3
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011/auto_v13_clean_best_s314_20260501_000011.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
lqer_asym_enabled: True
lqer_asym_group: 64
lqer_enabled: True
lqer_factor_bits: 4
lqer_gain_select: False
lqer_rank: 4
lqer_scope: all
lqer_top_k: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.026
max_wallclock_seconds: 600.0
min_lr: 0.1
mlp_clip_sigmas: 11.5
mlp_mult: 4.0
model_dim: 512
model_path: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011/final_model.pt
muon_backend_steps: 5
muon_momentum: 0.97
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
ngram_hint_precompute_outside: True
ngram_tilt_enabled: True
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_final_lane: mean
parallel_start_layer: 8
phased_ttt_num_phases: 3
phased_ttt_prefix_docs: 2500
ppm_dump_inputs: False
ppm_h: 0.999
ppm_l: 0.18
ppm_mixer_enabled: True
ppm_order: 5
ppm_t: 0.8
qk_gain_init: 5.25
quantized_model_path: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011/final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
rope_yarn: False
run_id: auto_v13_clean_best_s314_20260501_000011
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
smear_gate_enabled: True
sparse_attn_gate_enabled: True
sparse_attn_gate_init_std: 0.0
sparse_attn_gate_scale: 0.5
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
token_boost: 2.625
token_order: 16
token_threshold: 0.8
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_batch_size: 64
ttt_beta1: 0.0
ttt_beta2: 0.99
ttt_chunk_size: 48
ttt_enabled: False
ttt_eval_batches:
ttt_eval_seq_len: 2048
ttt_grad_steps: 1
ttt_k_lora: True
ttt_local_lr_mult: 0.75
ttt_lora_lr: 0.0001
ttt_lora_rank: 80
ttt_mask: no_qv
ttt_mlp_lora: True
ttt_o_lora: True
ttt_optimizer: adam
ttt_q_lora: False
ttt_train_window_tokens: 0
ttt_v_lora: False
ttt_weight_decay: 0.5
val_batch_tokens: 524288
val_bytes_files: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin
val_doc_fraction: 1.0
val_files: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.85
warmup_steps: 20
within_boost: 0.75
within_tau: 0.45
word_boost: 0.75
word_normalize: strip_punct_lower
word_order: 4
word_tau: 0.65
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 47851520
TTT_EVAL_ONLY=1 — skipping training + GPTQ, loading saved artifact for TTT eval
ttt_lora_alpha: 144.0
ttt_warm_start_a: True
ttt_weight_decay: 0.5
Deserialize: per-group lrzip decompression...
Deserialize: decompression done in 17.2s
beginning PPM sliding eval
ppm_mixer val_bpb:0.94180705 eval_time:454770ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36687117 val_bpb:0.94180705 eval_time:500300ms