# Legality audit

## Track constraints

- Training is capped at 600 seconds on 8xH100. The source artifacts stopped at `599.546s`, `599.583s`, and `599.657s` (a minimal cap-check sketch follows this list).
- Evaluation is capped at 600 seconds. The final PPM evals took `510.410s`, `500.300s`, and `497.643s`.
- The artifact cap is decimal `16,000,000` bytes. The largest quantized artifact is `15,946,930` bytes; with the current checked-in compressed code wrapper and no local minifier, the largest measured total is `15,995,881` bytes.
- The submitted score uses `TTT_ENABLED=0`; no validation-set gradient update is part of the score.
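
The `stopping_early: wallclock_cap` lines in the attached logs are consistent with a simple monotonic-clock check each step; a minimal sketch, assuming the loop shape (`max_wallclock_seconds` and the 0.5s reserve mirror names in the hyperparameter dump at the end of this PR):

```python
import time

def train_capped(step_fn, max_wallclock_seconds=600.0, reserve_seconds=0.5, iterations=20000):
    start = time.monotonic()
    for step in range(1, iterations + 1):
        step_fn(step)
        train_time = time.monotonic() - start
        # Stop just under the cap, leaving headroom for quantization/serialization.
        if train_time > max_wallclock_seconds - reserve_seconds:
            print(f"stopping_early: wallclock_cap train_time: {int(train_time * 1000)}ms "
                  f"step: {step}/{iterations}")
            break
```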

## PPM causality

The PPM mixer scores each byte from prefix counts and updates the count after scoring the current byte. The gate is computed from already-observed context statistics before incorporating the current target byte.
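
A minimal sketch of that score-then-update ordering, with a plain smoothed count table standing in for the real order-5 mixer (names are hypothetical):

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))  # context -> next-byte counts
totals = defaultdict(int)                       # context -> total observations

def score_and_update(context: bytes, target: int, alpha: float = 1.0) -> float:
    # 1) Score the target byte using only statistics from already-seen bytes.
    p = (counts[context][target] + alpha) / (totals[context] + 256 * alpha)
    # 2) Only after scoring, fold the target byte into the counts.
    counts[context][target] += 1
    totals[context] += 1
    return p
```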

The byte sidecar is used for BPB accounting and byte-stream scoring alignment. It is not a learned table of validation answers and it is not updated from future bytes.
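
Under that accounting, `val_bpb` reduces to total negative log2-probability over total sidecar bytes; a minimal sketch, assuming a list of per-byte probabilities such as the ones produced above:

```python
import math

def bits_per_byte(byte_probs: list[float]) -> float:
    # byte_probs[i] = P(byte i | causal prefix); no future byte is consulted.
    return sum(-math.log2(p) for p in byte_probs) / len(byte_probs)
```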

## Packed document leakage

SmearGate's one-position forward mixing (each position receives a gated copy of the previous position's activation) is masked at BOS positions, so activations never smear across a packed-document boundary:

```python
# Positions whose own token is BOS must not receive activations smeared
# in from the previous (different) document in the packed batch.
not_bos = (input_ids[:, 1:] != BOS_ID).to(x.dtype).unsqueeze(-1)
# Mix a gated copy of position t-1 into each position t >= 1,
# zeroed wherever position t starts a new document.
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1] * not_bos], dim=1)
```

The same mask is present in both the normal forward path and the TTT forward path.

## Compression and dependencies

The artifact uses per-group `lrzip` compression for grouped int6 tensors and Brotli for the remainder/code wrapper. `lrzip` must be installed in the runtime image before training. The script shells out to an already-installed binary; it does not download packages during evaluation.
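
A minimal sketch of that shell-out pattern (helper names and paths are hypothetical; `-q`, `-f`, `-o`, and `-d` are standard `lrzip` flags):

```python
import shutil
import subprocess

def lrzip_compress(src_path: str, dst_path: str) -> None:
    # Fail fast if the binary is absent; nothing is downloaded at eval time.
    if shutil.which("lrzip") is None:
        raise RuntimeError("lrzip must be preinstalled in the runtime image")
    subprocess.run(["lrzip", "-q", "-f", "-o", dst_path, src_path], check=True)

def lrzip_decompress(src_path: str, dst_path: str) -> None:
    subprocess.run(["lrzip", "-d", "-q", "-f", "-o", dst_path, src_path], check=True)
```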

## Known review surface

This submission inherits the same review surface as the public SP8192 + byte PPM lane:

- custom SP8192 CaseOps tokenizer/data preparation
- per-token byte sidecar used for exact BPB accounting
- causal PPM eval-time adaptation

The v13-specific final change is only the PPM gate retune to `H=0.999`, `L=0.18`, `T=0.80`.
# SP8192 CaseOps v13 PPM tuned gate

This submission consolidates our strongest v13 lane: the SP8192 CaseOps transformer stack with SmearGate BOS masking, per-group `lrzip` compression, and a causal sidecar-aware byte PPM evaluator.

The final score is backed by three fresh end-to-end v13 reruns with the submitted defaults:

```text
PPM_ORDER=5
PPM_H=0.999
PPM_L=0.18
PPM_T=0.80
TTT_ENABLED=0
```

Thanks to Claude for the late-stage experiment design help and to Codex for implementation, audit, run coordination, and packaging. This stack also builds on public Parameter Golf work by @clarkkev, @bigbag, @codemath3000, @OE-GOD, @remg1997, @joshuaswanson, @MarioPaerle, @classiclarryd, @simonbissonnette, @dexhunter, @romeerp, @samacqua, @renqianluo, @jorge-asenjo, @Omrigotlieb, @AnirudhRahul, and @ndokutovich. See `REFERENCES.md` for the component lineage and PR numbers.

## Score

| Seed | Final `ppm_sliding val_bpb` | Artifact bytes | Training stop | Eval time |
|---:|---:|---:|---:|---:|
| 42 | `0.94182660` | `15,987,305` | `4773` steps / `599.686s` | `507.652s` |
| 314 | `0.94146034` | `15,983,753` | `4770` steps / `599.628s` | `516.897s` |
| 999 | `0.94197117` | `15,988,348` | `4772` steps / `599.644s` | `519.029s` |

Three-seed mean:

```text
0.94175270
```

Sample standard deviation:

```text
0.00026331
```
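
Both summary statistics follow directly from the three per-seed scores (`statistics.stdev` is the sample standard deviation, with n-1 in the denominator):

```python
import statistics

scores = [0.94182660, 0.94146034, 0.94197117]  # seeds 42, 314, 999
print(f"{statistics.mean(scores):.8f}")   # 0.94175270
print(f"{statistics.stdev(scores):.8f}")  # 0.00026331
```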

All three fresh artifacts remain under the strict decimal `16,000,000` byte cap. The largest fresh measured artifact plus compressed code wrapper is `15,988,348` bytes.

## What changed

Relative to the previous SP8192 + byte-PPM tuned-gate line, v13 combines:

- CaseOps SP8192 tokenization and byte sidecar accounting for correct `val_bpb` normalization.
- SmearGate with the BOS cross-document leak mask applied in both normal forward and TTT forward paths.
- Per-group `lrzip` compression for banked int6 tensors, with Brotli for the remainder/code wrapper.
- PPM order 5 with the final gate retune `H=0.999`, `L=0.18`, `T=0.80` (a hypothetical gate sketch follows this list).
- TTT disabled for the submitted score, so the validation pass is a single causal PPM scoring pass over the quantized artifact.
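
The gate semantics are not spelled out in this README; purely as a hypothetical illustration of one common adaptive-lambda shape (reading `H` as a high mixing weight, `L` as a low one, and `T` as a confidence threshold on the PPM context), the retuned constants would act like this:

```python
def ppm_mix_weight(ppm_confidence: float,
                   H: float = 0.999, L: float = 0.18, T: float = 0.80) -> float:
    # HYPOTHETICAL reading: weight on the PPM distribution vs. the neural model.
    return H if ppm_confidence >= T else L

def mixed_prob(p_ppm: float, p_model: float, ppm_confidence: float) -> float:
    lam = ppm_mix_weight(ppm_confidence)
    return lam * p_ppm + (1.0 - lam) * p_model
```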

## Lineage and attribution

This is not a from-scratch model. The code is a consolidation of several public Parameter Golf ideas:

- SP8192 tokenizer, recurrence, QK gain, and compact GPT training lineage from PR #1394, PR #1493, and PR #1855.
- Causal byte-PPM mixer lineage from PR #1795, PR #1959, and PR #1991.
- SmearGate / attention output gate lineage from modded-nanogpt @classiclarryd and PR #1667, plus the BOS cross-document leak fix discussed in PR #2014 / the PR #1797 base audit.
- Per-group `lrzip` compression lineage from PR #1586 through PR #1667 / PR #1729-style grouped serialization work.
- LQER/AWQ/asymmetric-rescale and related quantization/optimization pieces from PR #1530, PR #1797, PR #1886, PR #1923, and PR #1855.
- Online n-gram tilt / scoring overlay ideas from PR #1145 and PR #1967, though the submitted score uses the PPM path rather than TTT.

Our specific contribution in this PR is the v13 consolidation, the CaseOps sidecar-aware evaluation packaging, and the final PPM gate retune to `H=0.999`, `L=0.18`, `T=0.80` over the same seed set.

The checked-in script sets the final PPM gate as defaults, so a fresh run follows the same configuration without external environment overrides.
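
As a sketch of what "defaults in the script" means here, using hyperparameter names that appear verbatim in the attached run log (the actual argparse wiring in `train_gpt.py` is assumed):

```python
import argparse

parser = argparse.ArgumentParser()
# Final v13 gate shipped as defaults, so a bare run reproduces the scored path.
parser.add_argument("--ppm_order", type=int, default=5)
parser.add_argument("--ppm_h", type=float, default=0.999)
parser.add_argument("--ppm_l", type=float, default=0.18)
parser.add_argument("--ppm_t", type=float, default=0.80)
parser.add_argument("--ttt_enabled", action="store_true", default=False)
```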

## Evidence notes

The included `fresh_seed*_v13_submit.log` files are full fresh end-to-end runs with the submitted PPM defaults in `train_gpt.py`. The older `train_seed*.log` and paired `eval_seed*_v13_ppm.log` files are retained as lineage/eval-retune evidence, but the headline score in the table above uses the cleaner fresh rerun set.

```text
seed 42:
stopping_early: wallclock_cap train_time: 599686ms step: 4773/20000
Total submission size quantized+pergroup: 15987305 bytes
diagnostic quantized val_loss:2.35586432 val_bpb:1.07646816 eval_time:10407ms
ppm_mixer val_bpb:0.94182660 eval_time:462353ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36677335 val_bpb:0.94182660 eval_time:507652ms

seed 314:
stopping_early: wallclock_cap train_time: 599628ms step: 4770/20000
Total submission size quantized+pergroup: 15983753 bytes
diagnostic quantized val_loss:2.35632034 val_bpb:1.07667653 eval_time:9243ms
ppm_mixer val_bpb:0.94146034 eval_time:471320ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36627199 val_bpb:0.94146034 eval_time:516897ms

seed 999:
stopping_early: wallclock_cap train_time: 599644ms step: 4772/20000
Total submission size quantized+pergroup: 15988348 bytes
diagnostic quantized val_loss:2.35838976 val_bpb:1.07762211 eval_time:8788ms
ppm_mixer val_bpb:0.94197117 eval_time:473888ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36682950 val_bpb:0.94197117 eval_time:519029ms
```

The earlier eval-only three-seed mean was `0.94174862`; the fresh end-to-end mean is `0.94175270`. The difference is only `0.00000408` bpb, and the fresh set is the cleaner evidence for review.

## Exact final lines

Seed 42:

```text
ppm_mixer val_bpb:0.94151072 eval_time:464892ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36642906 val_bpb:0.94151072 eval_time:510410ms
fresh ppm_mixer val_bpb:0.94182660 eval_time:462353ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
fresh ppm_sliding val_loss:2.36677335 val_bpb:0.94182660 eval_time:507652ms
```

Seed 314:

```text
ppm_mixer val_bpb:0.94180705 eval_time:454770ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36687117 val_bpb:0.94180705 eval_time:500300ms
fresh ppm_mixer val_bpb:0.94146034 eval_time:471320ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
fresh ppm_sliding val_loss:2.36627199 val_bpb:0.94146034 eval_time:516897ms
```

Seed 999:

```text
ppm_mixer val_bpb:0.94192810 eval_time:452193ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36740764 val_bpb:0.94192810 eval_time:497643ms
fresh ppm_mixer val_bpb:0.94197117 eval_time:473888ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
fresh ppm_sliding val_loss:2.36682950 val_bpb:0.94197117 eval_time:519029ms
```

## Included files

- `train_gpt.py` - exact submitted script, with v13 PPM defaults set to `0.999/0.18/0.80`
- `train_seed42.log`, `train_seed314.log`, `train_seed999.log` - source training logs for the three artifacts
- `eval_seed42_v13_ppm.log`, `eval_seed314_v13_ppm.log`, `eval_seed999_v13_ppm.log` - exact v13 PPM score logs
- `fresh_seed42_v13_submit.log` - fresh end-to-end v13 seed-42 rerun with the submitted defaults
- `fresh_seed314_v13_submit.log` - fresh end-to-end v13 seed-314 rerun with the submitted defaults
- `fresh_seed999_v13_submit.log` - fresh end-to-end v13 seed-999 rerun with the submitted defaults
- `submission.json` - leaderboard metadata
- `LEGALITY_AUDIT.md` - compliance audit
- `REFERENCES.md` - public PR and component lineage notes
- `requirements.txt` - Python package/runtime notes
# References and lineage

This submission builds on public Parameter Golf ideas rather than claiming a new standalone architecture. The list below is intentionally explicit so reviewers can separate inherited code/ideas from our final v13 changes.

## Base model and tokenizer lineage

- PR #1394 by @clarkkev: SP8192, GPTQ embeddings, depth recurrence, MuonEq-R, and related compact GPT training lineage.
- PR #1493 by @bigbag: SP8192 plus 3-layer recurrence, parallel residuals, QK gain 5.25, and the stronger recurrent base used by later SP8192 submissions.
- PR #1855 by @codemath3000: SP8192 plus LQER, sparse attention gate, BOS-fixed SmearGate, and the greedy hyperparameter stack that many late submissions build from.

## PPM / eval-time scoring lineage

- PR #1795 by @OE-GOD: strict-legal causal byte-level PPM adaptive-lambda mixer.
- PR #1959 by @remg1997: SP8192 plus byte-PPM mixer, bridging the PPM idea onto the later SP8192 neural stack.
- PR #1991 by @joshuaswanson: SP8192 + byte-PPM tuned order/gate, `0.94290` three-seed mean. v13 keeps the same core PPM direction and retunes the final gate to `H=0.999`, `L=0.18`, `T=0.80`.
- PR #1145 by @AnirudhRahul and PR #1967 by @ndokutovich: online n-gram tilt / scoring overlay ideas present in the code path, although the submitted score is from the PPM evaluator with TTT disabled.

## SmearGate and leakage fix lineage

- modded-nanogpt @classiclarryd: SmearGate idea referenced in the code comments.
- PR #1667 by @MarioPaerle: SmearGate + attention output gate integration into Parameter Golf.
- PR #1797 by @dexhunter: base audited for the packed-document SmearGate cross-boundary issue.
- PR #2014 by @simonbissonnette: public write-up of the BOS masking fix. v13 includes the BOS mask in both normal forward and TTT forward paths.

## Compression and quantization lineage

- PR #1586 by @dexhunter: per-layer adaptive GPTQ clip / int7 embeddings / MLR direction, referenced by the per-group compression lineage in this code.
- PR #1667 by @MarioPaerle and PR #1729 by @romeerp: per-group `lrzip` / grouped serialization lineage used for the submitted under-cap artifacts.
- PR #1530 by @samacqua: varlen attention, fused MLP, doc-independent TTT, and LQER-related lineage.
- PR #1886 by @renqianluo: fused softcap CE and WD stability notes reflected in comments/hyperparameters.
- PR #1923 by @jorge-asenjo: asymmetric logit rescale and AWQ-lite lineage.
- PR #1344 by @Omrigotlieb: Polar Express Newton-Schulz coefficients used in the optimizer path.

## Our changes

The main contribution here is the v13 consolidation, the sidecar-aware CaseOps evaluation packaging, and the final PPM gate retune:

```text
PPM_ORDER=5
PPM_H=0.999
PPM_L=0.18
PPM_T=0.80
TTT_ENABLED=0
```

Claude helped with late-stage experiment selection and write-up review. Codex handled implementation, audit, run coordination, packaging, and PR preparation.

Full seed-314 eval log (`eval_seed314_v13_ppm.log`):

W0501 00:00:12.171000 675063 torch/distributed/run.py:803]
W0501 00:00:12.171000 675063 torch/distributed/run.py:803] *****************************************
W0501 00:00:12.171000 675063 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0501 00:00:12.171000 675063 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
agree_add_boost: 0.5
artifact_dir: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011
attn_clip_sigmas: 13.0
attn_out_gate_enabled: False
attn_out_gate_src: proj
awq_lite_bits: 8
awq_lite_enabled: True
awq_lite_group_size: 64
awq_lite_group_top_k: 1
beta1: 0.9
beta2: 0.99
caseops_enabled: True
compressor: pergroup
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved
distributed: True
ema_decay: 0.9965
embed_bits: 7
embed_clip_sigmas: 14.0
embed_lr: 0.6
embed_wd: 0.085
enable_looping_at: 0.35
eval_seq_len: 2048
eval_stride: 512
fused_ce_enabled: True
gate_window: 12
gated_attn_enabled: False
gated_attn_init_std: 0.01
gated_attn_quant_gate: True
global_ttt_batch_seqs: 32
global_ttt_chunk_tokens: 32768
global_ttt_epochs: 1
global_ttt_grad_clip: 1.0
global_ttt_lr: 0.001
global_ttt_momentum: 0.9
global_ttt_respect_doc_boundaries: True
global_ttt_warmup_chunks: 0
global_ttt_warmup_start_lr: 0.0
gptq_calibration_batches: 16
gptq_reserve_seconds: 0.5
grad_accum_steps: 1
grad_clip_norm: 0.3
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011/auto_v13_clean_best_s314_20260501_000011.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
lqer_asym_enabled: True
lqer_asym_group: 64
lqer_enabled: True
lqer_factor_bits: 4
lqer_gain_select: False
lqer_rank: 4
lqer_scope: all
lqer_top_k: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.026
max_wallclock_seconds: 600.0
min_lr: 0.1
mlp_clip_sigmas: 11.5
mlp_mult: 4.0
model_dim: 512
model_path: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011/final_model.pt
muon_backend_steps: 5
muon_momentum: 0.97
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
ngram_hint_precompute_outside: True
ngram_tilt_enabled: True
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_final_lane: mean
parallel_start_layer: 8
phased_ttt_num_phases: 3
phased_ttt_prefix_docs: 2500
ppm_dump_inputs: False
ppm_h: 0.999
ppm_l: 0.18
ppm_mixer_enabled: True
ppm_order: 5
ppm_t: 0.8
qk_gain_init: 5.25
quantized_model_path: /workspace/parameter-golf/our_submission/1000/runs/auto_v13_clean_best_s314_20260501_000011/final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
rope_yarn: False
run_id: auto_v13_clean_best_s314_20260501_000011
scalar_lr: 0.02
seed: 314
skip_gates_enabled: True
smear_gate_enabled: True
sparse_attn_gate_enabled: True
sparse_attn_gate_init_std: 0.0
sparse_attn_gate_scale: 0.5
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
token_boost: 2.625
token_order: 16
token_threshold: 0.8
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_batch_size: 64
ttt_beta1: 0.0
ttt_beta2: 0.99
ttt_chunk_size: 48
ttt_enabled: False
ttt_eval_batches:
ttt_eval_seq_len: 2048
ttt_grad_steps: 1
ttt_k_lora: True
ttt_local_lr_mult: 0.75
ttt_lora_lr: 0.0001
ttt_lora_rank: 80
ttt_mask: no_qv
ttt_mlp_lora: True
ttt_o_lora: True
ttt_optimizer: adam
ttt_q_lora: False
ttt_train_window_tokens: 0
ttt_v_lora: False
ttt_weight_decay: 0.5
val_batch_tokens: 524288
val_bytes_files: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_bytes_*.bin
val_doc_fraction: 1.0
val_files: ./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.85
warmup_steps: 20
within_boost: 0.75
within_tau: 0.45
word_boost: 0.75
word_normalize: strip_punct_lower
word_order: 4
word_tau: 0.65
world_size: 8
xsa_last_n: 11
train_shards: 80
val_tokens: 47851520
TTT_EVAL_ONLY=1 — skipping training + GPTQ, loading saved artifact for TTT eval
ttt_lora_alpha: 144.0
ttt_warm_start_a: True
ttt_weight_decay: 0.5
Deserialize: per-group lrzip decompression...
Deserialize: decompression done in 17.2s
beginning PPM sliding eval
ppm_mixer val_bpb:0.94180705 eval_time:454770ms order=5 H=0.999 L=0.18 T=0.8 N_tokens=47851520 N_sidecar_bytes=151074499
ppm_sliding val_loss:2.36687117 val_bpb:0.94180705 eval_time:500300ms