117 changes: 117 additions & 0 deletions records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/README.md
@@ -0,0 +1,117 @@
# Record: SP8192 + Strict Full-Val Byte PPM Mixture

**val_bpb = 1.0049** (3-seed mean, std 0.0007) | **~15.995 MB** | 8xH100 SXM

This submission starts from the merged 2026-04-09 SP8192 + 3-layer recurrence + parallel residuals + QK-gain 5.25 base stack, removes eval-time TTT from the packed script, and adds a strict full-validation byte-level PPM mixture in `eval_val_sliding`.

## 3-Seed Results

| Seed | Post-EMA BPB | **PPM BPB** | Artifact (bytes) |
|------|-------------:|------------:|-----------------:|
| 42 | 1.0871 | **1.0049** | 15,997,433 |
| 7 | 1.0875 | **1.0057** | 15,995,226 |
| 1337 | 1.0863 | **1.0043** | 15,993,603 |
| **Mean** | **1.0869** | **1.0049** | **15,995,421** |
| **Std** | | **0.0007** | |

Compared to the 2026-04-09 base record's legal TTT mean of 1.0810 BPB, this strict PPM mixture improves val_bpb by **0.0761**. The plain NN scores reported by the scorer (`nn_token_bpb`) remain around 1.0795-1.0812; the gain comes from the online byte-level PPM mixture.

## Key Techniques

1. **SP8192 base stack** - inherits the merged SP8192 + GPTQ SDClip + 3-layer recurrence + parallel residuals + QK-gain 5.25 architecture and training recipe.
2. **Strict full-val byte PPM** - reconstructs the byte stream from already-scored target tokens and SentencePiece byte LUTs, then scores every byte with a prefix-only PPM model (see the byte-reconstruction sketch after this list).
3. **Prefix-only binary gate** - chooses the NN/PPM mixture lambda from context confidence before observing the current byte, avoiding target-conditioned gate selection.
4. **Score-before-update byte order** - every byte is scored from previous bytes only, then inserted into the PPM tables for future bytes.
5. **Native C scorer** - runtime-compiled open-addressed context tables, rolling context keys, inline byte counts, fixed order-0 counts, cached integer logs, and precomputed lambda logs.
6. **Compact sliding collection** - per-rank raw token/NLL files in `/tmp`; rank 0 gathers and runs the strict sequential PPM scorer. No full-length GPU position buffers or large NCCL all-reduces.
7. **Eval-time controls for budget** - `SKIP_QUANTIZED_EVAL=1`, `SLIDING_BATCH_SEQS=32`, and `PPM_LOG_CACHE_SIZE=1048576` keep full eval under 600s.
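
A minimal Python sketch of the byte reconstruction and NLL spreading behind technique 2, assuming the packaged SentencePiece model at the path below; the record itself uses precomputed byte LUTs and the native C path, and the helper names here are illustrative:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/tokenizers/fineweb_8192_bpe.model")

def token_bytes(token_id: int) -> bytes:
    """UTF-8 bytes emitted by one token; the leading-space marker contributes a real space byte first."""
    piece = sp.id_to_piece(token_id)
    return piece.replace("\u2581", " ").encode("utf-8")

def spread_token_nll(token_id: int, token_nll_nats: float) -> list[tuple[int, float]]:
    """Spread the NN token NLL uniformly over the token's emitted bytes."""
    bs = token_bytes(token_id)
    return [(b, token_nll_nats / max(len(bs), 1)) for b in bs]
```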

## Architecture and Training

The neural base is unchanged from the 2026-04-09 SP8192 record: 11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, partial RoPE, layerwise LN scale, tied embeddings, logit softcap=30, depth recurrence over layers 3-5, and parallel residuals from layer 7.

Training uses the inherited MuonEq-R/AdamW recipe, EMA 0.9965, WD 0.095, matrix LR 0.022, warmdown 0.72, and a 588s effective train cap (`MAX_WALLCLOCK_SECONDS=600`, `GPTQ_RESERVE_SECONDS=12`).
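
The effective cap is simply the wallclock budget minus the GPTQ reserve (600s - 12s = 588s, matching the `stopping_early: wallclock_cap` lines in the logs). A minimal sketch of the check, with illustrative variable names:

```python
import time

MAX_WALLCLOCK_SECONDS = 600.0
GPTQ_RESERVE_SECONDS = 12.0
effective_cap_s = MAX_WALLCLOCK_SECONDS - GPTQ_RESERVE_SECONDS  # 588.0

train_start = time.monotonic()
for step in range(1, 20_001):
    # ... one optimizer step ...
    if time.monotonic() - train_start >= effective_cap_s:
        print(f"stopping_early: wallclock_cap step: {step}/20000")
        break
```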

## Quantization

Full-Hessian GPTQ with SDClip is unchanged from the base stack: int6 attention/MLP matrices, int8 token embeddings, float16 scalar/gating parameters, byte-shuffle, and Brotli-11 compression. The packed script is trimmed to 21.4KB by removing TTT and the Python PPM reference from the final artifact.
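
Byte-shuffle groups the k-th byte of every quantized value together before Brotli-11, which typically compresses the int6/int8 planes better than interleaved bytes. A minimal sketch, assuming the weights are serialized as fixed-width NumPy arrays (the packed script's actual layout may differ):

```python
import brotli
import numpy as np

def byte_shuffle(arr: np.ndarray) -> bytes:
    """Reorder so all first bytes come first, then all second bytes, etc."""
    flat = np.ascontiguousarray(arr).reshape(-1)
    return flat.view(np.uint8).reshape(-1, flat.dtype.itemsize).T.tobytes()

def pack(arr: np.ndarray) -> bytes:
    return brotli.compress(byte_shuffle(arr), quality=11)
```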

## Strict PPM Evaluation

For each scored target token, the scorer spreads the NN token NLL uniformly over that token's emitted bytes. If a SentencePiece leading-space marker should contribute an actual space byte, that byte is scored first. For every byte (a sketch of this loop follows the list):

1. Build context keys from previous bytes only.
2. Score the byte with PPM-D style escape probabilities.
3. Compute context confidence as `max_count / (total + unique)` at the deepest available prefix context.
4. Use `lambda_lo` if confidence is high, otherwise `lambda_hi`.
5. Mix normalized NN byte probability and PPM byte probability.
6. Update byte counts after scoring.
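
A compact Python sketch of this per-byte loop. The record runs the native C scorer; the exact PPM-D escape form, the absence of exclusion handling, the reading of lambda as the NN-side weight, and the helper names below are assumptions for illustration:

```python
import math

ORDER, LAMBDA_HI, LAMBDA_LO, CONF_THRESHOLD = 4, 0.9, 0.05, 0.9
counts: dict[tuple, dict[int, int]] = {}  # context bytes -> next byte -> count

def ppm_prob(history: list[int], byte: int) -> float:
    """PPM-D style: back off from the longest prefix context, paying an escape each level."""
    escape = 1.0
    for k in range(min(ORDER, len(history)), 0, -1):
        ctx = counts.get(tuple(history[-k:]))
        if not ctx:
            continue
        total = sum(ctx.values())
        if byte in ctx:
            return escape * (2 * ctx[byte] - 1) / (2 * total)
        escape *= len(ctx) / (2 * total)
    return escape / 256.0  # uniform fallback (the real scorer keeps fixed order-0 counts)

def confidence(history: list[int]) -> float:
    """max_count / (total + unique) at the deepest non-empty prefix context."""
    for k in range(min(ORDER, len(history)), 0, -1):
        ctx = counts.get(tuple(history[-k:]))
        if ctx:
            return max(ctx.values()) / (sum(ctx.values()) + len(ctx))
    return 0.0

def score_byte(history: list[int], byte: int, nn_byte_nll: float) -> float:
    p_nn = math.exp(-nn_byte_nll)  # per-byte NN probability from the spread token NLL
    # Prefix-only gate: chosen before observing `byte`; lam weights the NN side (assumption).
    lam = LAMBDA_LO if confidence(history) >= CONF_THRESHOLD else LAMBDA_HI
    mixed_nll = -math.log(lam * p_nn + (1.0 - lam) * ppm_prob(history, byte))
    for k in range(min(ORDER, len(history)), 0, -1):  # update counts only after scoring
        ctx = counts.setdefault(tuple(history[-k:]), {})
        ctx[byte] = ctx.get(byte, 0) + 1
    return mixed_nll
```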

Default parameters used for the record logs:

| Env var | Value |
|---|---:|
| `PPM_ORDER` | `4` |
| `PPM_LAMBDA_HI` | `0.9` |
| `PPM_LAMBDA_LO` | `0.05` |
| `PPM_CONF_THRESHOLD` | `0.9` |
| `PPM_LOG_CACHE_SIZE` | `1048576` |
| `SKIP_QUANTIZED_EVAL` | `1` |
| `SLIDING_BATCH_SEQS` | `32` |

## Compliance

Per the Issue #1017-style eval-time constraints:

- **Causality:** The neural model is evaluated by causal sliding windows. The byte PPM table only contains previous bytes at the time each byte is scored.
- **Score before update:** PPM counts are updated only after the current byte's mixed log-probability is recorded.
- **Full validation:** Formal logs use all 40,540,160 scored target tokens / 151,078,222 bytes. Debug subsets are non-scoring.
- **Single scoring path:** The returned `quantized_sliding_window` BPB is the PPM mixture score for the full stream; there is no post-hoc best-of selection.
- **No SLOT.**
- **No TTT in the packed artifact.**
- **No pre-quant validation adaptation.**
- **No ETLB/logit bias.**
- **No n-gram cache or precomputed validation cache.**
- **Artifact under 16,000,000 bytes on all three seeds.**
- **Training under 600s on all three seeds.**
- **PPM eval under 600s on all three seeds.**

Review note: the reported score is a byte-level online mixture over the full validation byte stream rather than a pure token-level NN score. The logs report `nn_token_bpb`, `nn_byte_bpb`, `ppm_only`, and `mix_bpb` for auditability.
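
For reference, every `*_bpb` value is the summed NLL in nats divided by `ln(2)` times the byte count. A quick check against the seed-1337 log, assuming the logged `val_loss` is the mean per-token NLL in nats (names illustrative):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    return total_nll_nats / (math.log(2) * num_bytes)

# seed 1337: 2.59409066 nats/token * 40,540,160 tokens over 151,078,222 bytes
print(bits_per_byte(2.59409066 * 40_540_160, 151_078_222))  # ≈ 1.00425, matching the logged 1.00425333
```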

## Reproduction

```bash
python3 -m pip install brotli sentencepiece
# install the same flash-attention package used by the base SP8192 records if missing
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

RUN_ID=strict_ppm_trim_seed42_8gpu_order4_b32 \
SEED=42 \
PPM_ENABLED=1 \
PPM_NATIVE_ENABLED=1 \
PPM_ORDER=4 \
PPM_LAMBDA_HI=0.9 \
PPM_LAMBDA_LO=0.05 \
PPM_CONF_THRESHOLD=0.9 \
PPM_LOG_CACHE_SIZE=1048576 \
SKIP_QUANTIZED_EVAL=1 \
SLIDING_BATCH_SEQS=32 \
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_gpt.py
```

Change `SEED` and `RUN_ID` for seeds 7 and 1337.

## Credits

- Base stack: merged 2026-04-09 SP8192 + 3-layer recurrence + parallel residuals + QK-gain 5.25 + legal TTT record and credited lineage.
- PPM idea lineage: PR #1835 / #1795 discussion. This version changes the implementation to strict full-val scoring, prefix-only gating, native sequential scoring, and no subset claim.

## Included Files

- `README.md`
- `submission.json`
- `train_gpt.py`
- `train_seed42.log`
- `train_seed7.log`
- `train_seed1337.log`
@@ -0,0 +1,70 @@
{
"author": "someone114514",
"github_id": "someone114514",
"name": "SP8192 + Strict Full-Val Byte PPM Mixture",
"date": "2026-04-26",
"track": "10min_16mb",
"val_bpb": 1.00495,
"val_bpb_std": 0.00072,
"seeds": [42, 7, 1337],
"seed_results": {
"42": {
"val_bpb": 1.00489563,
"pre_quant_bpb": 1.08711004,
"artifact_bytes": 15997433,
"train_time_s": 588.065,
"eval_time_s": 393.717
},
"7": {
"val_bpb": 1.00569239,
"pre_quant_bpb": 1.08750246,
"artifact_bytes": 15995226,
"train_time_s": 588.142,
"eval_time_s": 343.606
},
"1337": {
"val_bpb": 1.00425333,
"pre_quant_bpb": 1.08627037,
"artifact_bytes": 15993603,
"train_time_s": 588.086,
"eval_time_s": 346.457
}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"base_record": "2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT",
"base_val_bpb_ttt": 1.08100,
"base_val_bpb_sliding": 1.08270,
"technique_summary": "SP8192 base stack + strict full-validation byte-level PPM-D mixture in eval_val_sliding, prefix-only confidence gate, score-before-update byte counts, native C scorer, compact raw per-rank collection, and no TTT in the packed artifact.",
"new_controls": {
"PPM_ENABLED": "0 by default; set to 1 to enable strict PPM mixture",
"PPM_ORDER": 4,
"PPM_LAMBDA_HI": 0.9,
"PPM_LAMBDA_LO": 0.05,
"PPM_CONF_THRESHOLD": 0.9,
"PPM_DEBUG_SUBSET_TOKENS": "0 for formal full-val scoring; positive values are debug-only and must not be submitted",
"PPM_NATIVE_ENABLED": "1 by default; runtime-compiled C scorer",
"PPM_LOG_CACHE_SIZE": 1048576,
"SKIP_QUANTIZED_EVAL": "1 in submitted logs; skips plain quantized eval",
"SLIDING_BATCH_SEQS": 32
},
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": true,
"no_slot": true,
"no_ttt_in_packed_artifact": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": true,
"score_first_byte_ppm": true,
"prefix_only_gate": true,
"full_val_required_for_claimed_score": true,
"three_seeds": true
},
"attribution": {
"base_stack": "2026-04-09 SP8192 + 3-layer recurrence + parallel residuals + QK-gain 5.25 + legal TTT record and credited lineage",
"ppm_idea": "PR #1835 / #1795 discussion; this implementation uses strict full-val scoring and prefix-only gating"
},
"review_notes": "The submitted score is a byte-level online PPM mixture over full validation bytes. Logs include nn_token_bpb, nn_byte_bpb, ppm_only, and mix_bpb for auditability."
}

@@ -0,0 +1,207 @@
====================================================================================================
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/strict_ppm_trim_seed1337_8gpu_order4_b32.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
ppm_conf_threshold: 0.9
ppm_debug_subset_tokens: 0
ppm_enabled: True
ppm_lambda_hi: 0.9
ppm_lambda_lo: 0.05
ppm_log_cache_size: 1048576
ppm_native_enabled: True
ppm_order: 4
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: strict_ppm_trim_seed1337_8gpu_order4_b32
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
skip_quantized_eval: True
sliding_batch_seqs: 32
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
====================================================================================================
Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
Running PyTorch 2.9.1+cu128
Mon Apr 27 05:17:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 34C P0 113W / 700W | 1521MiB / 81559MiB | 7% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 33C P0 118W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 30C P0 115W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 34C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 36C P0 114W / 700W | 1521MiB / 81559MiB | 2% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 33C P0 116W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 35C P0 117W / 700W | 1521MiB / 81559MiB | 6% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 31C P0 114W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

====================================================================================================
train_shards: 80
val_tokens: 40540160
model_params:35944536
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0047 val_bpb: 3.4860
1/20000 train_loss: 9.0060 train_time: 0.0m tok/s: 8286060
2/20000 train_loss: 12.2716 train_time: 0.0m tok/s: 8145622
3/20000 train_loss: 10.8878 train_time: 0.0m tok/s: 8073352
4/20000 train_loss: 9.3860 train_time: 0.0m tok/s: 8036703
5/20000 train_loss: 8.2581 train_time: 0.0m tok/s: 7964242
500/20000 train_loss: 3.3773 train_time: 0.8m tok/s: 7805622
1000/20000 train_loss: 3.2821 train_time: 1.7m tok/s: 7797723
1500/20000 train_loss: 3.1849 train_time: 2.5m tok/s: 7796501
2000/20000 train_loss: 3.0673 train_time: 3.4m tok/s: 7798612
layer_loop:enabled step:2041 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.1184 train_time: 4.6m tok/s: 7180229
3000/20000 train_loss: 2.8974 train_time: 5.8m tok/s: 6782255
3500/20000 train_loss: 2.9430 train_time: 7.0m tok/s: 6524969
4000/20000 train_loss: 2.8233 train_time: 8.3m tok/s: 6337005
4000/20000 val_loss: 2.8774 val_bpb: 1.1139
4500/20000 train_loss: 2.8405 train_time: 9.5m tok/s: 6205031
4620/20000 val_loss: 2.8091 val_bpb: 1.0875
stopping_early: wallclock_cap train_time: 588086ms step: 4620/20000
peak memory allocated: 39046 MiB reserved: 39070 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80594917 val_bpb:1.08627037 eval_time:6829ms
Serialized model: 135431033 bytes
Code size: 21432 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.7s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15972171 bytes
Total submission size quantized+brotli: 15993603 bytes
quantized:skipped by SKIP_QUANTIZED_EVAL=1
sliding_collect:start total_windows=633409 my_windows=79176 tokens=5069248 rank=0
sliding_collect:rank_local_done rank=0 tokens=5069248 first=0 last=5069247 seconds=93.6
sliding_collect:gather_done tokens=40540160 wait=0.6s total=94.2s
ppm_native:start tokens=40540160
ppm_full_native tokens=40540160 bytes=151078222 mix_bpb=1.00425333 ppm_only=2.22566472 nn_byte_bpb=1.07951474 nn_token_bpb=1.07951474 gate_high_frac=0.154305 order=4 lambda_hi=0.9 lambda_lo=0.05 threshold=0.9 log_cache=1048576
ppm_time:252.2s native=True full_val=True scored_tokens=40540160
quantized_sliding_window val_loss:2.59409066 val_bpb:1.00425333 eval_time:346457ms