@@ -0,0 +1,66 @@
# SP8192 + Order-6 Strict Full-Val Byte PPM

**val_bpb = 0.96255** (3-seed mean, std 0.00047) | **15.997 MB mean artifact** | 8xH100 SXM

This submission keeps the SP8192 recurrence / parallel-residual / QK-gain base stack and replaces the prior order-4 PPM setting with a strict full-validation order-6 byte-level PPM mixture at eval time. The PPM state is built online from the already-scored byte prefix and is updated only after each byte has been scored.

## Results

| Seed | Post-EMA BPB | PPM BPB | Artifact bytes | Eval time |
| --- | ---: | ---: | ---: | ---: |
| 42 | 1.08754884 | 0.96261595 | 15,996,904 | 474.016s |
| 7 | 1.08763287 | 0.96298648 | 15,999,992 | 464.055s |
| 1337 | 1.08663175 | 0.96205812 | 15,994,492 | 463.261s |
| **Mean** | **1.08727115** | **0.96255352** | **15,997,129** | **467.111s** |
| **Std** | **0.00055533** | **0.00046732** | **2,757** | **5.993s** |

The best seed is 1337 at `0.96205812` BPB. The largest observed total submission size is `15,999,992` bytes (seed 7), still under the 16,000,000-byte cap.

## Method

The eval path first computes the standard sliding-window neural-network NLLs with stride 64. It then converts the scored token stream into per-byte contributions and mixes the NN byte probability with an order-6 byte-level PPM-D probability:

`p_mix = (1 - lambda) * p_nn + lambda * p_ppm`

The gate is binary and prefix-only: when the PPM's longest-context top-symbol confidence is at least `PPM_CONF_THRESHOLD = 0.9`, the byte is scored with `lambda = PPM_LAMBDA_HI = 0.9` and PPM dominates; otherwise `lambda = PPM_LAMBDA_LO = 0.05` and the NN dominates (a sketch follows the settings table).

| Setting | Value |
| --- | ---: |
| `PPM_ORDER` | `6` |
| `PPM_LAMBDA_HI` | `0.9` |
| `PPM_LAMBDA_LO` | `0.05` |
| `PPM_CONF_THRESHOLD` | `0.9` |
| `PPM_LOG_CACHE_SIZE` | `1048576` |
| `SKIP_QUANTIZED_EVAL` | `1` |
| `SLIDING_BATCH_SEQS` | `32` |
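
A minimal sketch of the gate and mixture under these settings, in Python rather than the record's native C scorer. The PPM interface (`top_conf`, `prob`) is hypothetical (one possible implementation is sketched below), and the per-byte NN probability `p_nn` is taken as given:

```python
import math

# Hypothetical sketch of the binary gate and probability-space mixture.
# Defaults mirror the submitted settings; the PPM interface is assumed,
# not the record's actual API.
def mixed_log2p(p_nn: float, history: bytes, b: int, ppm,
                lam_hi: float = 0.9, lam_lo: float = 0.05,
                threshold: float = 0.9) -> float:
    conf = ppm.top_conf(history)                   # prefix-only confidence
    lam = lam_hi if conf >= threshold else lam_lo  # binary gate on PPM weight
    p_mix = (1.0 - lam) * p_nn + lam * ppm.prob(history, b)
    return math.log2(p_mix)
```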

Order 6 was selected after full-validation checks: orders 7 and 8 were both slower and worse on seed 42, so they are not part of the submitted result.
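
For concreteness, a minimal order-N byte PPM with method-D escapes (each count discounted by 1/2; escape mass q/2 over n, where q is the number of distinct symbols seen in the context). This is a hypothetical Python illustration, not the submitted native scorer; symbol exclusions are omitted, which leaves the distribution slightly sub-normalized:

```python
from collections import defaultdict

class TinyPPMD:
    """Order-N byte PPM-D sketch: per-context counts, method-D discounting.

    Hypothetical illustration only; exclusions and the record's log cache
    are omitted.
    """
    def __init__(self, order: int = 6):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))  # ctx -> byte -> count

    def prob(self, history: bytes, b: int) -> float:
        p_escape = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if not table:
                continue                              # unseen context: no escape cost
            n = sum(table.values())
            c = table.get(b, 0)
            if c:
                return p_escape * (2 * c - 1) / (2 * n)  # method D: discount 1/2
            p_escape *= len(table) / (2 * n)          # escape mass = (q/2)/n
        return p_escape / 256.0                       # order -1: uniform over bytes

    def top_conf(self, history: bytes) -> float:
        # top-symbol confidence in the longest context that has any counts
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get(history[len(history) - k:])
            if table:
                return max(table.values()) / sum(table.values())
        return 0.0

    def update(self, history: bytes, b: int) -> None:
        # strictly AFTER scoring: bump counts for context lengths 0..order
        for k in range(min(self.order, len(history)) + 1):
            self.counts[history[len(history) - k:]][b] += 1
```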

## Compliance

- Causal scoring: both NN scoring and PPM scoring use only the prefix available before the current byte.
- Score before update: PPM counts are updated only after the byte's mixed log-probability has been recorded (see the driver sketch after this list).
- Single pass: validation bytes are scored once in order; there is no rescoring or best-of-run selection.
- Normalized distribution: PPM-D produces a valid byte distribution and the mixture is performed in probability space.
- Full validation: submitted scores use the full validation stream, not a subset.
- No SLOT, no TTT, no ETLB, and no n-gram cache in the submitted packed artifact.
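
Combining the two sketches above into a hypothetical single-pass driver that makes the score-before-update ordering explicit. `nn_byte_probs` stands in for the per-byte NN probabilities; the record does not document the token-to-byte conversion, so it is taken as given here:

```python
def score_full_val(val_bytes: bytes, nn_byte_probs) -> float:
    """One ordered pass over the validation bytes; returns mixture BPB."""
    ppm = TinyPPMD(order=6)
    bits = 0.0
    for i, b in enumerate(val_bytes):
        history = val_bytes[max(0, i - 64):i]                    # ample for order 6
        bits -= mixed_log2p(nn_byte_probs[i], history, b, ppm)   # score first...
        ppm.update(history, b)                                   # ...then update
    return bits / len(val_bytes)
```

Because `ppm.update` runs only after the byte's mixed log-probability is accumulated, every byte is predicted from its strict prefix, matching the causal-scoring and single-pass claims above.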

## Reproduce

```bash
RUN_ID=strict_ppm_order6_seed42 \
SEED=42 \
PPM_ENABLED=1 \
PPM_NATIVE_ENABLED=1 \
PPM_ORDER=6 \
PPM_LAMBDA_HI=0.9 \
PPM_LAMBDA_LO=0.05 \
PPM_CONF_THRESHOLD=0.9 \
PPM_LOG_CACHE_SIZE=1048576 \
SKIP_QUANTIZED_EVAL=1 \
SLIDING_BATCH_SEQS=32 \
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-04-27_SP8192_Order6StrictBytePPM/train_gpt.py
```

Change `SEED` and `RUN_ID` to reproduce the other two logs.
@@ -0,0 +1,62 @@
{
  "author": "someone114514",
  "github_id": "someone114514",
  "name": "SP8192 + Order-6 Strict Full-Val Byte PPM",
  "date": "2026-04-27",
  "track": "10min_16mb",
  "val_bpb": 0.96255352,
  "val_bpb_std": 0.00046732,
  "seeds": [42, 7, 1337],
  "seed_results": {
    "42": {
      "val_bpb": 0.96261595,
      "pre_quant_bpb": 1.08754884,
      "artifact_bytes": 15996904,
      "train_time_s": 588.147,
      "eval_time_s": 474.016
    },
    "7": {
      "val_bpb": 0.96298648,
      "pre_quant_bpb": 1.08763287,
      "artifact_bytes": 15999992,
      "train_time_s": 588.102,
      "eval_time_s": 464.055
    },
    "1337": {
      "val_bpb": 0.96205812,
      "pre_quant_bpb": 1.08663175,
      "artifact_bytes": 15994492,
      "train_time_s": 588.135,
      "eval_time_s": 463.261
    }
  },
  "hardware": "8xH100 80GB SXM",
  "base_record": "2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT",
  "technique_summary": "SP8192 base stack plus strict full-validation order-6 byte-level PPM-D mixture in eval_val_sliding, prefix-only confidence gate, score-before-update byte counts, native C scorer, and compact raw per-rank collection.",
  "new_controls": {
    "PPM_ENABLED": 1,
    "PPM_NATIVE_ENABLED": 1,
    "PPM_ORDER": 6,
    "PPM_LAMBDA_HI": 0.9,
    "PPM_LAMBDA_LO": 0.05,
    "PPM_CONF_THRESHOLD": 0.9,
    "PPM_LOG_CACHE_SIZE": 1048576,
    "SKIP_QUANTIZED_EVAL": 1,
    "SLIDING_BATCH_SEQS": 32
  },
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": true,
    "eval_under_600s": true,
    "no_slot": true,
    "no_ttt_in_packed_artifact": true,
    "no_pre_quant_ttt": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "score_first_byte_ppm": true,
    "prefix_only_gate": true,
    "full_val_required_for_claimed_score": true,
    "three_seeds": true
  },
  "review_notes": "The submitted score is a strict online byte-level PPM mixture over full validation bytes. Logs include nn_token_bpb, nn_byte_bpb, ppm_only, mix_bpb, gate_high_frac, artifact size, and eval time for auditability."
}

@@ -0,0 +1,207 @@
====================================================================================================
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/strict_ppm_order6_seed1337.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
ppm_conf_threshold: 0.9
ppm_debug_subset_tokens: 0
ppm_enabled: True
ppm_lambda_hi: 0.9
ppm_lambda_lo: 0.05
ppm_log_cache_size: 1048576
ppm_native_enabled: True
ppm_order: 6
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: strict_ppm_order6_seed1337
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
skip_quantized_eval: True
sliding_batch_seqs: 32
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
====================================================================================================
Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
Running PyTorch 2.9.1+cu128
Mon Apr 27 22:39:04 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 35C P0 117W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 116W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 31C P0 115W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 34C P0 119W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 35C P0 121W / 700W | 1521MiB / 81559MiB | 3% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 32C P0 114W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 34C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 30C P0 117W / 700W | 1521MiB / 81559MiB | 6% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

====================================================================================================
train_shards: 80
val_tokens: 40540160
model_params:35944536
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0047 val_bpb: 3.4860
1/20000 train_loss: 9.0060 train_time: 0.0m tok/s: 8319096
2/20000 train_loss: 12.2716 train_time: 0.0m tok/s: 8200076
3/20000 train_loss: 10.8879 train_time: 0.0m tok/s: 8100521
4/20000 train_loss: 9.3861 train_time: 0.0m tok/s: 8054164
5/20000 train_loss: 8.2582 train_time: 0.0m tok/s: 8015962
500/20000 train_loss: 3.3812 train_time: 0.8m tok/s: 7746256
1000/20000 train_loss: 3.2822 train_time: 1.7m tok/s: 7732426
1500/20000 train_loss: 3.1813 train_time: 2.5m tok/s: 7734682
2000/20000 train_loss: 3.0711 train_time: 3.4m tok/s: 7737149
layer_loop:enabled step:2025 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.1182 train_time: 4.6m tok/s: 7106922
3000/20000 train_loss: 2.8975 train_time: 5.9m tok/s: 6718371
3500/20000 train_loss: 2.9427 train_time: 7.1m tok/s: 6466400
4000/20000 train_loss: 2.8185 train_time: 8.3m tok/s: 6285148
4000/20000 val_loss: 2.8749 val_bpb: 1.1130
4500/20000 train_loss: 2.8427 train_time: 9.6m tok/s: 6156105
4589/20000 val_loss: 2.8100 val_bpb: 1.0878
stopping_early: wallclock_cap train_time: 588135ms step: 4589/20000
peak memory allocated: 39046 MiB reserved: 39070 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80688266 val_bpb:1.08663175 eval_time:7014ms
Serialized model: 135431033 bytes
Code size: 21432 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.8s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15973060 bytes
Total submission size quantized+brotli: 15994492 bytes
quantized:skipped by SKIP_QUANTIZED_EVAL=1
sliding_collect:start total_windows=633409 my_windows=79176 tokens=5069248 rank=0
sliding_collect:rank_local_done rank=0 tokens=5069248 first=0 last=5069247 seconds=93.6
sliding_collect:gather_done tokens=40540160 wait=1.0s total=94.6s
ppm_native:start tokens=40540160
ppm_full_native tokens=40540160 bytes=151078222 mix_bpb=0.96205812 ppm_only=2.13183272 nn_byte_bpb=1.07994528 nn_token_bpb=1.07994528 gate_high_frac=0.232357 order=6 lambda_hi=0.9 lambda_lo=0.05 threshold=0.9 log_cache=1048576
ppm_time:368.6s native=True full_val=True scored_tokens=40540160
quantized_sliding_window val_loss:2.48509604 val_bpb:0.96205812 eval_time:463261ms