117 changes: 117 additions & 0 deletions records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/README.md
@@ -0,0 +1,117 @@
# Record: SP8192 + Strict Full-Val Byte PPM Mixture

**val_bpb = 1.0049** (3-seed mean, std 0.0007) | **~15.995 MB** | 8xH100 SXM

This submission starts from the merged 2026-04-09 SP8192 + 3-layer recurrence + parallel residuals + QK-gain 5.25 base stack, removes eval-time TTT from the packed script, and adds a strict full-validation byte-level PPM mixture in `eval_val_sliding`.

## 3-Seed Results

| Seed | Post-EMA BPB | **PPM BPB** | Artifact (bytes) |
|------|-------------:|------------:|-----------------:|
| 42 | 1.0871 | **1.0049** | 15,997,433 |
| 7 | 1.0875 | **1.0057** | 15,995,226 |
| 1337 | 1.0863 | **1.0043** | 15,993,603 |
| **Mean** | **1.0869** | **1.0049** | **15,995,421** |
| **Std** | | **0.0007** | |

Compared to the 2026-04-09 base record's legal TTT mean of 1.0810 BPB, this strict PPM mixture improves val_bpb by **0.0761**. The plain NN scores reported by the scorer (`nn_token_bpb`) remain around 1.0795-1.0812; the gain comes from the online byte-level PPM mixture.

## Key Techniques

1. **SP8192 base stack** - inherits the merged SP8192 + GPTQ SDClip + 3-layer recurrence + parallel residuals + QK-gain 5.25 architecture and training recipe.
2. **Strict full-val byte PPM** - reconstructs the byte stream from already-scored target tokens and SentencePiece byte LUTs, then scores every byte with a prefix-only PPM model (see the byte-reconstruction sketch after this list).
3. **Prefix-only binary gate** - chooses the NN/PPM mixture lambda from context confidence before observing the current byte, avoiding target-conditioned gate selection.
4. **Score-before-update byte order** - every byte is scored from previous bytes only, then inserted into the PPM tables for future bytes.
5. **Native C scorer** - runtime-compiled open-addressed context tables, rolling context keys, inline byte counts, fixed order-0 counts, cached integer logs, and precomputed lambda logs.
6. **Compact sliding collection** - per-rank raw token/NLL files in `/tmp`; rank 0 gathers and runs the strict sequential PPM scorer. No full-length GPU position buffers or large NCCL all-reduces.
7. **Eval-time controls for budget** - `SKIP_QUANTIZED_EVAL=1`, `SLIDING_BATCH_SEQS=32`, and `PPM_LOG_CACHE_SIZE=1048576` keep full eval under 600s.
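
A minimal Python sketch of the byte reconstruction and NLL spreading behind technique 2, assuming the packaged SentencePiece model at the path below; the record itself uses precomputed byte LUTs and the native C path, and the helper names here are illustrative:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/tokenizers/fineweb_8192_bpe.model")

def token_bytes(token_id: int) -> bytes:
    """UTF-8 bytes emitted by one token; the leading-space marker contributes a real space byte first."""
    piece = sp.id_to_piece(token_id)
    return piece.replace("\u2581", " ").encode("utf-8")

def spread_token_nll(token_id: int, token_nll_nats: float) -> list[tuple[int, float]]:
    """Spread the NN token NLL uniformly over the token's emitted bytes."""
    bs = token_bytes(token_id)
    return [(b, token_nll_nats / max(len(bs), 1)) for b in bs]
```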

## Architecture and Training

The neural base is unchanged from the 2026-04-09 SP8192 record: 11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, partial RoPE, layerwise LN scale, tied embeddings, logit softcap=30, depth recurrence over layers 3-5, and parallel residuals from layer 7.

Training uses the inherited MuonEq-R/AdamW recipe, EMA 0.9965, WD 0.095, matrix LR 0.022, warmdown 0.72, and a 588s effective train cap (`MAX_WALLCLOCK_SECONDS=600`, `GPTQ_RESERVE_SECONDS=12`).
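
The effective cap is simply the wallclock budget minus the GPTQ reserve (600s - 12s = 588s, matching the `stopping_early: wallclock_cap` lines in the logs). A minimal sketch of the check, with illustrative variable names:

```python
import time

MAX_WALLCLOCK_SECONDS = 600.0
GPTQ_RESERVE_SECONDS = 12.0
effective_cap_s = MAX_WALLCLOCK_SECONDS - GPTQ_RESERVE_SECONDS  # 588.0

train_start = time.monotonic()
for step in range(1, 20_001):
    # ... one optimizer step ...
    if time.monotonic() - train_start >= effective_cap_s:
        print(f"stopping_early: wallclock_cap step: {step}/20000")
        break
```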

## Quantization

Full-Hessian GPTQ with SDClip is unchanged from the base stack: int6 attention/MLP matrices, int8 token embeddings, float16 scalar/gating parameters, byte-shuffle, and Brotli-11 compression. The packed script is trimmed to 21.4KB by removing TTT and the Python PPM reference from the final artifact.
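
Byte-shuffle groups the k-th byte of every quantized value together before Brotli-11, which typically compresses the int6/int8 planes better than interleaved bytes. A minimal sketch, assuming the weights are serialized as fixed-width NumPy arrays (the packed script's actual layout may differ):

```python
import brotli
import numpy as np

def byte_shuffle(arr: np.ndarray) -> bytes:
    """Reorder so all first bytes come first, then all second bytes, etc."""
    flat = np.ascontiguousarray(arr).reshape(-1)
    return flat.view(np.uint8).reshape(-1, flat.dtype.itemsize).T.tobytes()

def pack(arr: np.ndarray) -> bytes:
    return brotli.compress(byte_shuffle(arr), quality=11)
```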

## Strict PPM Evaluation

For each scored target token, the scorer spreads the NN token NLL uniformly over that token's emitted bytes. If a SentencePiece leading-space marker should contribute an actual space byte, that byte is scored first. For every byte (a sketch of this loop follows the list):

1. Build context keys from previous bytes only.
2. Score the byte with PPM-D style escape probabilities.
3. Compute context confidence as `max_count / (total + unique)` at the deepest available prefix context.
4. Use `lambda_lo` if confidence is high, otherwise `lambda_hi`.
5. Mix normalized NN byte probability and PPM byte probability.
6. Update byte counts after scoring.
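
A compact Python sketch of this per-byte loop. The record runs the native C scorer; the exact PPM-D escape form, the absence of exclusion handling, the reading of lambda as the NN-side weight, and the helper names below are assumptions for illustration:

```python
import math

ORDER, LAMBDA_HI, LAMBDA_LO, CONF_THRESHOLD = 4, 0.9, 0.05, 0.9
counts: dict[tuple, dict[int, int]] = {}  # context bytes -> next byte -> count

def ppm_prob(history: list[int], byte: int) -> float:
    """PPM-D style: back off from the longest prefix context, paying an escape each level."""
    escape = 1.0
    for k in range(min(ORDER, len(history)), 0, -1):
        ctx = counts.get(tuple(history[-k:]))
        if not ctx:
            continue
        total = sum(ctx.values())
        if byte in ctx:
            return escape * (2 * ctx[byte] - 1) / (2 * total)
        escape *= len(ctx) / (2 * total)
    return escape / 256.0  # uniform fallback (the real scorer keeps fixed order-0 counts)

def confidence(history: list[int]) -> float:
    """max_count / (total + unique) at the deepest non-empty prefix context."""
    for k in range(min(ORDER, len(history)), 0, -1):
        ctx = counts.get(tuple(history[-k:]))
        if ctx:
            return max(ctx.values()) / (sum(ctx.values()) + len(ctx))
    return 0.0

def score_byte(history: list[int], byte: int, nn_byte_nll: float) -> float:
    p_nn = math.exp(-nn_byte_nll)  # per-byte NN probability from the spread token NLL
    # Prefix-only gate: chosen before observing `byte`; lam weights the NN side (assumption).
    lam = LAMBDA_LO if confidence(history) >= CONF_THRESHOLD else LAMBDA_HI
    mixed_nll = -math.log(lam * p_nn + (1.0 - lam) * ppm_prob(history, byte))
    for k in range(min(ORDER, len(history)), 0, -1):  # update counts only after scoring
        ctx = counts.setdefault(tuple(history[-k:]), {})
        ctx[byte] = ctx.get(byte, 0) + 1
    return mixed_nll
```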

Default parameters used for the record logs:

| Env var | Value |
|---|---:|
| `PPM_ORDER` | `4` |
| `PPM_LAMBDA_HI` | `0.9` |
| `PPM_LAMBDA_LO` | `0.05` |
| `PPM_CONF_THRESHOLD` | `0.9` |
| `PPM_LOG_CACHE_SIZE` | `1048576` |
| `SKIP_QUANTIZED_EVAL` | `1` |
| `SLIDING_BATCH_SEQS` | `32` |

## Compliance

Per the Issue #1017-style eval-time constraints:

- **Causality:** The neural model is evaluated by causal sliding windows. The byte PPM table only contains previous bytes at the time each byte is scored.
- **Score before update:** PPM counts are updated only after the current byte's mixed log-probability is recorded.
- **Full validation:** Formal logs use all 40,540,160 scored target tokens / 151,078,222 bytes. Debug subsets are non-scoring.
- **Single scoring path:** The returned `quantized_sliding_window` BPB is the PPM mixture score for the full stream; there is no post-hoc best-of selection.
- **No SLOT.**
- **No TTT in the packed artifact.**
- **No pre-quant validation adaptation.**
- **No ETLB/logit bias.**
- **No n-gram cache or precomputed validation cache.**
- **Artifact under 16,000,000 bytes on all three seeds.**
- **Training under 600s on all three seeds.**
- **PPM eval under 600s on all three seeds.**

Review note: the reported score is a byte-level online mixture over the full validation byte stream rather than a pure token-level NN score. The logs report `nn_token_bpb`, `nn_byte_bpb`, `ppm_only`, and `mix_bpb` for auditability.
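
For reference, every `*_bpb` value is the summed NLL in nats divided by `ln(2)` times the byte count. A quick check against the seed-1337 log, assuming the logged `val_loss` is the mean per-token NLL in nats (names illustrative):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    return total_nll_nats / (math.log(2) * num_bytes)

# seed 1337: 2.59409066 nats/token * 40,540,160 tokens over 151,078,222 bytes
print(bits_per_byte(2.59409066 * 40_540_160, 151_078_222))  # ≈ 1.00425, matching the logged 1.00425333
```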

## Reproduction

```bash
python3 -m pip install brotli sentencepiece
# install the same flash-attention package used by the base SP8192 records if missing
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

RUN_ID=strict_ppm_trim_seed42_8gpu_order4_b32 \
SEED=42 \
PPM_ENABLED=1 \
PPM_NATIVE_ENABLED=1 \
PPM_ORDER=4 \
PPM_LAMBDA_HI=0.9 \
PPM_LAMBDA_LO=0.05 \
PPM_CONF_THRESHOLD=0.9 \
PPM_LOG_CACHE_SIZE=1048576 \
SKIP_QUANTIZED_EVAL=1 \
SLIDING_BATCH_SEQS=32 \
torchrun --standalone --nproc_per_node=8 \
records/track_10min_16mb/2026-04-26_SP8192_StrictFullValPPM/train_gpt.py
```

Change `SEED` and `RUN_ID` for seeds 7 and 1337.

## Credits

- Base stack: merged 2026-04-09 SP8192 + 3-layer recurrence + parallel residuals + QK-gain 5.25 + legal TTT record and credited lineage.
- PPM idea lineage: PR #1835 / #1795 discussion. This version changes the implementation to strict full-val scoring, prefix-only gating, native sequential scoring, and no subset claim.

## Included Files

- `README.md`
- `submission.json`
- `train_gpt.py`
- `train_seed42.log`
- `train_seed7.log`
- `train_seed1337.log`
@@ -0,0 +1,70 @@
{
"author": "someone114514",
"github_id": "someone114514",
"name": "SP8192 + Strict Full-Val Byte PPM Mixture",
"date": "2026-04-26",
"track": "10min_16mb",
"val_bpb": 1.00495,
"val_bpb_std": 0.00072,
"seeds": [42, 7, 1337],
"seed_results": {
"42": {
"val_bpb": 1.00489563,
"pre_quant_bpb": 1.08711004,
"artifact_bytes": 15997433,
"train_time_s": 588.065,
"eval_time_s": 393.717
},
"7": {
"val_bpb": 1.00569239,
"pre_quant_bpb": 1.08750246,
"artifact_bytes": 15995226,
"train_time_s": 588.142,
"eval_time_s": 343.606
},
"1337": {
"val_bpb": 1.00425333,
"pre_quant_bpb": 1.08627037,
"artifact_bytes": 15993603,
"train_time_s": 588.086,
"eval_time_s": 346.457
}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"base_record": "2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT",
"base_val_bpb_ttt": 1.08100,
"base_val_bpb_sliding": 1.08270,
"technique_summary": "SP8192 base stack + strict full-validation byte-level PPM-D mixture in eval_val_sliding, prefix-only confidence gate, score-before-update byte counts, native C scorer, compact raw per-rank collection, and no TTT in the packed artifact.",
"new_controls": {
"PPM_ENABLED": "0 by default; set to 1 to enable strict PPM mixture",
"PPM_ORDER": 4,
"PPM_LAMBDA_HI": 0.9,
"PPM_LAMBDA_LO": 0.05,
"PPM_CONF_THRESHOLD": 0.9,
"PPM_DEBUG_SUBSET_TOKENS": "0 for formal full-val scoring; positive values are debug-only and must not be submitted",
"PPM_NATIVE_ENABLED": "1 by default; runtime-compiled C scorer",
"PPM_LOG_CACHE_SIZE": 1048576,
"SKIP_QUANTIZED_EVAL": "1 in submitted logs; skips plain quantized eval",
"SLIDING_BATCH_SEQS": 32
},
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": true,
"no_slot": true,
"no_ttt_in_packed_artifact": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": true,
"score_first_byte_ppm": true,
"prefix_only_gate": true,
"full_val_required_for_claimed_score": true,
"three_seeds": true
},
"attribution": {
"base_stack": "2026-04-09 SP8192 + 3-layer recurrence + parallel residuals + QK-gain 5.25 + legal TTT record and credited lineage",
"ppm_idea": "PR #1835 / #1795 discussion; this implementation uses strict full-val scoring and prefix-only gating"
},
"review_notes": "The submitted score is a byte-level online PPM mixture over full validation bytes. Logs include nn_token_bpb, nn_byte_bpb, ppm_only, and mix_bpb for auditability."
}

@@ -0,0 +1,207 @@
====================================================================================================
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp8192
distributed: True
ema_decay: 0.9965
embed_bits: 8
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/strict_ppm_trim_seed1337_8gpu_order4_b32.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
ppm_conf_threshold: 0.9
ppm_debug_subset_tokens: 0
ppm_enabled: True
ppm_lambda_hi: 0.9
ppm_lambda_lo: 0.05
ppm_log_cache_size: 1048576
ppm_native_enabled: True
ppm_order: 4
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: strict_ppm_trim_seed1337_8gpu_order4_b32
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
skip_quantized_eval: True
sliding_batch_seqs: 32
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp8192/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp8192/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 8192
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
====================================================================================================
Running Python 3.12.3 (main, Nov 6 2025, 13:44:16) [GCC 13.3.0]
Running PyTorch 2.9.1+cu128
Mon Apr 27 05:17:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 34C P0 113W / 700W | 1521MiB / 81559MiB | 7% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 33C P0 118W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 30C P0 115W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 34C P0 120W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 36C P0 114W / 700W | 1521MiB / 81559MiB | 2% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 33C P0 116W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 35C P0 117W / 700W | 1521MiB / 81559MiB | 6% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 31C P0 114W / 700W | 1521MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

====================================================================================================
train_shards: 80
val_tokens: 40540160
model_params:35944536
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.0047 val_bpb: 3.4860
1/20000 train_loss: 9.0060 train_time: 0.0m tok/s: 8286060
2/20000 train_loss: 12.2716 train_time: 0.0m tok/s: 8145622
3/20000 train_loss: 10.8878 train_time: 0.0m tok/s: 8073352
4/20000 train_loss: 9.3860 train_time: 0.0m tok/s: 8036703
5/20000 train_loss: 8.2581 train_time: 0.0m tok/s: 7964242
500/20000 train_loss: 3.3773 train_time: 0.8m tok/s: 7805622
1000/20000 train_loss: 3.2821 train_time: 1.7m tok/s: 7797723
1500/20000 train_loss: 3.1849 train_time: 2.5m tok/s: 7796501
2000/20000 train_loss: 3.0673 train_time: 3.4m tok/s: 7798612
layer_loop:enabled step:2041 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.1184 train_time: 4.6m tok/s: 7180229
3000/20000 train_loss: 2.8974 train_time: 5.8m tok/s: 6782255
3500/20000 train_loss: 2.9430 train_time: 7.0m tok/s: 6524969
4000/20000 train_loss: 2.8233 train_time: 8.3m tok/s: 6337005
4000/20000 val_loss: 2.8774 val_bpb: 1.1139
4500/20000 train_loss: 2.8405 train_time: 9.5m tok/s: 6205031
4620/20000 val_loss: 2.8091 val_bpb: 1.0875
stopping_early: wallclock_cap train_time: 588086ms step: 4620/20000
peak memory allocated: 39046 MiB reserved: 39070 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.80594917 val_bpb:1.08627037 eval_time:6829ms
Serialized model: 135431033 bytes
Code size: 21432 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.7s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int8): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15972171 bytes
Total submission size quantized+brotli: 15993603 bytes
quantized:skipped by SKIP_QUANTIZED_EVAL=1
sliding_collect:start total_windows=633409 my_windows=79176 tokens=5069248 rank=0
sliding_collect:rank_local_done rank=0 tokens=5069248 first=0 last=5069247 seconds=93.6
sliding_collect:gather_done tokens=40540160 wait=0.6s total=94.2s
ppm_native:start tokens=40540160
ppm_full_native tokens=40540160 bytes=151078222 mix_bpb=1.00425333 ppm_only=2.22566472 nn_byte_bpb=1.07951474 nn_token_bpb=1.07951474 gate_high_frac=0.154305 order=4 lambda_hi=0.9 lambda_lo=0.05 threshold=0.9 log_cache=1048576
ppm_time:252.2s native=True full_val=True scored_tokens=40540160
quantized_sliding_window val_loss:2.59409066 val_bpb:1.00425333 eval_time:346457ms