openai · Kbediako · Apr 30, 2026 · May 1, 2026
diff --git a/records/track_non_record_16mb/2026-04-30_MHCLite_AttnSink_Local5080/README.md b/records/track_non_record_16mb/2026-04-30_MHCLite_AttnSink_Local5080/README.md
@@ -0,0 +1,110 @@
+# Non-Record: mHC-Lite Residual Mixing + Attention Sink Probe
+
+**local val_bpb: 1.4246** (3-seed mean on local RTX 5080 smoke train shard + full validation) | **13.36 MB max artifact** | **non-record submission**
+
+This is a non-record submission. It is not an official leaderboard claim: the bundled scores were produced locally on one RTX 5080 using the `sp1024` smoke setup with one train shard and the full validation split. Two 8xH100 Runpod attempts were made on the final day, but both pods exited before producing a validation score.
+
+## Results
+
+Local command family:
+
+```bash
+SEED=<seed> \
+MAX_WALLCLOCK_SECONDS=598 \
+DATA_PATH=./data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+MHC_LITE_ENABLED=1 \
+MHC_RESID_INIT_LOGIT=4.0 \
+MHC_SKIP_MODE=none \
+GRAD_CLIP_NORM=1.0625 \
+LOGIT_SOFTCAP=15 \
+QK_GAIN_INIT=3 \
+WARMDOWN_ITERS=150 \
+MUON_MOMENTUM=0.97 \
+MUON_MOMENTUM_WARMUP_START=0.92 \
+MUON_MOMENTUM_WARMUP_STEPS=1500 \
+LATE_GATE_RAMP_ENABLED=0 \
+MTP_AUX_ENABLED=0 \
+ATTN_SINK_ENABLED=1 \
+ATTN_SINK_INIT=-6.5 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py
+```
+
+| Seed | Local val_bpb | Train time | Steps | Eval time | Artifact bytes |
+|---:|---:|---:|---:|---:|---:|
+| 42 | 1.42542064 | 598.341s | 516 | 40.017s | 13,332,695 |
+| 1337 | 1.42551199 | 598.587s | 518 | 38.697s | 13,335,887 |
+| 2025 | 1.42293649 | 598.504s | 522 | 40.256s | 13,363,475 |
+| **Mean** | **1.42462304** | | | | |
+
+For context, the same local configuration with the default 600-second cap reached a lower 3-seed mean of 1.42271793. Those runs slightly exceeded the strict 600-second train cap on the local machine (`600.675s` to `600.707s`), so the safer 598-second numbers above are the submitted local evidence.
+
+## What Changed
+
+This keeps the root `sp1024` transformer scaffold and adds two optional clean-neural mechanisms, both disabled unless the environment variables above are set, plus a Muon momentum schedule setting.
+
+1. **mHC-lite residual mixing**
+   - Reparameterizes each block's residual/input mixing weights through a softmax.
+   - Initializes the residual lane strongly with logits `[+4, -4]`.
+   - Leaves the default behavior unchanged when `MHC_LITE_ENABLED=0`.
+
+2. **Attention sink**
+   - Adds a learned per-head sigmoid scale for the first value vector in the current causal sequence.
+   - The sink stays causal: the extra term at each position comes only from position 0 of the same left-to-right context.
+   - The best local initialization in the sweep was `ATTN_SINK_INIT=-6.5`.
+
+3. **Muon schedule**
+   - Uses `MUON_MOMENTUM=0.97` with warmup from `0.92`.
+   - Local probes at `0.95` and `0.99` were worse on seed 42.
+
+Local probes that did not help included late gate ramping, a shared-head MTP auxiliary loss, nearby attention-sink initializations, and nearby mHC initialization strengths.
+
+## Compliance Notes
+
+- No test-time training.
+- No casefolding, lossy normalization, PPM, byte sidecar, validation cache, or future-token cache.
+- No network access or external binaries are used during evaluation.
+- Full local validation split is evaluated.
+- Artifact accounting is `train_gpt.py` bytes plus compressed int8+zlib model bytes.
+- Max local counted artifact in the safe three-seed set is 13,363,475 bytes, below the decimal 16,000,000 byte cap. The bundled script is 105 bytes larger than the logged training script because the final package includes the guarded `nvidia-smi` logging fix, so the table adds 105 bytes to the logged totals.
+
+## 8xH100 Attempts
+
+Two Runpod 8xH100 SXM attempts were made after local triage:
+
+- `runpod_sinkm65_cap598_seed42_20260430T062654Z`: pod exited during compile before warmup completed.
+- `runpod_sinkm65_cap300_seed42_20260430T063624Z`: recovery pod reached training step 250 (`train_loss=2.7375`) and then exited before validation/export.
+
+Both attempts are logged in `runs/flywheel_manifest.jsonl` in the working tree. They produced no official `val_bpb`, so this folder is intentionally placed under `track_non_record_16mb`.
+
+## Reproduction
+
+From this folder:
+
+```bash
+python -m py_compile train_gpt.py
+
+SEED=42 \
+MAX_WALLCLOCK_SECONDS=598 \
+DATA_PATH=../../../data/datasets/fineweb10B_sp1024 \
+TOKENIZER_PATH=../../../data/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+MHC_LITE_ENABLED=1 \
+MHC_RESID_INIT_LOGIT=4.0 \
+MHC_SKIP_MODE=none \
+GRAD_CLIP_NORM=1.0625 \
+LOGIT_SOFTCAP=15 \
+QK_GAIN_INIT=3 \
+WARMDOWN_ITERS=150 \
+MUON_MOMENTUM=0.97 \
+MUON_MOMENTUM_WARMUP_START=0.92 \
+MUON_MOMENTUM_WARMUP_STEPS=1500 \
+LATE_GATE_RAMP_ENABLED=0 \
+MTP_AUX_ENABLED=0 \
+ATTN_SINK_ENABLED=1 \
+ATTN_SINK_INIT=-6.5 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py
+```
+
+Use `--nproc_per_node=8` on a stable 8xH100 box for an official-style run. No accepted leaderboard score is claimed here.
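
A minimal sketch of what the two initializations described under "What Changed" imply numerically. It assumes, without confirmation from the README, that mHC-lite applies a two-way softmax over `[residual, input]` logits and that `ATTN_SINK_INIT` is the pre-sigmoid value of the per-head sink scale. The helper names are hypothetical and are not taken from `train_gpt.py`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

# Assumption: MHC_RESID_INIT_LOGIT=4.0 yields logits [+4, -4] for [residual, input].
resid_weight, input_weight = softmax([4.0, -4.0])

# Assumption: ATTN_SINK_INIT=-6.5 is the pre-sigmoid value of the per-head sink scale.
sink_scale = sigmoid(-6.5)

print(f"residual weight at init: {resid_weight:.6f}")
print(f"input weight at init:    {input_weight:.6f}")
print(f"sink scale at init:      {sink_scale:.6f}")
```

Under these assumptions both mechanisms start close to the baseline scaffold: the residual lane receives roughly 0.9997 of the mixing weight, and the sink contributes roughly 0.0015 of the first value vector, leaving training to move either one away from its near-identity start.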
diff --git a/records/track_non_record_16mb/2026-04-30_MHCLite_AttnSink_Local5080/requirements.txt b/records/track_non_record_16mb/2026-04-30_MHCLite_AttnSink_Local5080/requirements.txt
@@ -0,0 +1,3 @@
+numpy
+torch
+sentencepiece
diff --git a/records/track_non_record_16mb/2026-04-30_MHCLite_AttnSink_Local5080/submission.json b/records/track_non_record_16mb/2026-04-30_MHCLite_AttnSink_Local5080/submission.json
@@ -0,0 +1,67 @@
+{
+  "name": "mHC-lite Residual Mixing + Attention Sink Local Probe",
+  "track": "track_non_record_16mb",
+  "author": "Kbediako",
+  "github_id": "Kbediako",
+  "date": "2026-04-30",
+  "val_bpb": 1.42462304,
+  "val_bpb_std": 0.00145948,
+  "score_scope": "local RTX 5080 smoke train shard plus full validation; not an official leaderboard score",
+  "official_8xh100_val_bpb": null,
+  "bytes_total": 13363475,
+  "hardware": "1x NVIDIA RTX 5080 local WSL",
+  "seeds": [
+    42,
+    1337,
+    2025
+  ],
+  "seed_results": {
+    "42": {
+      "val_bpb": 1.42542064,
+      "val_loss_nats": 2.40676305,
+      "train_wallclock_s": 598.341,
+      "eval_wallclock_s": 40.017,
+      "train_steps": 516,
+      "artifact_bytes": 13332695
+    },
+    "1337": {
+      "val_bpb": 1.42551199,
+      "val_loss_nats": 2.40691728,
+      "train_wallclock_s": 598.587,
+      "eval_wallclock_s": 38.697,
+      "train_steps": 518,
+      "artifact_bytes": 13335887
+    },
+    "2025": {
+      "val_bpb": 1.42293649,
+      "val_loss_nats": 2.40256866,
+      "train_wallclock_s": 598.504,
+      "eval_wallclock_s": 40.256,
+      "train_steps": 522,
+      "artifact_bytes": 13363475
+    }
+  },
+  "blurb": "Clean-neural local probe adding optional mHC-lite softmax residual mixing and a causal per-head attention sink to the root sp1024 scaffold. No TTT, casefold, PPM, byte sidecar, external binary, or network dependency during evaluation. Two 8xH100 Runpod attempts exited before producing a score, so this is submitted as a non-record local-evidence package.",
+  "compliance": {
+    "record_claim": false,
+    "artifact_under_16mb": true,
+    "train_under_600s_local": true,
+    "eval_under_600s_local": true,
+    "official_8xh100_verified": false,
+    "no_validation_training": true,
+    "no_ttt": true,
+    "no_casefold": true,
+    "no_ppm": true,
+    "self_contained_eval": true
+  },
+  "runpod_attempts": [
+    {
+      "run_id": "runpod_sinkm65_cap598_seed42_20260430T062654Z",
+      "status": "pod exited during compile before score"
+    },
+    {
+      "run_id": "runpod_sinkm65_cap300_seed42_20260430T063624Z",
+      "status": "pod exited after training step 250 before score"
+    }
+  ]
+}
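
The summary fields in `submission.json` can be cross-checked against the per-seed entries. The sketch below copies the reported numbers by hand rather than reading the file. The relation `val_bpb = val_loss_nats / ln 2 * tokens_per_byte` is an assumption about how the score is derived, not something this package states.

```python
import math

# Per-seed values copied from submission.json: (val_bpb, val_loss_nats, artifact_bytes).
seed_results = {
    42:   (1.42542064, 2.40676305, 13_332_695),
    1337: (1.42551199, 2.40691728, 13_335_887),
    2025: (1.42293649, 2.40256866, 13_363_475),
}
ARTIFACT_CAP_BYTES = 16_000_000  # decimal cap cited in the README

bpbs = [bpb for bpb, _, _ in seed_results.values()]
mean_bpb = sum(bpbs) / len(bpbs)
max_artifact = max(size for _, _, size in seed_results.values())

# Assumed relation: bpb = (nats per token / ln 2) * (tokens per byte).
# If it holds, the implied tokens-per-byte should agree across seeds,
# since every seed is evaluated on the same validation split.
tokens_per_byte = [bpb * math.log(2) / nats for bpb, nats, _ in seed_results.values()]

print(f"mean val_bpb: {mean_bpb:.8f}")
print(f"max artifact: {max_artifact} bytes, headroom {ARTIFACT_CAP_BYTES - max_artifact} bytes")
print("implied tokens per byte:", [round(t, 6) for t in tokens_per_byte])
```

The computed mean agrees with the reported `val_bpb` of 1.42462304, and the largest artifact leaves 2,636,525 bytes of headroom under the cap. The implied tokens-per-byte ratio comes out the same for all three seeds to within rounding, which is consistent with the assumed relation but does not prove it.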