openai · LauraGomezjurado · Apr 7, 2026 · Apr 7, 2026 · Apr 7, 2026 · Apr 11, 2026
diff --git a/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/README.md b/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/README.md
@@ -0,0 +1,111 @@
+# Notable Non-Record Submission: 1.0960 BPB — Muon + gated Krylov
+
+Muon with a small gated Krylov correction on square, nonnormal slices. Standard SentencePiece GPT path, AR self-generated Full-Hessian GPTQ, selective `±1` pruning, and sliding-window evaluation.
+
+**val_bpb: 1.09596320** (sliding, seed=`1337`) | **15,957,504 bytes** | **1xA100 80GB, 8h 52m**
+
+> This is a non-record submission. It fits under the `16,000,000` byte artifact cap, but it does not satisfy the challenge's main leaderboard wallclock requirement of `10 minutes on 8xH100 SXM`.
+
+## Results
+
+| Metric | Value |
+|--------|-------|
+| Sliding BPB | `1.09596320` |
+| Sliding val_loss | `1.85048306` |
+| Step-20000 val_bpb | `1.1166` |
+| Post-EMA val_bpb | `1.1156` |
+| Int6 roundtrip exact BPB | `1.11953265` |
+| Artifact bytes | `15,957,504` |
+| Compressed model bytes | `15,817,800` |
+| Code bytes | `139,704` |
+| Parameters | `26,993,756` |
+| Peak allocated VRAM | `29,336 MiB` |
+| Training time | `31,932,706 ms` (`8h 52m 12.706s`) |
+| Average step time | `1596.64 ms` |
+
+The exact training log for this run is [train_seed1337.log]. The current `train_gpt.py` in this folder includes a small CPU-import compatibility guard so the record imports cleanly during Python 3.10 / CPU smoke tests; that changes the code-byte count but does not affect the SentencePiece execution path used for the logged run.
+
+## Main Idea
+
+The optimizer stays in the Muon family. The change is not a replacement of Newton-Schulz with a different optimizer; it is a narrow correction path:
+
+1. Compute the standard Muon direction on the banked matrix weights.
+2. For square slices only, estimate nonnormality from the commutator `W^T W - W W^T`.
+3. Use a Hutchinson estimator to decide whether the slice is sufficiently nonnormal.
+4. Choose a small adaptive Krylov rank.
+5. Build a residual-direction correction and blend it back into the Muon direction with a small coefficient.
+
+In practice this worked best as a conservative hybrid. Muon remained the base geometry, and the Krylov branch only fired on a subset of slices.
+
+## Architecture
+
+| Component | Setting | First introduced by |
+|-----------|---------|---------------------|
+| Layers | 11 (512d, 8 GQA heads, 4 KV heads) | Baseline |
+| MLP | 3× (1536) with LeakyReLU(0.5)² | [#493](https://github.com/openai/parameter-golf/pull/493) @parinzee |
+| Attention | XSA on all 11 layers | [#478](https://github.com/openai/parameter-golf/pull/478) @gowtham0992 |
+| BigramHash | 3072 × dim=112 | [#1019](https://github.com/openai/parameter-golf/pull/1019) lineage (concept: [#162](https://github.com/openai/parameter-golf/pull/162) @raahilshah) |
+| RoPE | Partial (16/64 dims) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
+| LN Scale | 1/√(layer+1) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
+| VE128 | Layers 9-10 | [#374](https://github.com/openai/parameter-golf/pull/374) @unnir |
+| SmearGate | Position-mixing gate | [#65](https://github.com/openai/parameter-golf/pull/65) @aquariouseworkman |
+| U-Net skips | Encoder-decoder connections | [#289](https://github.com/openai/parameter-golf/pull/289) |
+| Weight avg | EMA(0.997) + Tight SWA(every 50) | [#401](https://github.com/openai/parameter-golf/pull/401) @newjordan |
+| Quantization | Full Hessian GPTQ int6 (AR self-gen calibration) | [#1019](https://github.com/openai/parameter-golf/pull/1019) lineage (GPTQ: [#535](https://github.com/openai/parameter-golf/pull/535) @raahilshah) |
+| Compression | LZMA preset=9 | [#160](https://github.com/openai/parameter-golf/pull/160) @ChaseWNorton |
+| Warmdown | 4000 iterations | [#364](https://github.com/openai/parameter-golf/pull/364) @shikhar1729 |
+| Optimizer | Parallel Muon + Parameter Banking + **gated Krylov residual correction** | **This work**, built on [#399](https://github.com/openai/parameter-golf/pull/399) @abaybektursun |
+| Late QAT | STE at LR scale < 0.15 | [#286](https://github.com/openai/parameter-golf/pull/286) @chris-buckley |
+| Selective pruning | ±1 values by reconstruction error | [#609](https://github.com/openai/parameter-golf/pull/609) @saml212 |
+| Flash Attention 3 | Hopper warp-specialized kernels | [#122](https://github.com/openai/parameter-golf/pull/122) @mtybadger |
+
+
+## What Actually Ran
+
+This result used:
+
+- the standard SentencePiece `sp1024` tokenizer path
+- 11 layers, 512 dim, 8 attention heads, 4 KV heads
+- 3x MLP with LeakyReLU(0.5)^2
+- XSA across all 11 layers
+- BigramHash, SmearGate, VE128, partial RoPE, U-Net skips
+- EMA after training
+- AR self-generated Full-Hessian GPTQ int6 export
+- selective `±1` pruning to fit the official byte cap
+
+The script snapshot used for the run is [train_gpt.py], apart from the CPU-import compatibility guard noted under Results. It is the historical single-file training script copied from the A100 box, not the current evolving repo root script.
+
+### Brief HNet Result
+
+The later learned-HNet branch was measured on the same A100 box and finished at:
+
+| Variant | Sliding BPB | Int6 roundtrip exact BPB | Artifact |
+|---------|-------------|--------------------------|----------|
+| SentencePiece + Muon + gated Krylov | `1.09596320` | `1.11953265` | `15,957,504` |
+| HNet + Muon + gated Krylov | `1.42700113` | `1.51636243` | `15,554,948` |
+
+So HNet was comfortably under the byte cap, but much worse in quality: about 0.331 BPB worse on the sliding evaluation and about 0.397 BPB worse on the int6 roundtrip. The main issue was not compression size; it was that the HNet path changed the representation and throughput in a way that hurt this setup.
+
+## Run Command
+
+```bash
+TARGET_MB=15.2587890625 \
+MUON_KRYLOV_ENABLED=1 \
+MUON_KRYLOV_ALPHA=0.05 \
+MUON_KRYLOV_ETA_THRESHOLD=0.03 \
+MUON_KRYLOV_WARMUP_STEPS=1000 \
+MUON_KRYLOV_DECISION_EVERY=100 \
+MUON_KRYLOV_EVERY=2 \
+MUON_KRYLOV_HUTCHINSON_SAMPLES=2 \
+MUON_KRYLOV_RANK_MAX=4 \
+MUON_KRYLOV_RANK_SCALE=1.0 \
+VAL_LOSS_EVERY=2000 \
+TRAIN_LOG_EVERY=200 \
+WARMUP_STEPS=20 \
+SEED=1337 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py
+```
+
+## Bottom Line
+
+The useful result here is simple: keep the strong 11L SP stack, keep Muon as the main optimizer, and add only a small gated Krylov residual correction on top. That combination produced an under-cap non-record score of **1.09596320 BPB** after 8h 52m of training on a single A100 80GB.
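
The gating steps listed under "Main Idea" in the README above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in `train_gpt.py`. The normalization of the nonnormality score by `||W||_F^2`, the rank rule in `krylov_rank`, and all function names are assumptions made for the example. Only the default values (`0.03`, `4`, `1.0`, `0.05`, two Hutchinson samples) come from the run command. The Krylov residual itself is not reconstructed here; `blend` simply takes a correction as input.

```python
import math
import random

def matvec(W, v):
    """Compute W v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matvec_t(W, v):
    """Compute W^T v without materialising the transpose."""
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]

def commutator_apply(W, z):
    """Return (W^T W - W W^T) z using matrix-vector products only."""
    a = matvec_t(W, matvec(W, z))  # W^T (W z)
    b = matvec(W, matvec_t(W, z))  # W (W^T z)
    return [x - y for x, y in zip(a, b)]

def nonnormality(W, samples=2, seed=0):
    """Hutchinson estimate of ||W^T W - W W^T||_F / ||W||_F^2 (assumed score)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        # Rademacher probe: E[||C z||^2] equals ||C||_F^2.
        z = [rng.choice((-1.0, 1.0)) for _ in range(len(W))]
        total += sum(c * c for c in commutator_apply(W, z))
    fro_sq = sum(w * w for row in W for w in row)
    if fro_sq == 0.0:
        return 0.0
    return math.sqrt(total / samples) / fro_sq

def krylov_rank(eta, threshold=0.03, rank_max=4, rank_scale=1.0):
    """Hypothetical rank rule: 0 below the threshold, then grow with eta up to rank_max."""
    if eta < threshold:
        return 0
    return max(1, min(rank_max, int(rank_scale * eta / threshold)))

def blend(muon_dir, correction, alpha=0.05):
    """Add a small multiple of the correction to the Muon direction."""
    return [[m + alpha * c for m, c in zip(mr, cr)]
            for mr, cr in zip(muon_dir, correction)]

symmetric = [[2.0, 1.0], [1.0, 3.0]]  # normal matrix: commutator is zero
shift = [[0.0, 1.0], [0.0, 0.0]]      # nilpotent, strongly nonnormal

print(krylov_rank(nonnormality(symmetric)))  # 0
print(krylov_rank(nonnormality(shift)))      # 4
```

A symmetric slice has a zero commutator, so the gate stays closed and the plain Muon direction is used. The nilpotent slice scores well above the threshold and is assigned the maximum rank.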
diff --git a/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/requirements.txt b/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/requirements.txt
@@ -0,0 +1,4 @@
+numpy
+sentencepiece
+torch
+zstandard
diff --git a/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/run_a100_tmux.sh b/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/run_a100_tmux.sh
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+  cat <<'EOF'
+Usage:
+  REMOTE_HOST=<host> ./run_a100_tmux.sh
+
+Optional environment overrides:
+  REMOTE_USER          SSH user (default: root)
+  REMOTE_PORT          SSH port (default: 22)
+  REMOTE_KEY           SSH private key path (default: use your normal ssh config)
+  REMOTE_BASE_DIR      Remote records base dir
+  REMOTE_DIR           Exact remote record dir (overrides REMOTE_BASE_DIR)
+  SESSION              tmux session name
+  RUN_ID               training RUN_ID (default: SESSION)
+  DATA_PATH            remote dataset path
+  TOKENIZER_PATH       remote tokenizer path
+  NPROC_PER_NODE       torchrun worker count (default: 1)
+  TARGET_MB            export target in MiB (default: 15.2587890625)
+  SEED                 random seed (default: 1337)
+  MAX_WALLCLOCK_SECONDS training cap in seconds (default: 0)
+
+Example:
+  REMOTE_HOST=216.81.245.7 REMOTE_PORT=49989 REMOTE_KEY=~/.ssh/id_ed25519 ./run_a100_tmux.sh
+EOF
+  exit 0
+fi
+
+SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+RECORD_DIR_NAME="$(basename "${SCRIPT_DIR}")"
+
+REMOTE_HOST="${REMOTE_HOST:-}"
+REMOTE_USER="${REMOTE_USER:-root}"
+REMOTE_PORT="${REMOTE_PORT:-22}"
+REMOTE_KEY="${REMOTE_KEY:-}"
+
+REMOTE_BASE_DIR="${REMOTE_BASE_DIR:-/workspace/parameter-golf/records/track_non_record_16mb}"
+REMOTE_DIR="${REMOTE_DIR:-${REMOTE_BASE_DIR}/${RECORD_DIR_NAME}}"
+
+SESSION="${SESSION:-$(printf '%s' "${RECORD_DIR_NAME}" | tr -cs '[:alnum:]_-' '_')}"
+RUN_ID="${RUN_ID:-${SESSION}}"
+
+DATA_PATH="${DATA_PATH:-/workspace/parameter-golf/data/datasets/fineweb10B_sp1024}"
+TOKENIZER_PATH="${TOKENIZER_PATH:-/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model}"
+NPROC_PER_NODE="${NPROC_PER_NODE:-1}"
+
+TARGET_MB="${TARGET_MB:-15.2587890625}"
+SEED="${SEED:-1337}"
+MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-0}"
+
+MUON_KRYLOV_ENABLED="${MUON_KRYLOV_ENABLED:-1}"
+MUON_KRYLOV_ALPHA="${MUON_KRYLOV_ALPHA:-0.05}"
+MUON_KRYLOV_ETA_THRESHOLD="${MUON_KRYLOV_ETA_THRESHOLD:-0.03}"
+MUON_KRYLOV_WARMUP_STEPS="${MUON_KRYLOV_WARMUP_STEPS:-1000}"
+MUON_KRYLOV_DECISION_EVERY="${MUON_KRYLOV_DECISION_EVERY:-100}"
+MUON_KRYLOV_EVERY="${MUON_KRYLOV_EVERY:-2}"
+MUON_KRYLOV_HUTCHINSON_SAMPLES="${MUON_KRYLOV_HUTCHINSON_SAMPLES:-2}"
+MUON_KRYLOV_RANK_MAX="${MUON_KRYLOV_RANK_MAX:-4}"
+MUON_KRYLOV_RANK_SCALE="${MUON_KRYLOV_RANK_SCALE:-1.0}"
+
+VAL_LOSS_EVERY="${VAL_LOSS_EVERY:-2000}"
+TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-200}"
+WARMUP_STEPS="${WARMUP_STEPS:-20}"
+
+if [[ -z "${REMOTE_HOST}" ]]; then
+  echo "error: set REMOTE_HOST before running this script" >&2
+  exit 1
+fi
+
+SSH_ARGS=(-p "${REMOTE_PORT}")
+if [[ -n "${REMOTE_KEY}" ]]; then
+  SSH_ARGS+=(-i "${REMOTE_KEY}")
+fi
+
+ssh "${SSH_ARGS[@]}" "${REMOTE_USER}@${REMOTE_HOST}" 'bash -s' <<EOF
+set -euo pipefail
+cd "${REMOTE_DIR}"
+mkdir -p logs
+LOG="logs/${SESSION}_\$(date +%Y%m%d_%H%M%S).log"
+tmux kill-session -t "${SESSION}" 2>/dev/null || true
+tmux new-session -d -s "${SESSION}" "
+cd ${REMOTE_DIR} && env \\
+RUN_ID=${RUN_ID} \\
+DATA_PATH=${DATA_PATH} \\
+TOKENIZER_PATH=${TOKENIZER_PATH} \\
+TARGET_MB=${TARGET_MB} \\
+MUON_KRYLOV_ENABLED=${MUON_KRYLOV_ENABLED} \\
+MUON_KRYLOV_ALPHA=${MUON_KRYLOV_ALPHA} \\
+MUON_KRYLOV_ETA_THRESHOLD=${MUON_KRYLOV_ETA_THRESHOLD} \\
+MUON_KRYLOV_WARMUP_STEPS=${MUON_KRYLOV_WARMUP_STEPS} \\
+MUON_KRYLOV_DECISION_EVERY=${MUON_KRYLOV_DECISION_EVERY} \\
+MUON_KRYLOV_EVERY=${MUON_KRYLOV_EVERY} \\
+MUON_KRYLOV_HUTCHINSON_SAMPLES=${MUON_KRYLOV_HUTCHINSON_SAMPLES} \\
+MUON_KRYLOV_RANK_MAX=${MUON_KRYLOV_RANK_MAX} \\
+MUON_KRYLOV_RANK_SCALE=${MUON_KRYLOV_RANK_SCALE} \\
+VAL_LOSS_EVERY=${VAL_LOSS_EVERY} \\
+TRAIN_LOG_EVERY=${TRAIN_LOG_EVERY} \\
+WARMUP_STEPS=${WARMUP_STEPS} \\
+SEED=${SEED} \\
+MAX_WALLCLOCK_SECONDS=${MAX_WALLCLOCK_SECONDS} \\
+torchrun --standalone --nproc_per_node=${NPROC_PER_NODE} train_gpt.py 2>&1 | tee \${LOG}"
+echo "session:${SESSION}"
+echo "log:\${LOG}"
+EOF
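
The default `TARGET_MB=15.2587890625` in the launcher above looks arbitrary, but it is the `16,000,000` byte cap expressed in MiB. The sketch below checks that arithmetic and the headroom left by the reported artifact. The helper names are hypothetical, and the way `train_gpt.py` actually consumes `TARGET_MB` (for example, whether code bytes count against it) is not shown in this diff.

```python
MIB = 1024 * 1024  # bytes per MiB

def target_bytes(target_mb: float) -> int:
    """Convert an export target given in MiB to whole bytes."""
    return int(round(target_mb * MIB))

def headroom(cap_bytes: int, model_bytes: int, code_bytes: int) -> int:
    """Bytes left under the cap once the compressed model and code are counted."""
    return cap_bytes - (model_bytes + code_bytes)

cap = target_bytes(15.2587890625)
print(cap)                                 # 16000000
print(headroom(cap, 15_817_800, 139_704))  # 42496
```

With the README's figures, the artifact sits 42,496 bytes under the cap.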
diff --git a/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/submission.json b/records/track_non_record_16mb/2026-04-06_Muon_Krylov_SentencePiece_A100/submission.json
@@ -0,0 +1,11 @@
+{
+  "author": "Ganesh Talluri, Laura Gomezjurado, Hiroki Naganuma",
+  "github_id": "g4nesh, lauragomezjurado, hiroki11x",
+  "name": "11L + gated Krylov + AR GPTQ int6 + sliding-window eval + XSA + BigramHash",
+  "blurb": "Non-record 16MB submission using the standard SentencePiece GPT path with Parallel Muon plus a small gated Krylov correction on square nonnormal slices. Uses AR self-generated Full-Hessian GPTQ int6, selective +/-1 pruning, and sliding-window evaluation. 26.99M params, 15,957,504-byte artifact, trained on 1xA100 80GB.",
+  "date": "2026-04-06T20:30:00Z",
+  "val_loss": 1.85048306,
+  "val_bpb": 1.09596320,
+  "bytes_total": 15957504,
+  "bytes_code": 139704
+}
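
The numeric fields in `submission.json` can be cross-checked against the README's results table. The sketch below assumes that `val_loss` is a mean cross-entropy in nats per token, which is the usual convention but is not stated in this diff. Under that assumption, the ratio between `val_loss` and `val_bpb` gives the average number of bytes per token on the validation set. Variable names are illustrative only.

```python
import json
import math

# Numeric fields copied from submission.json above.
submission = json.loads("""
{
  "val_loss": 1.85048306,
  "val_bpb": 1.09596320,
  "bytes_total": 15957504,
  "bytes_code": 139704
}
""")

# Compressed model size implied by the two byte fields.
model_bytes = submission["bytes_total"] - submission["bytes_code"]

# nats per token -> bits per token, then divide by bits per byte.
bits_per_token = submission["val_loss"] / math.log(2)
bytes_per_token = bits_per_token / submission["val_bpb"]

print(model_bytes)                # 15817800
print(round(bytes_per_token, 2))  # 2.44
```

The first value matches the "Compressed model bytes" row in the README. The second suggests roughly 2.44 bytes per `sp1024` token, if the nats-per-token assumption holds.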