README.md
@@ -0,0 +1,86 @@
This submission packages a near-frontier 10-minute run as a `track_non_record_16mb` entry.

It is intentionally submitted as a non-record result under the current rules. The run stays under the decimal `16,000,000`-byte artifact cap, trains on `8xH100 SXM`, and evaluates cleanly, but it does not claim a new public SOTA: the live open-PR frontier is already slightly lower than this result, and this package includes only one full leaderboard-grade seed. Under the current README rules, record submissions should beat the current SOTA PR by at least `0.005` nats with sufficient significance.

What this run is:
- A faithful reproduction of the public PR315-style 11-layer transformer line on RunPod `8xH100 SXM`, with native Hopper FlashAttention and `torch.compile`
- One cheap orthogonal addition: a learned Backout residual subtraction from the mid-network hidden state (sketched after this list)
- Seed `2025`, full `80`-shard SP-1024 training set, `600s` training cap, stride-64 sliding-window evaluation
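
As a point of reference, here is a minimal sketch of the Backout idea described above, assuming a standard PyTorch block stack: the hidden state at a mid-network layer is captured, and a learned multiple of it is subtracted from the final hidden state before the head. The class and attribute names (`BackoutSketch`, `backout_lambda`) are illustrative assumptions; the submitted `train_gpt.py` is the authoritative implementation.

```python
import torch
import torch.nn as nn

class BackoutSketch(nn.Module):
    """Illustrative sketch only: capture the hidden state after a mid-network
    block and subtract a learned multiple of it from the final hidden state."""

    def __init__(self, blocks: nn.ModuleList, backout_layer: int, lambda_init: float = 0.2):
        super().__init__()
        self.blocks = blocks
        self.backout_layer = backout_layer
        # Learned scalar, initialized like BACKOUT_LAMBDA_INIT in the command below.
        self.backout_lambda = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        captured = None
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.backout_layer:
                captured = x  # mid-network hidden state
        if captured is not None:
            x = x - self.backout_lambda * captured  # learned residual subtraction
        return x
```

In the log below this resolves to `backout_layer:5` (the middle of the 11-layer stack) with `backout_lambda_init:0.2`.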

Exact result:
- `final_int6_sliding_window_exact val_loss: 1.89896029`
- `final_int6_sliding_window_exact val_bpb: 1.12467423`
- `step_stop: 7048`
- `train_time: 600037ms`

Artifact accounting:
- Compressed model (`int6+zstd`): `15,472,918` bytes
- Submitted `train_gpt.py`: `72,744` bytes
- Total packaged artifact size: `15,545,662` bytes
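
For orientation, here is a minimal sketch of how an `int6+zstd` artifact along these lines can be produced: per-tensor symmetric quantization to 6-bit integer codes, followed by `zstandard` compression (the `zstandard` package is listed in `requirements.txt`). The packing layout here (a float32 scale plus one int8 byte per code) is an assumption made for readability; the actual serializer in `train_gpt.py` may pack codes more tightly. The closing comment simply restates the byte accounting above.

```python
import io
import numpy as np
import zstandard as zstd

def quantize_int6_zstd_sketch(state_dict: dict) -> bytes:
    """Illustrative per-tensor symmetric 6-bit quantization followed by zstd.
    Each tensor is stored as a float32 scale plus int8 codes in [-31, 31]."""
    buf = io.BytesIO()
    for name, tensor in state_dict.items():
        w = tensor.detach().float().cpu().numpy()
        scale = float(np.abs(w).max()) / 31.0 + 1e-12
        codes = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
        buf.write(np.float32(scale).tobytes())
        buf.write(codes.tobytes())
    return zstd.ZstdCompressor(level=19).compress(buf.getvalue())

# Byte accounting from this README:
# 15_472_918 (model, int6+zstd) + 72_744 (train_gpt.py) == 15_545_662 < 16_000_000
```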

Important note on code bytes:
- The original experiment log reports `Code size: 69975 bytes`, because the experiment version imported a sibling `flash_attn_interface.py`.
- For this submission folder, that helper has been inlined into `train_gpt.py`, so the submission is self-contained and more closely follows the repo guidance that counted code should live in `train_gpt.py`.
- The underlying model artifact is unchanged; only the packaged code bytes increase slightly.

Run details from `train.log`:
- Backend proof: `flash_attn_backend:native`
- Compile proof: `torch_compile:True`
- Stable throughput: about `85.14ms/step`
- Peak memory: `20693 MiB allocated`, `20748 MiB reserved`
- Post-quant roundtrip exact metric: `val_bpb: 1.14823337`
- Sliding-window exact metric: `val_bpb: 1.12467423`
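
For reference, here is a minimal sketch of the stride-64 sliding-window evaluation behind the headline metric, under the assumption that each 2048-token window is scored but only its last `stride` targets contribute to the loss, so every scored token sees close to the full left context. Function and argument names are illustrative; the exact evaluation loop lives in `train_gpt.py`.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll_sketch(model, tokens: torch.Tensor, seq_len: int = 2048, stride: int = 64) -> float:
    """Illustrative stride-64 sliding-window eval over a 1D token stream:
    score only the last `stride` targets of each window."""
    total_nll, total_targets = 0.0, 0
    for start in range(0, tokens.numel() - seq_len - 1, stride):
        window = tokens[start : start + seq_len + 1]
        logits = model(window[:-1].unsqueeze(0))      # (1, seq_len, vocab)
        targets = window[1:].unsqueeze(0)             # next-token targets
        nll = F.cross_entropy(
            logits[:, -stride:].reshape(-1, logits.size(-1)),
            targets[:, -stride:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_targets += stride
    return total_nll / total_targets  # nats per token

# val_bpb additionally multiplies nats/token by the tokenizer's tokens-to-bytes
# ratio on the validation text and divides by ln(2).
```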

Track-relevant command:
```bash
OMP_NUM_THREADS=1 \
RUN_ID=runpod-pr315-backout-seed2025-20260322 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
REQUIRE_NATIVE_FLASH_ATTN=1 \
ENABLE_TORCH_COMPILE=1 \
ITERATIONS=9000 \
VAL_LOSS_EVERY=0 \
TRAIN_LOG_EVERY=200 \
TRAIN_BATCH_TOKENS=786432 \
VAL_BATCH_SIZE=524288 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_SEQ_LEN=2048 \
EVAL_SEQ_LEN=2048 \
EVAL_STRIDE=64 \
NUM_LAYERS=11 \
BIGRAM_VOCAB_SIZE=2048 \
XSA_LAST_N=4 \
EMA_ENABLED=1 \
EMA_DECAY=0.997 \
SWA_ENABLED=0 \
ROPE_DIMS=16 \
LN_SCALE=1 \
LATE_QAT=1 \
BACKOUT_ENABLED=1 \
BACKOUT_LAMBDA_INIT=0.2 \
BACKOUT_LAYER=-1 \
QAT_THRESHOLD=0.1 \
MUON_WD=0.04 \
ADAM_WD=0.04 \
MATRIX_LR=0.025 \
SCALAR_LR=0.025 \
TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 \
MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 \
WARMDOWN_ITERS=3000 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Setup notes:
- This run was produced on the official RunPod Parameter Golf image with a native Hopper FlashAttention install available on the machine.
- The submitted script will run with fallback SDPA for local smoke tests if `REQUIRE_NATIVE_FLASH_ATTN=0`, but a faithful reproduction of this score expects native FA3 on `8xH100 SXM`.
- If you are self-provisioning instead of using the official template, install the Python packages in `requirements.txt` and make sure native Hopper FlashAttention is available to Python.
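
As a convenience for self-provisioned machines, a small preflight check along the lines below can confirm that the native Hopper bindings are importable before launching a full run. The module name `flash_attn_interface` is the helper mentioned earlier in this README; treating an `ImportError` as the cue to fall back to SDPA (via `REQUIRE_NATIVE_FLASH_ATTN=0`) is an assumption about how the script behaves, not part of the submission.

```python
# Hypothetical preflight check, not part of the packaged submission.
try:
    import flash_attn_interface  # native Hopper FlashAttention bindings referenced above
    print("native FlashAttention available -> expect flash_attn_backend:native in train.log")
except ImportError:
    print("native FlashAttention missing -> set REQUIRE_NATIVE_FLASH_ATTN=0 for an SDPA smoke test")
```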

Included files:
- `README.md`
- `submission.json`
- `train.log`
- `train_gpt.py`
- `requirements.txt`
requirements.txt
@@ -0,0 +1,5 @@
numpy
torch
sentencepiece
zstandard
flash-attn
submission.json
@@ -0,0 +1,23 @@
{
"author": "GreQ",
"github_id": "greqone",
"name": "11L PR315 Backout + Native FA3",
"blurb": "Non-record 10-minute track submission: a faithful PR315-style 11-layer Partial-RoPE + LN-Scale + EMA + XSA4 run on 8xH100 SXM with native Hopper FlashAttention and one cheap Backout residual, reaching 1.12467423 val_bpb under the 16,000,000-byte artifact cap.",
"date": "2026-03-22T02:25:59Z",
"track": "non-record-10min-16mb",
"val_loss": 1.89896029,
"val_bpb": 1.12467423,
"roundtrip_val_loss": 1.93874395,
"roundtrip_val_bpb": 1.14823337,
"step_stop": 7048,
"wallclock_seconds": 600.037,
"eval_seconds_roundtrip": 35.625,
"eval_seconds_sliding": 90.736,
"bytes_total": 15545662,
"bytes_model_int6_zstd": 15472918,
"bytes_code": 72744,
"seed": 2025,
"train_shards": 80,
"gpu": "8xH100 SXM (RunPod)",
"notes": "Submitted as a non-record result under the current rules because the live public open-PR frontier is already slightly lower than this score and this package contains a single full run rather than a significance set."
}
train.log
@@ -0,0 +1,97 @@

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
logs/runpod-pr315-backout-seed2025-20260322.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26829914
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
flash_attn_backend:native
torch_compile:True
attention_mode:gqa num_heads:8 num_kv_heads:4
backout_enabled:True backout_layer:5 backout_lambda_init:0.2
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:9000 warmup_steps:20 max_wallclock_seconds:600.000
seed:2025
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:1/9000 train_loss:6.9287 train_time:215ms step_avg:214.96ms
step:2/9000 train_loss:8.6073 train_time:285ms step_avg:142.56ms
step:3/9000 train_loss:7.8361 train_time:368ms step_avg:122.62ms
step:4/9000 train_loss:7.1846 train_time:451ms step_avg:112.70ms
step:5/9000 train_loss:7.0081 train_time:533ms step_avg:106.68ms
step:6/9000 train_loss:6.9637 train_time:616ms step_avg:102.65ms
step:7/9000 train_loss:6.8995 train_time:699ms step_avg:99.82ms
step:8/9000 train_loss:6.8539 train_time:782ms step_avg:97.77ms
step:9/9000 train_loss:6.4929 train_time:865ms step_avg:96.10ms
step:10/9000 train_loss:6.1303 train_time:948ms step_avg:94.78ms
step:200/9000 train_loss:2.4466 train_time:17110ms step_avg:85.55ms
step:400/9000 train_loss:2.4483 train_time:34183ms step_avg:85.46ms
step:600/9000 train_loss:2.3563 train_time:51173ms step_avg:85.29ms
step:800/9000 train_loss:2.2488 train_time:68238ms step_avg:85.30ms
step:1000/9000 train_loss:2.2795 train_time:85211ms step_avg:85.21ms
step:1200/9000 train_loss:2.3573 train_time:102270ms step_avg:85.22ms
step:1400/9000 train_loss:2.1855 train_time:119332ms step_avg:85.24ms
step:1600/9000 train_loss:2.0734 train_time:136299ms step_avg:85.19ms
step:1800/9000 train_loss:2.1535 train_time:153339ms step_avg:85.19ms
step:2000/9000 train_loss:2.0610 train_time:170319ms step_avg:85.16ms
step:2200/9000 train_loss:2.1335 train_time:187352ms step_avg:85.16ms
step:2400/9000 train_loss:2.0601 train_time:204298ms step_avg:85.12ms
step:2600/9000 train_loss:2.1025 train_time:221330ms step_avg:85.13ms
step:2800/9000 train_loss:2.1496 train_time:238399ms step_avg:85.14ms
step:3000/9000 train_loss:2.1566 train_time:255379ms step_avg:85.13ms
step:3200/9000 train_loss:2.1713 train_time:272442ms step_avg:85.14ms
step:3400/9000 train_loss:2.0193 train_time:289421ms step_avg:85.12ms
step:3600/9000 train_loss:2.0964 train_time:306493ms step_avg:85.14ms
step:3800/9000 train_loss:2.0769 train_time:323474ms step_avg:85.12ms
step:4000/9000 train_loss:1.9828 train_time:340521ms step_avg:85.13ms
step:4200/9000 train_loss:2.1580 train_time:357574ms step_avg:85.14ms
step:4400/9000 train_loss:2.0403 train_time:374530ms step_avg:85.12ms
step:4600/9000 train_loss:1.8493 train_time:391592ms step_avg:85.13ms
step:4800/9000 train_loss:2.4337 train_time:408572ms step_avg:85.12ms
step:5000/9000 train_loss:2.1064 train_time:425606ms step_avg:85.12ms
step:5200/9000 train_loss:2.0472 train_time:442571ms step_avg:85.11ms
step:5400/9000 train_loss:2.0536 train_time:459619ms step_avg:85.11ms
step:5600/9000 train_loss:1.9580 train_time:476683ms step_avg:85.12ms
step:5800/9000 train_loss:2.0010 train_time:493661ms step_avg:85.11ms
step:6000/9000 train_loss:1.9442 train_time:510730ms step_avg:85.12ms
step:6200/9000 train_loss:1.9558 train_time:527716ms step_avg:85.12ms
step:6400/9000 train_loss:2.0030 train_time:544866ms step_avg:85.14ms
step:6600/9000 train_loss:1.8454 train_time:561847ms step_avg:85.13ms
late_qat:enabled step:6748 scale:0.0997
step:6800/9000 train_loss:2.0268 train_time:578908ms step_avg:85.13ms
step:7000/9000 train_loss:1.7881 train_time:595962ms step_avg:85.14ms
step:7048/9000 val_loss:1.9275 val_bpb:1.1416 train_time:600037ms step_avg:85.14ms
stopping_early: wallclock_cap train_time:600037ms step:7048/9000
peak memory allocated: 20693 MiB reserved: 20748 MiB
ema:applying EMA weights
Serialized model: 105784065 bytes
Code size: 69975 bytes
Serialized model int6+zstd: 15472918 bytes
Total submission size int6+zstd: 15542893 bytes
final_int6_roundtrip val_loss:1.9387 val_bpb:1.1482 eval_time:35625ms
final_int6_roundtrip_exact val_loss:1.93874395 val_bpb:1.14823337
final_int6_sliding_window val_loss:1.8990 val_bpb:1.1247 stride:64 eval_time:90736ms
final_int6_sliding_window_exact val_loss:1.89896029 val_bpb:1.12467423