README.md
@@ -0,0 +1,86 @@
This submission packages a near-frontier 10-minute run as a `track_non_record_16mb` entry.

It is intentionally submitted as a non-record result under the current rules. The run stays under the decimal `16,000,000`-byte artifact cap, trains on `8xH100 SXM`, and evaluates cleanly, but it does not claim a new public SOTA: the live open-PR frontier is already slightly lower than this result, and this package includes only one full leaderboard-grade seed. Under the current README rules, record submissions should beat the current SOTA PR by at least `0.005` nats with sufficient significance.

What this run is:
- A faithful reproduction of the public PR315-style 11-layer transformer line on RunPod `8xH100 SXM`, with native Hopper FlashAttention and `torch.compile`
- One cheap orthogonal addition: a learned Backout residual subtraction from the mid-network hidden state (sketched after this list)
- Seed `2025`, full `80`-shard SP-1024 training set, `600s` training cap, stride-64 sliding-window evaluation
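
As a point of reference, here is a minimal sketch of the Backout idea described above, assuming a standard PyTorch block stack: the hidden state at a mid-network layer is captured, and a learned multiple of it is subtracted from the final hidden state before the head. The class and attribute names (`BackoutSketch`, `backout_lambda`) are illustrative assumptions; the submitted `train_gpt.py` is the authoritative implementation.

```python
import torch
import torch.nn as nn

class BackoutSketch(nn.Module):
    """Illustrative sketch only: capture the hidden state after a mid-network
    block and subtract a learned multiple of it from the final hidden state."""

    def __init__(self, blocks: nn.ModuleList, backout_layer: int, lambda_init: float = 0.2):
        super().__init__()
        self.blocks = blocks
        self.backout_layer = backout_layer
        # Learned scalar, initialized like BACKOUT_LAMBDA_INIT in the command below.
        self.backout_lambda = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        captured = None
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i == self.backout_layer:
                captured = x  # mid-network hidden state
        if captured is not None:
            x = x - self.backout_lambda * captured  # learned residual subtraction
        return x
```

In the log below this resolves to `backout_layer:5` (the middle of the 11-layer stack) with `backout_lambda_init:0.2`.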

Exact result:
- `final_int6_sliding_window_exact val_loss: 1.89896029`
- `final_int6_sliding_window_exact val_bpb: 1.12467423`
- `step_stop: 7048`
- `train_time: 600037ms`

Artifact accounting:
- Compressed model (`int6+zstd`): `15,472,918` bytes
- Submitted `train_gpt.py`: `72,744` bytes
- Total packaged artifact size: `15,545,662` bytes
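
For orientation, here is a minimal sketch of how an `int6+zstd` artifact along these lines can be produced: per-tensor symmetric quantization to 6-bit integer codes, followed by `zstandard` compression (the `zstandard` package is listed in `requirements.txt`). The packing layout here (a float32 scale plus one int8 byte per code) is an assumption made for readability; the actual serializer in `train_gpt.py` may pack codes more tightly. The closing comment simply restates the byte accounting above.

```python
import io
import numpy as np
import zstandard as zstd

def quantize_int6_zstd_sketch(state_dict: dict) -> bytes:
    """Illustrative per-tensor symmetric 6-bit quantization followed by zstd.
    Each tensor is stored as a float32 scale plus int8 codes in [-31, 31]."""
    buf = io.BytesIO()
    for name, tensor in state_dict.items():
        w = tensor.detach().float().cpu().numpy()
        scale = float(np.abs(w).max()) / 31.0 + 1e-12
        codes = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
        buf.write(np.float32(scale).tobytes())
        buf.write(codes.tobytes())
    return zstd.ZstdCompressor(level=19).compress(buf.getvalue())

# Byte accounting from this README:
# 15_472_918 (model, int6+zstd) + 72_744 (train_gpt.py) == 15_545_662 < 16_000_000
```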

Important note on code bytes:
- The original experiment log reports `Code size: 69975 bytes`, because the experiment version imported a sibling `flash_attn_interface.py`.
- For this submission folder, that helper has been inlined into `train_gpt.py`, so the submission is self-contained and more closely follows the repo guidance that counted code should live in `train_gpt.py`.
- The underlying model artifact is unchanged; only the packaged code bytes increase slightly.

Run details from `train.log`:
- Backend proof: `flash_attn_backend:native`
- Compile proof: `torch_compile:True`
- Stable throughput: about `85.14ms/step`
- Peak memory: `20693 MiB allocated`, `20748 MiB reserved`
- Post-quant roundtrip exact metric: `val_bpb: 1.14823337`
- Sliding-window exact metric: `val_bpb: 1.12467423`
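
For reference, here is a minimal sketch of the stride-64 sliding-window evaluation behind the headline metric, under the assumption that each 2048-token window is scored but only its last `stride` targets contribute to the loss, so every scored token sees close to the full left context. Function and argument names are illustrative; the exact evaluation loop lives in `train_gpt.py`.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll_sketch(model, tokens: torch.Tensor, seq_len: int = 2048, stride: int = 64) -> float:
    """Illustrative stride-64 sliding-window eval over a 1D token stream:
    score only the last `stride` targets of each window."""
    total_nll, total_targets = 0.0, 0
    for start in range(0, tokens.numel() - seq_len - 1, stride):
        window = tokens[start : start + seq_len + 1]
        logits = model(window[:-1].unsqueeze(0))      # (1, seq_len, vocab)
        targets = window[1:].unsqueeze(0)             # next-token targets
        nll = F.cross_entropy(
            logits[:, -stride:].reshape(-1, logits.size(-1)),
            targets[:, -stride:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_targets += stride
    return total_nll / total_targets  # nats per token

# val_bpb additionally multiplies nats/token by the tokenizer's tokens-to-bytes
# ratio on the validation text and divides by ln(2).
```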

Track-relevant command:
```bash
OMP_NUM_THREADS=1 \
RUN_ID=runpod-pr315-backout-seed2025-20260322 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
REQUIRE_NATIVE_FLASH_ATTN=1 \
ENABLE_TORCH_COMPILE=1 \
ITERATIONS=9000 \
VAL_LOSS_EVERY=0 \
TRAIN_LOG_EVERY=200 \
TRAIN_BATCH_TOKENS=786432 \
VAL_BATCH_SIZE=524288 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_SEQ_LEN=2048 \
EVAL_SEQ_LEN=2048 \
EVAL_STRIDE=64 \
NUM_LAYERS=11 \
BIGRAM_VOCAB_SIZE=2048 \
XSA_LAST_N=4 \
EMA_ENABLED=1 \
EMA_DECAY=0.997 \
SWA_ENABLED=0 \
ROPE_DIMS=16 \
LN_SCALE=1 \
LATE_QAT=1 \
BACKOUT_ENABLED=1 \
BACKOUT_LAMBDA_INIT=0.2 \
BACKOUT_LAYER=-1 \
QAT_THRESHOLD=0.1 \
MUON_WD=0.04 \
ADAM_WD=0.04 \
MATRIX_LR=0.025 \
SCALAR_LR=0.025 \
TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 \
MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 \
WARMDOWN_ITERS=3000 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Setup notes:
- This run was produced on the official RunPod Parameter Golf image with a native Hopper FlashAttention install available on the machine.
- The submitted script will run with fallback SDPA for local smoke tests if `REQUIRE_NATIVE_FLASH_ATTN=0`, but a faithful reproduction of this score expects native FA3 on `8xH100 SXM`.
- If you are self-provisioning instead of using the official template, install the Python packages in `requirements.txt` and make sure native Hopper FlashAttention is available to Python.
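
As a convenience for self-provisioned machines, a small preflight check along the lines below can confirm that the native Hopper bindings are importable before launching a full run. The module name `flash_attn_interface` is the helper mentioned earlier in this README; treating an `ImportError` as the cue to fall back to SDPA (via `REQUIRE_NATIVE_FLASH_ATTN=0`) is an assumption about how the script behaves, not part of the submission.

```python
# Hypothetical preflight check, not part of the packaged submission.
try:
    import flash_attn_interface  # native Hopper FlashAttention bindings referenced above
    print("native FlashAttention available -> expect flash_attn_backend:native in train.log")
except ImportError:
    print("native FlashAttention missing -> set REQUIRE_NATIVE_FLASH_ATTN=0 for an SDPA smoke test")
```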

Included files:
- `README.md`
- `submission.json`
- `train.log`
- `train_gpt.py`
- `requirements.txt`
requirements.txt
@@ -0,0 +1,5 @@
numpy
torch
sentencepiece
zstandard
flash-attn
submission.json
@@ -0,0 +1,23 @@
{
"author": "GreQ",
"github_id": "greqone",
"name": "11L PR315 Backout + Native FA3",
"blurb": "Non-record 10-minute track submission: a faithful PR315-style 11-layer Partial-RoPE + LN-Scale + EMA + XSA4 run on 8xH100 SXM with native Hopper FlashAttention and one cheap Backout residual, reaching 1.12467423 val_bpb under the 16,000,000-byte artifact cap.",
"date": "2026-03-22T02:25:59Z",
"track": "non-record-10min-16mb",
"val_loss": 1.89896029,
"val_bpb": 1.12467423,
"roundtrip_val_loss": 1.93874395,
"roundtrip_val_bpb": 1.14823337,
"step_stop": 7048,
"wallclock_seconds": 600.037,
"eval_seconds_roundtrip": 35.625,
"eval_seconds_sliding": 90.736,
"bytes_total": 15545662,
"bytes_model_int6_zstd": 15472918,
"bytes_code": 72744,
"seed": 2025,
"train_shards": 80,
"gpu": "8xH100 SXM (RunPod)",
"notes": "Submitted as a non-record result under the current rules because the live public open-PR frontier is already slightly lower than this score and this package contains a single full run rather than a significance set."
}
train.log
@@ -0,0 +1,97 @@

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
logs/runpod-pr315-backout-seed2025-20260322.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26829914
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
flash_attn_backend:native
torch_compile:True
attention_mode:gqa num_heads:8 num_kv_heads:4
backout_enabled:True backout_layer:5 backout_lambda_init:0.2
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:9000 warmup_steps:20 max_wallclock_seconds:600.000
seed:2025
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:1/9000 train_loss:6.9287 train_time:215ms step_avg:214.96ms
step:2/9000 train_loss:8.6073 train_time:285ms step_avg:142.56ms
step:3/9000 train_loss:7.8361 train_time:368ms step_avg:122.62ms
step:4/9000 train_loss:7.1846 train_time:451ms step_avg:112.70ms
step:5/9000 train_loss:7.0081 train_time:533ms step_avg:106.68ms
step:6/9000 train_loss:6.9637 train_time:616ms step_avg:102.65ms
step:7/9000 train_loss:6.8995 train_time:699ms step_avg:99.82ms
step:8/9000 train_loss:6.8539 train_time:782ms step_avg:97.77ms
step:9/9000 train_loss:6.4929 train_time:865ms step_avg:96.10ms
step:10/9000 train_loss:6.1303 train_time:948ms step_avg:94.78ms
step:200/9000 train_loss:2.4466 train_time:17110ms step_avg:85.55ms
step:400/9000 train_loss:2.4483 train_time:34183ms step_avg:85.46ms
step:600/9000 train_loss:2.3563 train_time:51173ms step_avg:85.29ms
step:800/9000 train_loss:2.2488 train_time:68238ms step_avg:85.30ms
step:1000/9000 train_loss:2.2795 train_time:85211ms step_avg:85.21ms
step:1200/9000 train_loss:2.3573 train_time:102270ms step_avg:85.22ms
step:1400/9000 train_loss:2.1855 train_time:119332ms step_avg:85.24ms
step:1600/9000 train_loss:2.0734 train_time:136299ms step_avg:85.19ms
step:1800/9000 train_loss:2.1535 train_time:153339ms step_avg:85.19ms
step:2000/9000 train_loss:2.0610 train_time:170319ms step_avg:85.16ms
step:2200/9000 train_loss:2.1335 train_time:187352ms step_avg:85.16ms
step:2400/9000 train_loss:2.0601 train_time:204298ms step_avg:85.12ms
step:2600/9000 train_loss:2.1025 train_time:221330ms step_avg:85.13ms
step:2800/9000 train_loss:2.1496 train_time:238399ms step_avg:85.14ms
step:3000/9000 train_loss:2.1566 train_time:255379ms step_avg:85.13ms
step:3200/9000 train_loss:2.1713 train_time:272442ms step_avg:85.14ms
step:3400/9000 train_loss:2.0193 train_time:289421ms step_avg:85.12ms
step:3600/9000 train_loss:2.0964 train_time:306493ms step_avg:85.14ms
step:3800/9000 train_loss:2.0769 train_time:323474ms step_avg:85.12ms
step:4000/9000 train_loss:1.9828 train_time:340521ms step_avg:85.13ms
step:4200/9000 train_loss:2.1580 train_time:357574ms step_avg:85.14ms
step:4400/9000 train_loss:2.0403 train_time:374530ms step_avg:85.12ms
step:4600/9000 train_loss:1.8493 train_time:391592ms step_avg:85.13ms
step:4800/9000 train_loss:2.4337 train_time:408572ms step_avg:85.12ms
step:5000/9000 train_loss:2.1064 train_time:425606ms step_avg:85.12ms
step:5200/9000 train_loss:2.0472 train_time:442571ms step_avg:85.11ms
step:5400/9000 train_loss:2.0536 train_time:459619ms step_avg:85.11ms
step:5600/9000 train_loss:1.9580 train_time:476683ms step_avg:85.12ms
step:5800/9000 train_loss:2.0010 train_time:493661ms step_avg:85.11ms
step:6000/9000 train_loss:1.9442 train_time:510730ms step_avg:85.12ms
step:6200/9000 train_loss:1.9558 train_time:527716ms step_avg:85.12ms
step:6400/9000 train_loss:2.0030 train_time:544866ms step_avg:85.14ms
step:6600/9000 train_loss:1.8454 train_time:561847ms step_avg:85.13ms
late_qat:enabled step:6748 scale:0.0997
step:6800/9000 train_loss:2.0268 train_time:578908ms step_avg:85.13ms
step:7000/9000 train_loss:1.7881 train_time:595962ms step_avg:85.14ms
step:7048/9000 val_loss:1.9275 val_bpb:1.1416 train_time:600037ms step_avg:85.14ms
stopping_early: wallclock_cap train_time:600037ms step:7048/9000
peak memory allocated: 20693 MiB reserved: 20748 MiB
ema:applying EMA weights
Serialized model: 105784065 bytes
Code size: 69975 bytes
Serialized model int6+zstd: 15472918 bytes
Total submission size int6+zstd: 15542893 bytes
final_int6_roundtrip val_loss:1.9387 val_bpb:1.1482 eval_time:35625ms
final_int6_roundtrip_exact val_loss:1.93874395 val_bpb:1.14823337
final_int6_sliding_window val_loss:1.8990 val_bpb:1.1247 stride:64 eval_time:90736ms
final_int6_sliding_window_exact val_loss:1.89896029 val_bpb:1.12467423