
Non-record: Blueprint Stack + ProgSeq + Multi-scale RoPE + ByteEmbed — val_bpb 1.5568 (1xRTX 3080) #1411

Open

Blakethefn wants to merge 1 commit into openai:main from Blakethefn:submission/blueprint-stack-1gpu

Conversation

@Blakethefn

Summary

Non-record submission exploring a combined technique stack on a single NVIDIA RTX 3080 (12 GB).

val_bpb: 1.5568 (quantized int8+zstd roundtrip, single seed, 10 minutes of training)

Architecture

10 layers, 512 model dim, GQA with 8 query heads / 4 KV heads, 3x MLP (relu²), tied embeddings, U-Net skip connections, Muon + Adam optimizers, SWA. Additions over baseline, each sketched below:

  • Progressive sequence length: 512 from the start, 1024 at 35% of the run, 2048 at 70%
  • Multi-scale RoPE by KV group: rotary bases [1K, 10K, 100K, 1M], one per KV group
  • Byte-level token embeddings: a dim-64 side channel built from each token's UTF-8 bytes
  • Mixed-bit quantization: int5 MLP, int6 attn/bigram/byte weights, then zstd
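
A minimal sketch of the progressive sequence-length schedule, reading the milestones above as (fraction of the run, sequence length) pairs; the helper name `seq_len_at` is illustrative, not from the submission's code:

```python
# Progressive sequence length: (progress fraction, seq len) milestones
# taken from the list above. `seq_len_at` is a hypothetical helper name.
SEQ_SCHEDULE = [(0.00, 512), (0.35, 1024), (0.70, 2048)]

def seq_len_at(progress: float) -> int:
    """Sequence length for training progress in [0, 1]."""
    seq_len = SEQ_SCHEDULE[0][1]
    for start, length in SEQ_SCHEDULE:
        if progress >= start:
            seq_len = length  # keep the latest milestone we've passed
    return seq_len

assert seq_len_at(0.0) == 512 and seq_len_at(0.5) == 1024 and seq_len_at(0.9) == 2048
```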
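
The grouped multi-scale RoPE could look like the sketch below, assuming head_dim = 512 / 8 = 64 and one rotary base per KV group (4 groups for the 8H/4KV config); how the rotations are wired into attention is not specified in the PR:

```python
import torch

# One rotary base per KV group, so each group of query heads attends at a
# different positional scale. Bases come from the list above; the function
# name `multi_scale_inv_freq` is illustrative.
ROPE_BASES = [1_000, 10_000, 100_000, 1_000_000]

def multi_scale_inv_freq(head_dim: int = 64) -> torch.Tensor:
    """Inverse rotary frequencies, shape (num_kv_groups, head_dim // 2)."""
    exponents = torch.arange(0, head_dim, 2).float() / head_dim  # (32,)
    bases = torch.tensor(ROPE_BASES, dtype=torch.float32)        # (4,)
    return 1.0 / bases[:, None] ** exponents[None, :]            # (4, 32)

inv_freq = multi_scale_inv_freq()
pos = torch.arange(2048).float()
angles = pos[None, :, None] * inv_freq[:, None, :]  # (group, pos, freq)
cos, sin = angles.cos(), angles.sin()  # rotate q/k per KV group with these
```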
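
For the byte-level side channel, one plausible reading is a per-token table of UTF-8 bytes whose pooled dim-64 embedding is projected and added to the token embedding. Only the dim-64 side channel is stated in the PR; the pooling, padding, and injection point below are assumptions:

```python
import torch
import torch.nn as nn

class ByteSideChannel(nn.Module):
    """Hypothetical byte-level embedding side channel (dim 64 per the PR)."""

    def __init__(self, token_strings: list[str], d_model: int,
                 byte_dim: int = 64, max_bytes: int = 16):
        super().__init__()
        # Precompute each token's UTF-8 bytes, shifted by 1 so 0 = padding.
        table = torch.zeros(len(token_strings), max_bytes, dtype=torch.long)
        mask = torch.zeros(len(token_strings), max_bytes)
        for i, s in enumerate(token_strings):
            bs = s.encode("utf-8")[:max_bytes]
            table[i, :len(bs)] = torch.tensor(list(bs)) + 1
            mask[i, :len(bs)] = 1.0
        self.register_buffer("table", table)
        self.register_buffer("mask", mask)
        self.byte_emb = nn.Embedding(257, byte_dim, padding_idx=0)
        self.proj = nn.Linear(byte_dim, d_model, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.byte_emb(self.table[token_ids])   # (..., max_bytes, byte_dim)
        mask = self.mask[token_ids].unsqueeze(-1)    # (..., max_bytes, 1)
        pooled = (emb * mask).sum(-2) / mask.sum(-2).clamp(min=1.0)
        return self.proj(pooled)  # added to the regular token embeddings
```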

3,647 steps at 164.5 ms/step; loss was still dropping at the wallclock cap. Exported artifact: 15.9 MB.
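
The mixed-bit export might look like the following sketch: symmetric per-tensor quantization to int5/int6 code ranges, stored as int8 and compressed with zstd, which exploits the narrowed value range. The submission's actual bit-packing and the `BITS_FOR` mapping below are assumptions:

```python
import numpy as np
import zstandard as zstd  # pip install zstandard

# Hypothetical per-tensor bit widths keyed on substrings of parameter names.
BITS_FOR = {"mlp": 5, "attn": 6, "bigram": 6, "byte": 6}

def quantize(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Symmetric quantization to a signed `bits`-wide integer range."""
    qmax = 2 ** (bits - 1) - 1           # 15 for int5, 31 for int6
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def export(tensors: dict[str, np.ndarray]) -> bytes:
    blobs = []
    for name, w in tensors.items():
        bits = next((b for key, b in BITS_FOR.items() if key in name), 6)
        q, scale = quantize(w, bits)
        blobs.append(q.tobytes())        # a real export would also store `scale`
    return zstd.ZstdCompressor(level=19).compress(b"".join(blobs))
```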

Ablation Results

Systematic ablation of 12 leaderboard techniques across 3 phases on a single RTX 3080:

  • Phase 1: Tested XSA, Partial RoPE, LN Scale, EMA, and LeakyReLU on a simple baseline. Only LeakyReLU passed (−0.003 bpb); all others hurt due to step-time overhead.
  • Phase 2: Tested 5 additive techniques on the LeakyReLU baseline; all failed.
  • Phase 3: Tested LeakyReLU on the blueprint stack; no benefit. torch.compile(fullgraph=True) generated kernels for F.leaky_relu that used 6x the memory and ran 26x slower than eager mode (repro sketch after this list).
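
A micro-benchmark along these lines would reproduce the Phase 3 comparison. The harness below is a sketch; the 6x/26x figures came from the author's runs and will vary with GPU and PyTorch version:

```python
import time
import torch
import torch.nn.functional as F

x = torch.randn(4096, 4096, device="cuda")
compiled = torch.compile(lambda t: F.leaky_relu(t, 0.01), fullgraph=True)

def bench(fn, iters: int = 100) -> float:
    """Average milliseconds per call on the GPU."""
    for _ in range(10):                  # warmup; first call also compiles
        fn(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

print(f"eager:    {bench(lambda t: F.leaky_relu(t, 0.01)):.3f} ms")
print(f"compiled: {bench(compiled):.3f} ms")
# Memory can be compared the same way via torch.cuda.max_memory_allocated().
```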

Key finding: Most 8xH100-proven techniques fail on a single GPU because their per-step overhead reduces the total number of training steps that fit within the 10-minute cap. On challenge hardware (~7,100 steps vs 3,647 here), techniques like XSA, EMA, and Partial RoPE should provide meaningful gains.

Hardware Note

Trained on 1x NVIDIA RTX 3080 (12 GB), not 8xH100. The score is not competitive with the leaderboard, but the run demonstrates a viable technique stack and a systematic research methodology.

…(1xRTX 3080)

val_bpb 1.5568 on single RTX 3080 (12 GB). 10L blueprint stack with
progressive sequence length, grouped multi-scale RoPE, byte-level token
embeddings, and mixed-bit export. 3647 steps in 10 min, loss still dropping.

Includes ablation results: 12 techniques tested across 3 phases. Most
8xH100-proven techniques hurt on single GPU due to step-time overhead.
