
Non-record: Blueprint Stack + ProgSeq + Multi-scale RoPE + ByteEmbed — val_bpb 1.5568 (1xRTX 3080) #1411

Open

Blakethefn wants to merge 1 commit into openai:main from Blakethefn:submission/blueprint-stack-1gpu

Conversation

@Blakethefn

Summary

Non-record submission exploring a combined technique stack on a single NVIDIA RTX 3080 (12 GB).

val_bpb: 1.5568 (quantized int8+zstd roundtrip, single seed, 10 minutes of training)

Architecture

10 layers, 512 model dim, GQA with 8 query heads / 4 KV heads, 3x MLP (relu²), tied embeddings, U-Net skip connections, Muon + Adam optimizers, SWA. Additions over baseline, each sketched below:

  • Progressive sequence length: 512 from the start, 1024 at 35% of the run, 2048 at 70%
  • Multi-scale RoPE by KV group: rotary bases [1K, 10K, 100K, 1M], one per KV group
  • Byte-level token embeddings: a dim-64 side channel built from each token's UTF-8 bytes
  • Mixed-bit quantization: int5 MLP, int6 attn/bigram/byte weights, then zstd
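
A minimal sketch of the progressive sequence-length schedule, reading the milestones above as (fraction of the run, sequence length) pairs; the helper name `seq_len_at` is illustrative, not from the submission's code:

```python
# Progressive sequence length: (progress fraction, seq len) milestones
# taken from the list above. `seq_len_at` is a hypothetical helper name.
SEQ_SCHEDULE = [(0.00, 512), (0.35, 1024), (0.70, 2048)]

def seq_len_at(progress: float) -> int:
    """Sequence length for training progress in [0, 1]."""
    seq_len = SEQ_SCHEDULE[0][1]
    for start, length in SEQ_SCHEDULE:
        if progress >= start:
            seq_len = length  # keep the latest milestone we've passed
    return seq_len

assert seq_len_at(0.0) == 512 and seq_len_at(0.5) == 1024 and seq_len_at(0.9) == 2048
```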
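
The grouped multi-scale RoPE could look like the sketch below, assuming head_dim = 512 / 8 = 64 and one rotary base per KV group (4 groups for the 8H/4KV config); how the rotations are wired into attention is not specified in the PR:

```python
import torch

# One rotary base per KV group, so each group of query heads attends at a
# different positional scale. Bases come from the list above; the function
# name `multi_scale_inv_freq` is illustrative.
ROPE_BASES = [1_000, 10_000, 100_000, 1_000_000]

def multi_scale_inv_freq(head_dim: int = 64) -> torch.Tensor:
    """Inverse rotary frequencies, shape (num_kv_groups, head_dim // 2)."""
    exponents = torch.arange(0, head_dim, 2).float() / head_dim  # (32,)
    bases = torch.tensor(ROPE_BASES, dtype=torch.float32)        # (4,)
    return 1.0 / bases[:, None] ** exponents[None, :]            # (4, 32)

inv_freq = multi_scale_inv_freq()
pos = torch.arange(2048).float()
angles = pos[None, :, None] * inv_freq[:, None, :]  # (group, pos, freq)
cos, sin = angles.cos(), angles.sin()  # rotate q/k per KV group with these
```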
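
For the byte-level side channel, one plausible reading is a per-token table of UTF-8 bytes whose pooled dim-64 embedding is projected and added to the token embedding. Only the dim-64 side channel is stated in the PR; the pooling, padding, and injection point below are assumptions:

```python
import torch
import torch.nn as nn

class ByteSideChannel(nn.Module):
    """Hypothetical byte-level embedding side channel (dim 64 per the PR)."""

    def __init__(self, token_strings: list[str], d_model: int,
                 byte_dim: int = 64, max_bytes: int = 16):
        super().__init__()
        # Precompute each token's UTF-8 bytes, shifted by 1 so 0 = padding.
        table = torch.zeros(len(token_strings), max_bytes, dtype=torch.long)
        mask = torch.zeros(len(token_strings), max_bytes)
        for i, s in enumerate(token_strings):
            bs = s.encode("utf-8")[:max_bytes]
            table[i, :len(bs)] = torch.tensor(list(bs)) + 1
            mask[i, :len(bs)] = 1.0
        self.register_buffer("table", table)
        self.register_buffer("mask", mask)
        self.byte_emb = nn.Embedding(257, byte_dim, padding_idx=0)
        self.proj = nn.Linear(byte_dim, d_model, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.byte_emb(self.table[token_ids])   # (..., max_bytes, byte_dim)
        mask = self.mask[token_ids].unsqueeze(-1)    # (..., max_bytes, 1)
        pooled = (emb * mask).sum(-2) / mask.sum(-2).clamp(min=1.0)
        return self.proj(pooled)  # added to the regular token embeddings
```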

3,647 steps at 164.5 ms/step; loss was still dropping at the wallclock cap. Exported artifact: 15.9 MB.
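
The mixed-bit export might look like the following sketch: symmetric per-tensor quantization to int5/int6 code ranges, stored as int8 and compressed with zstd, which exploits the narrowed value range. The submission's actual bit-packing and the `BITS_FOR` mapping below are assumptions:

```python
import numpy as np
import zstandard as zstd  # pip install zstandard

# Hypothetical per-tensor bit widths keyed on substrings of parameter names.
BITS_FOR = {"mlp": 5, "attn": 6, "bigram": 6, "byte": 6}

def quantize(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Symmetric quantization to a signed `bits`-wide integer range."""
    qmax = 2 ** (bits - 1) - 1           # 15 for int5, 31 for int6
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def export(tensors: dict[str, np.ndarray]) -> bytes:
    blobs = []
    for name, w in tensors.items():
        bits = next((b for key, b in BITS_FOR.items() if key in name), 6)
        q, scale = quantize(w, bits)
        blobs.append(q.tobytes())        # a real export would also store `scale`
    return zstd.ZstdCompressor(level=19).compress(b"".join(blobs))
```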

Ablation Results

Systematic ablation of 12 leaderboard techniques across 3 phases on a single RTX 3080:

  • Phase 1: Tested XSA, Partial RoPE, LN Scale, EMA, and LeakyReLU on a simple baseline. Only LeakyReLU passed (−0.003 bpb); all others hurt due to step-time overhead.
  • Phase 2: Tested 5 additive techniques on the LeakyReLU baseline; all failed.
  • Phase 3: Tested LeakyReLU on the blueprint stack; no benefit. torch.compile(fullgraph=True) generated kernels for F.leaky_relu that used 6x the memory and ran 26x slower than eager mode (repro sketch after this list).
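
A micro-benchmark along these lines would reproduce the Phase 3 comparison. The harness below is a sketch; the 6x/26x figures came from the author's runs and will vary with GPU and PyTorch version:

```python
import time
import torch
import torch.nn.functional as F

x = torch.randn(4096, 4096, device="cuda")
compiled = torch.compile(lambda t: F.leaky_relu(t, 0.01), fullgraph=True)

def bench(fn, iters: int = 100) -> float:
    """Average milliseconds per call on the GPU."""
    for _ in range(10):                  # warmup; first call also compiles
        fn(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

print(f"eager:    {bench(lambda t: F.leaky_relu(t, 0.01)):.3f} ms")
print(f"compiled: {bench(compiled):.3f} ms")
# Memory can be compared the same way via torch.cuda.max_memory_allocated().
```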

Key finding: Most 8xH100-proven techniques fail on a single GPU because their per-step overhead reduces the total number of training steps that fit within the 10-minute cap. On challenge hardware (~7,100 steps vs 3,647 here), techniques like XSA, EMA, and Partial RoPE should provide meaningful gains.

Hardware Note

Trained on 1x NVIDIA RTX 3080 (12 GB), not 8xH100. The score is not competitive with the leaderboard, but the run demonstrates a viable technique stack and a systematic research methodology.

…(1xRTX 3080)

val_bpb 1.5568 on single RTX 3080 (12 GB). 10L blueprint stack with
progressive sequence length, grouped multi-scale RoPE, byte-level token
embeddings, and mixed-bit export. 3647 steps in 10 min, loss still dropping.

Includes ablation results: 12 techniques tested across 3 phases. Most
8xH100-proven techniques hurt on single GPU due to step-time overhead.
