
Submission/qat bigram12k stride32 #348

Open
EthanYangTW wants to merge 15 commits into openai:main from EthanYangTW:submission/qat-bigram12k-stride32

Conversation

@EthanYangTW

@EthanYangTW EthanYangTW commented Mar 21, 2026

QAT + BigramHash(12288) + Stride 32 — 1.1444 bpb


Summary

  • QAT with STE (int5 MLP / int6 attn) reduces post-quantization degradation
  • BigramHash increased from 10240 to 12288
  • Eval stride reduced from 64 to 32
  • Magnitude pruning 5%, SWA every 25 steps
  • Artifact: 15.90MB
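The QAT scheme in the first bullet can be sketched with a straight-through estimator (STE): weights are fake-quantized onto an integer grid in the forward pass, while the non-differentiable rounding is bypassed in the backward pass. A minimal sketch assuming symmetric per-tensor quantization; `fake_quant_ste` is an illustrative name, not the submission's actual code:

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator.

    Forward: w is rounded onto a signed integer grid and dequantized.
    Backward: the rounding is bypassed, so gradients flow to w unchanged.
    """
    qmax = 2 ** (bits - 1) - 1                       # 15 for int5, 31 for int6
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    # STE trick: forward value is w_q, but the graph sees the identity w.r.t. w
    return w + (w_q - w).detach()

# During training, int5 MLP weights would pass through fake_quant_ste(w, 5)
# and int6 attention weights through fake_quant_ste(w, 6) before each matmul.
w = torch.randn(4, 4, requires_grad=True)
fake_quant_ste(w, 5).sum().backward()
assert torch.allclose(w.grad, torch.ones_like(w))    # gradient passes straight through
```

Training against the quantized forward pass is what reduces the post-quantization degradation the bullet refers to: the weights settle into values that survive the int5/int6 rounding.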

Results

  • seed=2024: val_bpb=1.14443
  • 8xH100 SXM, 6549 steps in 600s

Base

Built on records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50

Based on SOTA (10L_Int5MLP_MuonWD04_SWA50) with improvements:
- QAT with STE for int5/int6 quantization-aware training
- BigramHash increased from 10240 to 12288
- Eval stride reduced from 64 to 32 for better context
- Magnitude pruning increased from 3% to 5%
- SWA every 25 steps instead of 50
- Artifact size: ~15.89MB (under 16MB limit)
Restore original train_gpt.py baseline. Add new records folder with
submission script based on 10L_Int5MLP_MuonWD04_SWA50 SOTA.

Changes: QAT with STE, BigramHash 12288, eval stride 32,
5% magnitude pruning, SWA every 25 steps.
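The stride change matters because in a sliding-window eval each scored token sees at least `window - stride` tokens of left context. A rough sketch of the window arithmetic; the window size of 1024 and the helper name are assumptions, not taken from train_gpt.py:

```python
def sliding_windows(n_tokens: int, window: int = 1024, stride: int = 32):
    """Yield (start, end, n_scored) spans for sliding-window evaluation.

    The first window scores all of its tokens; each later window re-reads
    `window - stride` tokens of context and scores only its last `stride`
    tokens, so a smaller stride gives every scored token more left context.
    """
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        if start == 0:
            n_scored = end
        else:
            n_scored = min(stride, end - (start + window - stride))
        spans.append((start, end, n_scored))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(2048, window=1024, stride=32)
assert sum(s[2] for s in spans) == 2048   # every token scored exactly once
```

The trade-off: halving the stride roughly doubles the number of forward passes, so the extra context is paid for in eval wall-clock time.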
Port LoRA TTT from records/2026-03-17_LoRA_TTT into our submission.
At eval time, per-document rank-8 LoRA adapters are trained on Q/V
projections and lm_head, then used for scoring. Expected -0.003 to
-0.005 bpb improvement on top of sliding window eval.
val_bpb=1.14443 (seed=2024), artifact=15.90MB
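The per-document adapters described above follow the standard LoRA shape: a frozen base projection plus a trainable low-rank delta. A sketch assuming rank-8 adapters on a generic linear layer; the class name and hyperparameters are illustrative, not the submission's TTT loop:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable rank-r adapter: y = Wx + (alpha/r) * B(Ax).

    At eval time a fresh adapter would be trained per document on the Q/V
    projections and lm_head, then used for scoring that document.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)      # only the adapter is trained at eval time
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t()) @ self.B.t()

base = nn.Linear(64, 64)
lora = LoRALinear(base, r=8)
x = torch.randn(2, 64)
# B starts at zero, so before any TTT steps the adapter is an exact no-op
assert torch.allclose(lora(x), base(x))
```

The zero-initialized `B` matters for TTT: scoring starts from exactly the trained model, and the few SGD steps per document can only move away from it as far as the val data warrants.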
Copilot AI review requested due to automatic review settings March 21, 2026 15:22
Contributor

Copilot AI left a comment


Pull request overview

Adds a new /records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32 submission artifact folder capturing a training run and the exact code/config used for a QAT + BigramHash(12K) + stride-32 sliding-window evaluation entry.

Changes:

  • Adds a full train_gpt.py snapshot implementing QAT (STE fake-quant), BigramHash embeddings, mixed int5/int6 quantization, pruning, SWA, and sliding-window eval.
  • Adds a train_seed2024.log run log and a short README.md describing the approach/results.
  • Adds submission.json metadata (reported val_loss/bytes_total/date/author).

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

File Description
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py Training + quantization + export + eval script snapshot for this record submission.
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_seed2024.log Captured training/eval log for seed 2024 and reported final metrics.
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/submission.json Leaderboard metadata for the submission (name, loss, size, blurb).
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/README.md Human-readable summary, config highlights, and run command.


- **BigramHash 12288:** Increased from 10240 to 12288 buckets for better bigram coverage.
- **Eval stride 32:** Reduced from 64 to 32 for more overlapping context windows during evaluation.
- **Magnitude pruning 5%:** Increased from 3% to improve compression ratio.
- **SWA every 25 steps:** More frequent checkpoint averaging during warmdown.

Copilot AI Mar 21, 2026


README claims “SWA every 25 steps”, but the actual default in train_gpt.py is swa_every = ... 50 (and the included log shows SWA starting at step 5400, consistent with 50-step cadence). Please either update the README to match the code, or change the default/command/env vars so the run truly uses SWA every 25 steps and regenerate the log/metrics accordingly.

Suggested change
- **SWA every 25 steps:** More frequent checkpoint averaging during warmdown.
- **SWA every 50 steps:** Checkpoint averaging during warmdown.

"name": "QAT + BigramHash(12288) + Stride 32",
"val_loss": 1.14443,
"bytes_total": 15902583,
"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",

Copilot AI Mar 21, 2026


submission.json blurb states “SWA every 25 steps”, but train_gpt.py defaults to SWA_EVERY=50. For reproducibility, please align the blurb with the actual run configuration (or adjust the code/run to match the blurb and update the reported metrics if they change).

Suggested change
"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",

Comment on lines +25 to +26
except ImportError:
_COMPRESSOR = "zlib"

Copilot AI Mar 21, 2026


The script silently falls back to zlib when zstandard isn’t installed, but the record metadata/README call out “zstd-22” and the reported bytes_total depends on the compressor. For reproducibility, consider failing fast when zstandard is missing (or at least reflecting the fallback clearly in README/submission metadata and the logged size label).

Suggested change
except ImportError:
_COMPRESSOR = "zlib"
except ImportError as exc:
raise RuntimeError(
"The `zstandard` package is required for this script to run reproducibly "
"with the documented 'zstd-22' compression. Please install it with "
"`pip install zstandard` and try again."
) from exc

Comment thread records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py Outdated
EthanYangTW and others added 2 commits March 21, 2026 23:27
…/train_gpt.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
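The "EMA replaces SWA" change above, combined with the later "EMA update every 10 steps with adjusted decay" commit, can be illustrated with compounded decay: updating every N steps with `decay**N` matches a per-step EMA exactly when parameters are constant over the interval (an approximation in practice, used to cut CPU overhead). Names are illustrative:

```python
def ema_update(ema, params, decay=0.997, every=1):
    """In-place EMA step over a flat list of scalar parameters.

    When called only every `every` steps, the decay is compounded to
    decay**every so the effective per-step decay rate is preserved
    (exact for parameters that are constant within the interval).
    """
    d = decay ** every
    for i, p in enumerate(params):
        ema[i] = d * ema[i] + (1.0 - d) * p

# Per-step EMA vs. the same EMA applied every 10 steps, on a constant signal
ema_a, ema_b = [0.0], [0.0]
for step in range(1, 101):
    ema_update(ema_a, [1.0], decay=0.997, every=1)
    if step % 10 == 0:
        ema_update(ema_b, [1.0], decay=0.997, every=10)
assert abs(ema_a[0] - ema_b[0]) < 1e-9
```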
…y 10 steps

- Disable FA3 (SDPA faster for GQA on PyTorch 2.9)
- BigramHash 10240 -> 8192 to fit 11L under 16MB
- EMA update every 10 steps with adjusted decay to reduce CPU overhead
- Simplify attention forward (remove FA3 code path)
Previous run: 16.94MB with BigramHash 8192 + 5% pruning.
BigramHash 2048 saves ~0.5MB, 10% pruning improves compression further.
v3 was 16.38MB with BigramHash 2048 + 10% pruning.
Removing BigramHash saves ~0.15MB, 15% pruning improves zstd compression.
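The pruning-for-compression interplay in these commits can be sketched in a few lines: magnitude pruning zeroes the smallest-magnitude weights, and the resulting zero runs are what zstd exploits to shrink the artifact. A toy sketch (list-based for clarity; the real code would operate on tensors):

```python
def magnitude_prune(weights, frac=0.15):
    """Zero out roughly the smallest-|w| fraction of weights.

    Exact zeros compress far better under zstd than near-zero floats,
    which is why raising the pruning fraction shrank the artifact.
    Ties at the threshold are kept, so slightly fewer than frac*len
    entries may be zeroed.
    """
    if not weights:
        return []
    k = int(len(weights) * frac)
    threshold = sorted(abs(w) for w in weights)[k] if k > 0 else 0.0
    return [0.0 if abs(w) < threshold else w for w in weights]

w = [0.01, -0.5, 0.02, 1.2, -0.03, 0.9, -0.005, 0.4, 0.07, -1.1]
pruned = magnitude_prune(w, frac=0.2)
assert pruned.count(0.0) == 2   # the two smallest-magnitude entries are zeroed
```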
Fork of unnir's openai#374 (1.1246 BPB) with TTT added:
- 11L, XSA4, Partial RoPE 16/64, LN Scale, Tight SWA
- Shared VE128, SmearGate, BigramHash 2048
- TTT: 25 epochs SGD on val data post-quantization
- Trimmed to 1476 lines (under 1500 limit)
Previous TTT took 7+ min per epoch (uncompiled, single GPU).
Now: torch.compile + DDP across 8 GPUs + 3 epochs + batch 64.
Should finish in ~2-3 min total.
flash_attn_interface (FA3 Hopper) not available on RunPod.
Falls back to flash_attn, then SDPA with GQA support.
@MatoTeziTanka

Community Review — Submission/qat bigram12k stride32

BPB: 1.1444 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA e83a2778a4b5, file records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.02s, dim=512, layers=11, vocab=1024, code=60407 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
