
Record: LongCtx No-QV QK5.25 + AsymLogit + LQER g32/top4 + TTT-local 0.80 — 1.05792 BPB 3-seed mean #2060

Open
S0urC10ud wants to merge 2 commits into openai:main from S0urC10ud:submission/noqv-lqer-g32-top4-tttlocal080-v2

Conversation

@S0urC10ud

Summary

This PR adds a 10-minute / 16 MB-track record based on a five-knob hyperparameter retune of the LongCtx No-QV QK5.25 + AsymLogit configuration from PR #2007.
The submission keeps the parent architecture, optimizer, dataset, tokenizer, TTT/eval pipeline, quantizer, and compressor unchanged — train_gpt.py is byte-identical to #2007 (md5 2a7e36e29aa5b5811abb6170059aa8d1). Only five env-var scalars are retuned.

Record folder:

records/track_10min_16mb/2026-05-01_LongCtx_NoQV_QK525_AsymLogit_LQERg32top4_TTTlocal080_1.0579/

Results

| Seed | Stop step | Train time | Final TTT BPB | Artifact bytes |
|---|---|---|---|---|
| 42 | 4868 | 596.142 s | 1.05781454 | 15,971,753 |
| 0 | 4861 | 595.821 s | 1.05798212 | 15,971,492 |
| 1234 | 4873 | 595.991 s | 1.05796494 | 15,971,748 |
| **Mean** | | | 1.05792053 | |
| **Std** | | | 0.00007528 | |

All seeds satisfy the 10-minute / 16 MB rules:

  • train_wallclock_s ≤ 600 ✓ (595.8 – 596.1 s)
  • TTT phased eval_time_s ≤ 600 ✓ (395.4 – 397.6 s)
  • Total submission size ≤ 16,000,000 B ✓ (15,971,492 – 15,971,753 B)
  • Artifact slack ≥ 28,247 B on every seed
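For a single seed, the three gates above reduce to a few comparisons. A minimal sketch (the dict field names are hypothetical — the record JSON schema isn't shown in this PR — and the values are seed 42's from the results table):

```python
# Compliance check for one seed's metrics (field names are illustrative)
seed_metrics = {
    "train_wallclock_s": 596.142,   # seed 42, from the results table
    "eval_time_s": 397.6,           # upper end of the quoted TTT eval range
    "artifact_bytes": 15_971_753,   # seed 42 artifact size
}

CAP_SECONDS, CAP_BYTES = 600, 16_000_000
assert seed_metrics["train_wallclock_s"] <= CAP_SECONDS
assert seed_metrics["eval_time_s"] <= CAP_SECONDS
assert seed_metrics["artifact_bytes"] <= CAP_BYTES
print("artifact slack:", CAP_BYTES - seed_metrics["artifact_bytes"], "bytes")
```

Seed 42 is the largest artifact of the three, so its slack (28,247 B) is the minimum quoted above.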

What changed vs parent #2007

Five env-var deltas only:

| Knob | Parent #2007 | This PR | Direction |
|---|---|---|---|
| MATRIX_LR | 0.026 | 0.028 | slightly higher matrix LR |
| LQER_RANK | 4 | 2 | half-rank LQER correctors |
| LQER_ASYM_GROUP | 64 | 32 | finer asym-quant groups |
| LQER_TOP_K | 3 | 4 | one extra top-K corrector slot |
| TTT_LOCAL_LR_MULT | 0.75 | 0.80 | slightly hotter local TTT step |
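Expressed as environment overrides, the whole delta is five lines (a sketch assembled from the table above; variable names are exactly the table's rows, layered on the otherwise unchanged #2007 environment):

```shell
# The five retuned scalars; everything else stays at the #2007 values
export MATRIX_LR=0.028
export LQER_RANK=2
export LQER_ASYM_GROUP=32
export LQER_TOP_K=4
export TTT_LOCAL_LR_MULT=0.80
```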

Comparison vs parent #2007 (paired, same 3 seeds)

| Seed | Parent #2007 BPB | This PR BPB | Δ BPB |
|---|---|---|---|
| 42 | 1.05857451 | 1.05781454 | −0.00076 |
| 0 | 1.05915199 | 1.05798212 | −0.00117 |
| 1234 | 1.05924929 | 1.05796494 | −0.00128 |
| **Mean** | 1.05899193 | 1.05792053 | **−0.00107** |

Paired one-sided t-test: mean Δ_loss = −0.00234 nats, t = −6.73, p ≈ 0.011.
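The paired statistics can be reproduced from the per-seed BPB values alone (a sketch; the PR ran the test on nats, but the t statistic is scale-invariant, so BPB deltas give the same value up to rounding, and for df = 2 the Student-t CDF has a closed form, so no stats library is needed):

```python
import math

# Per-seed BPB from the comparison table (seeds 42, 0, 1234)
parent = [1.05857451, 1.05915199, 1.05924929]  # PR #2007
ours   = [1.05781454, 1.05798212, 1.05796494]  # this PR

deltas = [b - a for a, b in zip(parent, ours)]
n = len(deltas)
mean = sum(deltas) / n
var = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
t = mean / math.sqrt(var / n)

# One-sided p-value; for df = n-1 = 2, P(T <= t) = 1/2 + t / (2*sqrt(t^2 + 2))
p = 0.5 + t / (2 * math.sqrt(t * t + 2))
print(f"mean dBPB = {mean:.5f}, t = {t:.2f}, one-sided p = {p:.4f}")
```

This gives t ≈ −6.74 and p ≈ 0.011, matching the quoted test within rounding of the listed BPB values.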

Comparison vs currently-merged SOTA #1493 (1.0810)

Δ_BPB ≈ −0.0231, Δ_nats ≈ −0.051. Every individual seed improves by ≥ 0.022 BPB, far above the 0.005-nat record threshold.
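The BPB-to-nats conversion scales bits/byte by ln(2) and by the eval text's mean bytes per token. The mean bytes/token is not stated in the PR; ≈3.19 is assumed here, back-solved from the quoted pair of deltas:

```python
import math

delta_bpb = -0.0231        # improvement vs merged SOTA #1493, bits per byte
bytes_per_token = 3.19     # ASSUMED mean bytes/token (back-solved, not from the PR)

# nats/token = (bits/byte) * ln(2) [nats per bit] * (bytes/token)
delta_nats = delta_bpb * math.log(2) * bytes_per_token
print(f"delta_nats ~= {delta_nats:.3f}")
```

With that assumption the quoted Δ_nats ≈ −0.051 falls out directly.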

Method

The frozen parent recipe (unchanged here):

  • CaseOps/SP8192 tokenization with byte-sidecar BPB accounting.
  • Sparse attention gating, BOS-fixed SmearGate, skip gates, LQER correction, int7 embeddings, and mixed-precision GPTQ + AWQ-lite.
  • 2560-token eval and TTT windows.
  • No-QV TTT masking, keeping K/O/MLP adaptation active.
  • TTT_LORA_RANK=80, PHASED_TTT_PREFIX_DOCS=3000.
  • QK_GAIN_INIT=5.25, WARMDOWN_FRAC=0.85, MIN_LR=0.1.
  • Eval-only asymmetric logit rescale.
  • Per-group lrzip -L 9 compression.
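The byte-sidecar BPB accounting in the first bullet amounts to dividing summed cross-entropy by the raw byte count of the eval text rather than by the token count, so tokenizer choice cannot deflate the metric. A minimal sketch with made-up toy numbers (the real accounting lives in train_gpt.py):

```python
import math

def bits_per_byte(total_loss_nats: float, total_utf8_bytes: int) -> float:
    # Byte-sidecar accounting: the denominator is the raw UTF-8 byte count
    # of the eval text, not the number of tokens.
    return total_loss_nats / (math.log(2) * total_utf8_bytes)

# Toy numbers, purely illustrative: 7.3e6 summed nats over 10 MB of text
print(f"{bits_per_byte(7.3e6, 10_000_000):.4f}")
```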

The five-knob retune was chosen by an MN5 single-node sweep on top of the #2007 parent stack.

Reproduction

Prepare the CaseOps dataset once:

```shell
python prepare_caseops_data.py --local-dir /workspace/caseops_data
```

Run a seed from this folder:

```shell
SEED=42 \
CASEOPS_ROOT=/workspace/caseops_data \
RUN_ID=longctx_noqv_qk525_asym_lqer_g32_top4_tttlocal080_seed42 \
./run_current_candidate.sh
```

The script sets the full environment and runs:

```shell
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Repeat with SEED=0 and SEED=1234 for the matched 3-seed validation.

Logs

  • train_seed42.log — final BPB 1.05781454
  • train_seed0.log — final BPB 1.05798212
  • train_seed1234.log — final BPB 1.05796494

Hardware / software

3-seed mean val_bpb 1.05769 (std 0.00041) on 8xH100 80GB SXM. Forks
PR openai#2007 (1.0590); env-only retune of MATRIX_LR=0.028, LQER_RANK=2,
LQER_TOP_K=4, LQER_ASYM_GROUP=32, TTT_LOCAL_LR_MULT=0.80. train_gpt.py
byte-identical to parent. Improves merged SOTA openai#1493 (1.0810) by
~0.023 BPB / ~0.051 nats; paired vs openai#2007 yields p=0.002 but only
0.00293-nat magnitude (below 0.005-nat bar) so non-record vs openai#2007.
Corrected seed-42 Final TTT BPB: 1.05781454 (was 1.05711454).
3-seed mean: 1.05792053, std: 0.00007528.
Directory renamed 1.0577 -> 1.0579 to match corrected mean.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Recent sweep logs (named):
S55: token-only ngram tilt baseline = 1.05814 (legal per PR openai#1514)
S56: + 3 openai#2060 levers = 1.05790 (-0.00024)
S57: + AsymLogit only = 1.05759 (-0.00055)
S58: full stack = 1.05694 single seed (-0.00120, super-additive +0.00041 synergy)
S59: S58 + EVAL_SEQ_LEN=3072 + NUM_PHASES=1 + WD=1.0 = 1.05657 single seed, eval 567s
S60 OOM: S59 + EMA_DECAY=0.9 + batch=64 = OOM
S60 retry: S58 + EMA_DECAY=0.9 + batch=32 = 1.05795 / 832s NON-COMPLIANT
S61: S59 + TOKEN_BOOST=3.0 = 1.05678 single seed, eval 501s
S62: S58 + NUM_PHASES=2 + WD=2.0 + eval=2816 = 1.05755

Earlier sweep logs (UUID-named): ~83 files covering S15-S54 sprint history.

Key findings:
- AsymLogit Rescale: 2 trainable scalars (softcap_pos, softcap_neg) give -0.00055 via global TTT polish
- Token-only n-gram tilt confirmed legal per PR openai#1514 (within_tau=99, word_tau=99, agree=0)
- 3 openai#2060 env-var levers (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5) stack super-additively
- EMA_DECAY=0.9 didn't transfer to our base
- NUM_PHASES=2 revert costs more pre-quant than it gains in TTT recovery
- Discovered val_tokens=47852544 vs canonical 47853343, need EVAL_INCLUDE_TAIL=1 for clean comparison

Added .gitignore for final_model.pt (130MB - over GitHub limit), .so binaries, pid files.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request May 1, 2026
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB.
Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB.
Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:
- Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE eval timer per PR openai#1514)

Compliance:
- All seeds eval ≤533.1s (cap 600s, 67-80s margin)
- All artifacts ≤15.95MB (cap 16MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
