Submission/sp8192 depthrecur adamwttt #1619
Conversation
Architecture: 10 layers, 512 dim, GQA 8h/4kv, MLP 3x (1536), LeakyReLU(0.5)^2
Training: Muon WD=0.04, EMA(0.997), Int6 STE QAT at 80%, LZMA preset=9
Eval: Sliding window stride=256 on EMA weights post int6 roundtrip
Estimated BPB: ~1.15-1.16 (pending GPU run)
… — actual submission is 11L)
Pull request overview
Adds new 10min_16mb training/eval scripts and record metadata for two runs, including a new SP8192 + depth recurrence + doc-level LoRA test-time-training (TTT) variant and a prior FullGPTQ/XSA/BigramHash baseline with run scripts.
Changes:
- Add a new 2026-04-14 SP8192 model/training pipeline including depth recurrence, parallel residual blocks, and document-aware AdamW LoRA TTT evaluation.
- Add submission metadata and README documentation for the 2026-04-14 record.
- Add a 2026-04-08 baseline record including train script, submission metadata, README, and runnable shell scripts.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/train_gpt.py | New end-to-end training + eval + GPTQ quantization pipeline with added depth recurrence, parallel residuals, and doc-aware LoRA TTT eval. |
| records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/submission.json | Adds submission metadata for the new SP8192/depth-recur/doc-TTT run. |
| records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/README.md | Documents architecture/training/quantization/eval claims for the new run. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/train_gpt.py | Adds baseline training + GPTQ pipeline script for the 2026-04-08 record. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/submission.json | Adds baseline submission metadata with results and environment details. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/run_smoke_1gpu.sh | Adds 1-GPU smoke test runner for the baseline pipeline. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/run_leaderboard_8xh100.sh | Adds 8×H100 leaderboard runner for the baseline pipeline. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/README.md | Documents baseline architecture/training/quantization and how to run. |
```markdown
# SP8192 + Depth Recurrence + VarLen Attention + Doc-LoRA TTT

## Architecture

- **Tokenizer**: SP8192 (SentencePiece BPE 8192 vocab) — 8x larger vocab for better token efficiency
- **Model**: 11 layers, 512d, 8 heads / 4 KV heads (GQA), MLP 3x (LeakyReLU(0.5)^2)
```
The README title/architecture mentions “VarLen Attention”, but the training script doesn’t use any varlen attention kernel (and _flash_attn_varlen_func is unused). This makes the described feature set misleading; please either implement varlen attention or update the README to reflect the actual attention implementation.
Suggested change:
```diff
-# SP8192 + Depth Recurrence + VarLen Attention + Doc-LoRA TTT
+# SP8192 + Depth Recurrence + Doc-LoRA TTT
 ## Architecture
 - **Tokenizer**: SP8192 (SentencePiece BPE 8192 vocab) — 8x larger vocab for better token efficiency
 - **Model**: 11 layers, 512d, 8 heads / 4 KV heads (GQA), MLP 3x (LeakyReLU(0.5)^2)
+- **Attention implementation**: Standard causal attention; no variable-length attention kernel is currently enabled in training
```
```json
{
  "track": "10min_16mb",
  "tokenizer": "sp8192",
  "architecture": "11L-512d-8H4KV-GQA-MLP3x-UNet-BigramHash3072-XSA11-VE128-DepthRecur-ParResid-DocTTT",
  "features": [
    "SP8192 tokenizer (8192 vocab BPE)",
    "Depth recurrence on layers 3-5 (2 passes, gated blend)",
    "Parallel residuals GPT-J style on layers 7+",
    "QK-Gain 5.25",
    "BigramHash 3072x112 with XOR hash",
    "XSA on all 11 layers",
    "SmearGate temporal smoothing",
    "Value Embedding VE128 on layers 9-10",
    "Partial RoPE 16/64 dims",
    "U-Net encoder-decoder skip connections",
    "Tied embeddings (std=0.005)",
    "Muon optimizer (momentum=0.97, WD=0.095)",
    "EMA decay=0.997 + Tight SWA every 50 steps",
    "Late QAT at LR<15%",
    "Warmdown fraction=0.75",
    "Full Hessian GPTQ int6 (AR calib 64x2048, block=128)",
    "Selective ±1 pruning + LZMA-9",
    "Score-first AdamW Doc-LoRA TTT (chunk=64, rank=8, all layers)"
  ],
  "val_bpb": null,
  "notes": "All code written from scratch. AdamW TTT is novel — nobody else uses AdamW for test-time LoRA adaptation."
}
```
submission.json format appears inconsistent with other records/track_10min_16mb/*/submission.json entries (e.g., 2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/submission.json includes author, github_id, name, date, val_bpb, seed_results, etc.). If this file is consumed by tooling, missing standard fields (and val_bpb: null) may break parsers; consider matching the established schema used by other submissions.
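For illustration only, a minimal sketch of how this record could carry the fields named above; every value and field shape below is a placeholder, and the authoritative schema should be copied from the 2026-04-08 record and the tooling, not from this sketch:

```json
{
  "name": "2026-04-14_SP8192_DepthRecur_VarLen_DocTTT",
  "author": "<author>",
  "github_id": "<github-username>",
  "date": "2026-04-14",
  "track": "10min_16mb",
  "tokenizer": "sp8192",
  "val_bpb": 1.16,
  "seed_results": []
}
```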
```python
try:
    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
    _HAS_FA_VARLEN = True
except ImportError:
    _HAS_FA_VARLEN = False
```
flash_attn_varlen_func is imported and _HAS_FA_VARLEN is set, but the file never uses _flash_attn_varlen_func (the only reference is the import). Either remove the unused import/flag (and any related logging/docs), or add the intended varlen attention path so the “VarLen Attention” claim is accurate.
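For reference, a minimal sketch of what the intended varlen path could look like on top of the import above, assuming per-document lengths are available; `doc_lens`, the `q`/`k`/`v` layout, and the `standard_causal_attention` fallback are illustrative names, not taken from the script:

```python
import torch

def attend(q, k, v, doc_lens):
    # q, k, v: (total_tokens, heads, head_dim), packed across documents.
    # doc_lens: int32 tensor of per-document token counts summing to total_tokens.
    if _HAS_FA_VARLEN:
        cu_seqlens = torch.zeros(doc_lens.numel() + 1, dtype=torch.int32, device=q.device)
        cu_seqlens[1:] = torch.cumsum(doc_lens, dim=0)
        max_seqlen = int(doc_lens.max())
        # Causal attention restricted to each document's span.
        return _flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
            causal=True,
        )
    return standard_causal_attention(q, k, v)  # whatever the script uses today
```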
| """Find document boundaries based on token value wrapping / BOS detection. | ||
| Returns list of (start, end) index pairs.""" | ||
| # Simple heuristic: treat token 0 or 1 as BOS markers, or detect large | ||
| # discontinuities. For FineWeb with SP tokenizer, documents are packed | ||
| # contiguously. We split on token_id == 1 (BOS). | ||
| boundaries = [0] | ||
| t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy() | ||
| for i in range(1, len(t)): | ||
| if t[i] == 1: # BOS token | ||
| boundaries.append(i) |
Document boundary detection is hard-coded to token_id == 1 as BOS. SentencePiece BOS/CLS ids are model-dependent (and some datasets may not insert BOS at all), so this can split documents incorrectly or not at all. Consider using sp.bos_id() (or making BOS id configurable via env var) and/or validating that the packed dataset actually contains boundary markers before enabling doc-level TTT.
| """Find document boundaries based on token value wrapping / BOS detection. | |
| Returns list of (start, end) index pairs.""" | |
| # Simple heuristic: treat token 0 or 1 as BOS markers, or detect large | |
| # discontinuities. For FineWeb with SP tokenizer, documents are packed | |
| # contiguously. We split on token_id == 1 (BOS). | |
| boundaries = [0] | |
| t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy() | |
| for i in range(1, len(t)): | |
| if t[i] == 1: # BOS token | |
| boundaries.append(i) | |
| """Find document boundaries using a configured BOS marker when available. | |
| Returns list of (start, end) index pairs.""" | |
| # Document-level TTT requires reliable boundary markers in the packed token | |
| # stream. SentencePiece BOS ids are tokenizer-dependent, and some datasets | |
| # may not include BOS markers at all, so do not hard-code a token id here. | |
| t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy() | |
| if len(t) == 0: | |
| return [] | |
| bos_id_env = os.getenv("DOC_TTT_BOS_ID") | |
| bos_id = None | |
| if bos_id_env is not None and bos_id_env != "": | |
| try: | |
| bos_id = int(bos_id_env) | |
| except ValueError: | |
| bos_id = None | |
| if bos_id is None: | |
| return [(0, len(t))] | |
| if vocab_size is not None and not (0 <= bos_id < vocab_size): | |
| return [(0, len(t))] | |
| boundaries = [0] | |
| for i in range(1, len(t)): | |
| if int(t[i]) == bos_id: | |
| boundaries.append(i) | |
| # If the packed dataset does not contain the configured BOS marker, treat | |
| # the full span as a single document rather than splitting incorrectly. | |
| if len(boundaries) == 1: | |
| return [(0, len(t))] |
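If the env-var route is taken, the BOS id could be resolved once from the tokenizer model that actually produced the packed data instead of being guessed; a small sketch (the model path is illustrative):

```python
import os
import sentencepiece as spm

# bos_id() returns -1 when the SentencePiece model defines no BOS piece,
# in which case doc-level TTT should fall back to single-document spans.
sp = spm.SentencePieceProcessor(model_file="sp8192.model")  # illustrative path
if sp.bos_id() >= 0:
    os.environ["DOC_TTT_BOS_ID"] = str(sp.bos_id())
```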
```python
    # Split validation tokens across ranks
    rank_start = (total_tokens * rank) // world_size
    rank_end = (total_tokens * (rank + 1)) // world_size
    my_tokens = val_tokens[rank_start:rank_end + 1]

    # Find document boundaries in our shard
    doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
```
eval_val_doc_ttt shards val_tokens by raw token index (rank_start/rank_end) and then detects document boundaries only inside that shard. If a document crosses a shard boundary, ranks will reset LoRA state mid-document, making distributed results differ from single-process evaluation. To keep eval deterministic/correct, partition work by document ranges (global boundary scan) or run doc-TTT eval on rank 0 only and all-reduce the final metrics.
Suggested change:
```diff
-    # Split validation tokens across ranks
-    rank_start = (total_tokens * rank) // world_size
-    rank_end = (total_tokens * (rank + 1)) // world_size
-    my_tokens = val_tokens[rank_start:rank_end + 1]
-    # Find document boundaries in our shard
-    doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
+    # Doc-TTT evaluation must preserve full-document state. Sharding by raw
+    # token index can split a document across ranks, causing LoRA state to be
+    # reset mid-document and making distributed results differ from a
+    # single-process run. To keep evaluation deterministic/correct, run the
+    # document scan on rank 0 over the full validation stream and rely on the
+    # existing metric reduction logic after local accumulation.
+    if rank == 0:
+        my_tokens = val_tokens
+        doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
+    else:
+        my_tokens = val_tokens[:1]
+        doc_ranges = []
```
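If distributed doc-TTT eval should be kept instead, the other option mentioned above is to scan boundaries globally and hand each rank whole documents; a rough sketch using the same helpers and names as the excerpt:

```python
# Every rank scans the full validation stream (deterministic and identical
# everywhere), then takes whole documents round-robin so LoRA state is
# never reset mid-document.
all_doc_ranges = _find_document_boundaries(val_tokens, args.vocab_size)
doc_ranges = [dr for i, dr in enumerate(all_doc_ranges) if i % world_size == rank]
my_tokens = val_tokens  # the (start, end) pairs index into the full stream
```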
```python
    for li, adapter in lora_adapters.items():
        adapter.A.requires_grad_(True)
        adapter.B.requires_grad_(True)

    # Forward with LoRA (differentiable)
    # We apply LoRA by modifying the output projection temporarily
    # and running a small forward pass just for the gradient
    for li, adapter in lora_adapters.items():
        delta = adapter.B @ adapter.A
        base_model.qo_bank.data[n + li] = orig_out_weights[li] + delta.to(orig_out_weights[li].dtype)
```
In the adaptation phase, adapter.A/B.requires_grad_(True) doesn’t actually connect LoRA params to the loss because delta is written into qo_bank via .data, which breaks autograd tracking. Since gradients are being computed manually from qo_bank.grad, it would be clearer/safer to (1) drop the requires_grad_ toggles, and (2) avoid .data in favor of with torch.no_grad(): base_model.qo_bank[n+li].copy_(...) for the weight injection.
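Concretely, the injection step from the snippet above could be written without `.data` (same names as the excerpt):

```python
# Inject the LoRA delta without detaching qo_bank from autograd bookkeeping;
# the manual gradient read from qo_bank.grad afterwards is unchanged.
with torch.no_grad():
    for li, adapter in lora_adapters.items():
        delta = adapter.B @ adapter.A
        base_model.qo_bank[n + li].copy_(
            orig_out_weights[li] + delta.to(orig_out_weights[li].dtype)
        )
```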
```python
    # Zero grads
    if base_model.qo_bank.grad is not None:
        base_model.qo_bank.grad = None
    for p in base_model.parameters():
        if p.grad is not None:
            p.grad = None
```
Clearing gradients by iterating over base_model.parameters() every chunk is very expensive in doc-TTT eval and scales with full model size. Since only qo_bank grads are used, consider disabling requires_grad for all other parameters during TTT and using torch.autograd.grad(adapt_loss, base_model.qo_bank, retain_graph=False) (or base_model.qo_bank.grad = None + zero_grad(set_to_none=True) on just the needed tensors) to avoid per-parameter loops.
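A sketch of the cheaper pattern, assuming `adapt_loss` is the chunk loss already computed in this phase (the LoRA update itself stays as in the script):

```python
# Only qo_bank's gradient is consumed, so compute just that one instead of
# back-propagating into .grad on every parameter and clearing them in a loop.
(qo_grad,) = torch.autograd.grad(adapt_loss, base_model.qo_bank, retain_graph=False)
# ... project qo_grad back onto the LoRA A/B factors exactly as before ...
```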
```markdown
- **EMA**: Exponential moving average with decay=0.997
- **Tight SWA**: Stochastic weight averaging every 50 steps when LR < 20%
- **Late QAT**: Quantization-aware training activated when LR scale < 0.15
- **Warmdown 0.75**: Wall-clock-aware cosine warmdown over 75% of training time
```
README claims a “wall-clock-aware cosine warmdown” but lr_mul() in train_gpt.py implements a linear ramp-down based on remaining warmdown time. Please align the documentation with the actual schedule (or implement cosine warmdown if that’s the intent).
Suggested change:
```diff
-- **Warmdown 0.75**: Wall-clock-aware cosine warmdown over 75% of training time
+- **Warmdown 0.75**: Wall-clock-aware linear warmdown over 75% of training time
```
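For reference, a sketch of how the two wordings differ in code, assuming the schedule is driven by normalized elapsed wall-clock time `t` in [0, 1] with `warmdown_frac = 0.75`; names and the exact ramp target are illustrative, not the script's actual lr_mul():

```python
import math

def lr_mul(t: float, warmdown_frac: float = 0.75, cosine: bool = False) -> float:
    """LR multiplier over normalized wall-clock time t in [0, 1]."""
    warmdown_start = 1.0 - warmdown_frac
    if t < warmdown_start:
        return 1.0
    frac = (t - warmdown_start) / warmdown_frac  # 0 -> 1 across the warmdown window
    if cosine:
        # what the README currently claims
        return 0.5 * (1.0 + math.cos(math.pi * frac))
    # what the reviewed lr_mul() actually does: a linear ramp-down
    return 1.0 - frac
```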
…ward() to prevent gradient flow through distributed model
…; PRISM + Ouroboros papers; Session 13
- Merged SOTA 1.0810 unchanged (5-day plateau; 16 days to deadline)
- PR openai#1610 (romeerp, 1.0728): VarLenAttn + PhasingTTT — legal, score-first compliant, but low EV (-0.0006 bpb)
- PR openai#1619 flagged likely illegal (AdamW TTT — same pattern as rejected PR openai#771)
- PRISM (arXiv:2602.10796, Feb 2026): Parallel Residual Iterative Sequence Model, 174x throughput — read before next recurrence architecture decision
- Ouroboros (arXiv:2604.02051, Apr 2026): input-conditioned LoRA modulation for recursive transformers — watch
- Session 13 added to CLAUDE.md; no strategy change (PR openai#1586 per-layer GPTQ still #1 priority)
- daily_research.md Apr 14 entry added at top

https://claude.ai/code/session_01GLn4VtS8D1uehRZnfb4dRe
Closing in favour of a cleaner replacement PR: new SP8192 + SGD-TTT + SDClip GPTQ + Brotli-11 submission (2026-04-16_SP8192_CleanStack_SGD_TTT). Apr-14 run regressed to 1.1600 BPB due to GPTQ degradation; new PR targets ~1.07-1.08 BPB with the proven SOTA stack.