Submission/sp8192 depthrecur adamwttt #1619
Conversation
Architecture: 10 layers, 512 dim, GQA 8h/4kv, MLP 3x (1536), LeakyReLU(0.5)^2
Training: Muon WD=0.04, EMA(0.997), Int6 STE QAT at 80%, LZMA preset=9
Eval: Sliding window stride=256 on EMA weights post int6 roundtrip
Estimated BPB: ~1.15-1.16 (pending GPU run)
… — actual submission is 11L)
Pull request overview
Adds new 10min_16mb training/eval scripts and record metadata for two runs, including a new SP8192 + depth recurrence + doc-level LoRA test-time-training (TTT) variant and a prior FullGPTQ/XSA/BigramHash baseline with run scripts.
Changes:
- Add a new 2026-04-14 SP8192 model/training pipeline including depth recurrence, parallel residual blocks, and document-aware AdamW LoRA TTT evaluation.
- Add submission metadata and README documentation for the 2026-04-14 record.
- Add a 2026-04-08 baseline record including train script, submission metadata, README, and runnable shell scripts.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/train_gpt.py | New end-to-end training + eval + GPTQ quantization pipeline with added depth recurrence, parallel residuals, and doc-aware LoRA TTT eval. |
| records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/submission.json | Adds submission metadata for the new SP8192/depth-recur/doc-TTT run. |
| records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/README.md | Documents architecture/training/quantization/eval claims for the new run. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/train_gpt.py | Adds baseline training + GPTQ pipeline script for the 2026-04-08 record. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/submission.json | Adds baseline submission metadata with results and environment details. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/run_smoke_1gpu.sh | Adds 1-GPU smoke test runner for the baseline pipeline. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/run_leaderboard_8xh100.sh | Adds 8×H100 leaderboard runner for the baseline pipeline. |
| records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/README.md | Documents baseline architecture/training/quantization and how to run. |
```markdown
# SP8192 + Depth Recurrence + VarLen Attention + Doc-LoRA TTT

## Architecture

- **Tokenizer**: SP8192 (SentencePiece BPE 8192 vocab) — 8x larger vocab for better token efficiency
- **Model**: 11 layers, 512d, 8 heads / 4 KV heads (GQA), MLP 3x (LeakyReLU(0.5)^2)
```
The README title/architecture mentions “VarLen Attention”, but the training script doesn’t use any varlen attention kernel (and _flash_attn_varlen_func is unused). This makes the described feature set misleading; please either implement varlen attention or update the README to reflect the actual attention implementation.
Suggested change:
```diff
-# SP8192 + Depth Recurrence + VarLen Attention + Doc-LoRA TTT
+# SP8192 + Depth Recurrence + Doc-LoRA TTT
 ## Architecture
 - **Tokenizer**: SP8192 (SentencePiece BPE 8192 vocab) — 8x larger vocab for better token efficiency
 - **Model**: 11 layers, 512d, 8 heads / 4 KV heads (GQA), MLP 3x (LeakyReLU(0.5)^2)
+- **Attention implementation**: Standard causal attention; no variable-length attention kernel is currently enabled in training
```
```json
{
  "track": "10min_16mb",
  "tokenizer": "sp8192",
  "architecture": "11L-512d-8H4KV-GQA-MLP3x-UNet-BigramHash3072-XSA11-VE128-DepthRecur-ParResid-DocTTT",
  "features": [
    "SP8192 tokenizer (8192 vocab BPE)",
    "Depth recurrence on layers 3-5 (2 passes, gated blend)",
    "Parallel residuals GPT-J style on layers 7+",
    "QK-Gain 5.25",
    "BigramHash 3072x112 with XOR hash",
    "XSA on all 11 layers",
    "SmearGate temporal smoothing",
    "Value Embedding VE128 on layers 9-10",
    "Partial RoPE 16/64 dims",
    "U-Net encoder-decoder skip connections",
    "Tied embeddings (std=0.005)",
    "Muon optimizer (momentum=0.97, WD=0.095)",
    "EMA decay=0.997 + Tight SWA every 50 steps",
    "Late QAT at LR<15%",
    "Warmdown fraction=0.75",
    "Full Hessian GPTQ int6 (AR calib 64x2048, block=128)",
    "Selective ±1 pruning + LZMA-9",
    "Score-first AdamW Doc-LoRA TTT (chunk=64, rank=8, all layers)"
  ],
  "val_bpb": null,
  "notes": "All code written from scratch. AdamW TTT is novel — nobody else uses AdamW for test-time LoRA adaptation."
}
```
submission.json format appears inconsistent with other records/track_10min_16mb/*/submission.json entries (e.g., 2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/submission.json includes author, github_id, name, date, val_bpb, seed_results, etc.). If this file is consumed by tooling, missing standard fields (and val_bpb: null) may break parsers; consider matching the established schema used by other submissions.
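For illustration only, a minimal sketch of how this record could carry the fields named above; every value and field shape below is a placeholder, and the authoritative schema should be copied from the 2026-04-08 record and the tooling, not from this sketch:

```json
{
  "name": "2026-04-14_SP8192_DepthRecur_VarLen_DocTTT",
  "author": "<author>",
  "github_id": "<github-username>",
  "date": "2026-04-14",
  "track": "10min_16mb",
  "tokenizer": "sp8192",
  "val_bpb": 1.16,
  "seed_results": []
}
```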
```python
try:
    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
    _HAS_FA_VARLEN = True
except ImportError:
    _HAS_FA_VARLEN = False
```
flash_attn_varlen_func is imported and _HAS_FA_VARLEN is set, but the file never uses _flash_attn_varlen_func (the only reference is the import). Either remove the unused import/flag (and any related logging/docs), or add the intended varlen attention path so the “VarLen Attention” claim is accurate.
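For reference, a minimal sketch of what the intended varlen path could look like on top of the import above, assuming per-document lengths are available; `doc_lens`, the `q`/`k`/`v` layout, and the `standard_causal_attention` fallback are illustrative names, not taken from the script:

```python
import torch

def attend(q, k, v, doc_lens):
    # q, k, v: (total_tokens, heads, head_dim), packed across documents.
    # doc_lens: int32 tensor of per-document token counts summing to total_tokens.
    if _HAS_FA_VARLEN:
        cu_seqlens = torch.zeros(doc_lens.numel() + 1, dtype=torch.int32, device=q.device)
        cu_seqlens[1:] = torch.cumsum(doc_lens, dim=0)
        max_seqlen = int(doc_lens.max())
        # Causal attention restricted to each document's span.
        return _flash_attn_varlen_func(
            q, k, v,
            cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
            max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
            causal=True,
        )
    return standard_causal_attention(q, k, v)  # whatever the script uses today
```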
| """Find document boundaries based on token value wrapping / BOS detection. | ||
| Returns list of (start, end) index pairs.""" | ||
| # Simple heuristic: treat token 0 or 1 as BOS markers, or detect large | ||
| # discontinuities. For FineWeb with SP tokenizer, documents are packed | ||
| # contiguously. We split on token_id == 1 (BOS). | ||
| boundaries = [0] | ||
| t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy() | ||
| for i in range(1, len(t)): | ||
| if t[i] == 1: # BOS token | ||
| boundaries.append(i) |
Document boundary detection is hard-coded to token_id == 1 as BOS. SentencePiece BOS/CLS ids are model-dependent (and some datasets may not insert BOS at all), so this can split documents incorrectly or not at all. Consider using sp.bos_id() (or making BOS id configurable via env var) and/or validating that the packed dataset actually contains boundary markers before enabling doc-level TTT.
| """Find document boundaries based on token value wrapping / BOS detection. | |
| Returns list of (start, end) index pairs.""" | |
| # Simple heuristic: treat token 0 or 1 as BOS markers, or detect large | |
| # discontinuities. For FineWeb with SP tokenizer, documents are packed | |
| # contiguously. We split on token_id == 1 (BOS). | |
| boundaries = [0] | |
| t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy() | |
| for i in range(1, len(t)): | |
| if t[i] == 1: # BOS token | |
| boundaries.append(i) | |
| """Find document boundaries using a configured BOS marker when available. | |
| Returns list of (start, end) index pairs.""" | |
| # Document-level TTT requires reliable boundary markers in the packed token | |
| # stream. SentencePiece BOS ids are tokenizer-dependent, and some datasets | |
| # may not include BOS markers at all, so do not hard-code a token id here. | |
| t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy() | |
| if len(t) == 0: | |
| return [] | |
| bos_id_env = os.getenv("DOC_TTT_BOS_ID") | |
| bos_id = None | |
| if bos_id_env is not None and bos_id_env != "": | |
| try: | |
| bos_id = int(bos_id_env) | |
| except ValueError: | |
| bos_id = None | |
| if bos_id is None: | |
| return [(0, len(t))] | |
| if vocab_size is not None and not (0 <= bos_id < vocab_size): | |
| return [(0, len(t))] | |
| boundaries = [0] | |
| for i in range(1, len(t)): | |
| if int(t[i]) == bos_id: | |
| boundaries.append(i) | |
| # If the packed dataset does not contain the configured BOS marker, treat | |
| # the full span as a single document rather than splitting incorrectly. | |
| if len(boundaries) == 1: | |
| return [(0, len(t))] |
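If the env-var route is taken, the BOS id could be resolved once from the tokenizer model that actually produced the packed data instead of being guessed; a small sketch (the model path is illustrative):

```python
import os
import sentencepiece as spm

# bos_id() returns -1 when the SentencePiece model defines no BOS piece,
# in which case doc-level TTT should fall back to single-document spans.
sp = spm.SentencePieceProcessor(model_file="sp8192.model")  # illustrative path
if sp.bos_id() >= 0:
    os.environ["DOC_TTT_BOS_ID"] = str(sp.bos_id())
```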
```python
    # Split validation tokens across ranks
    rank_start = (total_tokens * rank) // world_size
    rank_end = (total_tokens * (rank + 1)) // world_size
    my_tokens = val_tokens[rank_start:rank_end + 1]

    # Find document boundaries in our shard
    doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
```
eval_val_doc_ttt shards val_tokens by raw token index (rank_start/rank_end) and then detects document boundaries only inside that shard. If a document crosses a shard boundary, ranks will reset LoRA state mid-document, making distributed results differ from single-process evaluation. To keep eval deterministic/correct, partition work by document ranges (global boundary scan) or run doc-TTT eval on rank 0 only and all-reduce the final metrics.
Suggested change:
```diff
-    # Split validation tokens across ranks
-    rank_start = (total_tokens * rank) // world_size
-    rank_end = (total_tokens * (rank + 1)) // world_size
-    my_tokens = val_tokens[rank_start:rank_end + 1]
-    # Find document boundaries in our shard
-    doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
+    # Doc-TTT evaluation must preserve full-document state. Sharding by raw
+    # token index can split a document across ranks, causing LoRA state to be
+    # reset mid-document and making distributed results differ from a
+    # single-process run. To keep evaluation deterministic/correct, run the
+    # document scan on rank 0 over the full validation stream and rely on the
+    # existing metric reduction logic after local accumulation.
+    if rank == 0:
+        my_tokens = val_tokens
+        doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
+    else:
+        my_tokens = val_tokens[:1]
+        doc_ranges = []
```
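If distributed doc-TTT eval should be kept instead, the other option mentioned above is to scan boundaries globally and hand each rank whole documents; a rough sketch using the same helpers and names as the excerpt:

```python
# Every rank scans the full validation stream (deterministic and identical
# everywhere), then takes whole documents round-robin so LoRA state is
# never reset mid-document.
all_doc_ranges = _find_document_boundaries(val_tokens, args.vocab_size)
doc_ranges = [dr for i, dr in enumerate(all_doc_ranges) if i % world_size == rank]
my_tokens = val_tokens  # the (start, end) pairs index into the full stream
```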
```python
    for li, adapter in lora_adapters.items():
        adapter.A.requires_grad_(True)
        adapter.B.requires_grad_(True)

    # Forward with LoRA (differentiable)
    # We apply LoRA by modifying the output projection temporarily
    # and running a small forward pass just for the gradient
    for li, adapter in lora_adapters.items():
        delta = adapter.B @ adapter.A
        base_model.qo_bank.data[n + li] = orig_out_weights[li] + delta.to(orig_out_weights[li].dtype)
```
In the adaptation phase, adapter.A/B.requires_grad_(True) doesn’t actually connect LoRA params to the loss because delta is written into qo_bank via .data, which breaks autograd tracking. Since gradients are being computed manually from qo_bank.grad, it would be clearer/safer to (1) drop the requires_grad_ toggles, and (2) avoid .data in favor of with torch.no_grad(): base_model.qo_bank[n+li].copy_(...) for the weight injection.
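Concretely, the injection step from the snippet above could be written without `.data` (same names as the excerpt):

```python
# Inject the LoRA delta without detaching qo_bank from autograd bookkeeping;
# the manual gradient read from qo_bank.grad afterwards is unchanged.
with torch.no_grad():
    for li, adapter in lora_adapters.items():
        delta = adapter.B @ adapter.A
        base_model.qo_bank[n + li].copy_(
            orig_out_weights[li] + delta.to(orig_out_weights[li].dtype)
        )
```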
```python
    # Zero grads
    if base_model.qo_bank.grad is not None:
        base_model.qo_bank.grad = None
    for p in base_model.parameters():
        if p.grad is not None:
            p.grad = None
```
Clearing gradients by iterating over base_model.parameters() every chunk is very expensive in doc-TTT eval and scales with full model size. Since only qo_bank grads are used, consider disabling requires_grad for all other parameters during TTT and using torch.autograd.grad(adapt_loss, base_model.qo_bank, retain_graph=False) (or base_model.qo_bank.grad = None + zero_grad(set_to_none=True) on just the needed tensors) to avoid per-parameter loops.
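A sketch of the cheaper pattern, assuming `adapt_loss` is the chunk loss already computed in this phase (the LoRA update itself stays as in the script):

```python
# Only qo_bank's gradient is consumed, so compute just that one instead of
# back-propagating into .grad on every parameter and clearing them in a loop.
(qo_grad,) = torch.autograd.grad(adapt_loss, base_model.qo_bank, retain_graph=False)
# ... project qo_grad back onto the LoRA A/B factors exactly as before ...
```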
```markdown
- **EMA**: Exponential moving average with decay=0.997
- **Tight SWA**: Stochastic weight averaging every 50 steps when LR < 20%
- **Late QAT**: Quantization-aware training activated when LR scale < 0.15
- **Warmdown 0.75**: Wall-clock-aware cosine warmdown over 75% of training time
```
README claims a “wall-clock-aware cosine warmdown” but lr_mul() in train_gpt.py implements a linear ramp-down based on remaining warmdown time. Please align the documentation with the actual schedule (or implement cosine warmdown if that’s the intent).
Suggested change:
```diff
-- **Warmdown 0.75**: Wall-clock-aware cosine warmdown over 75% of training time
+- **Warmdown 0.75**: Wall-clock-aware linear warmdown over 75% of training time
```
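For reference, a sketch of how the two wordings differ in code, assuming the schedule is driven by normalized elapsed wall-clock time `t` in [0, 1] with `warmdown_frac = 0.75`; names and the exact ramp target are illustrative, not the script's actual lr_mul():

```python
import math

def lr_mul(t: float, warmdown_frac: float = 0.75, cosine: bool = False) -> float:
    """LR multiplier over normalized wall-clock time t in [0, 1]."""
    warmdown_start = 1.0 - warmdown_frac
    if t < warmdown_start:
        return 1.0
    frac = (t - warmdown_start) / warmdown_frac  # 0 -> 1 across the warmdown window
    if cosine:
        # what the README currently claims
        return 0.5 * (1.0 + math.cos(math.pi * frac))
    # what the reviewed lr_mul() actually does: a linear ramp-down
    return 1.0 - frac
```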
…ward() to prevent gradient flow through distributed model
…; PRISM + Ouroboros papers; Session 13
- Merged SOTA 1.0810 unchanged (5-day plateau; 16 days to deadline)
- PR openai#1610 (romeerp, 1.0728): VarLenAttn + PhasingTTT — legal, score-first compliant, but low EV (-0.0006 bpb)
- PR openai#1619 flagged likely illegal (AdamW TTT — same pattern as rejected PR openai#771)
- PRISM (arXiv:2602.10796, Feb 2026): Parallel Residual Iterative Sequence Model, 174x throughput — read before next recurrence architecture decision
- Ouroboros (arXiv:2604.02051, Apr 2026): input-conditioned LoRA modulation for recursive transformers — watch
- Session 13 added to CLAUDE.md; no strategy change (PR openai#1586 per-layer GPTQ still #1 priority)
- daily_research.md Apr 14 entry added at top

https://claude.ai/code/session_01GLn4VtS8D1uehRZnfb4dRe
Closing in favour of a cleaner replacement PR: new SP8192 + SGD-TTT + SDClip GPTQ + Brotli-11 submission (2026-04-16_SP8192_CleanStack_SGD_TTT). Apr-14 run regressed to 1.1600 BPB due to GPTQ degradation; new PR targets ~1.07-1.08 BPB with the proven SOTA stack.