Submission/sp8192 depthrecur adamwttt #1619

Closed
AVINASH0052 wants to merge 13 commits into openai:main from AVINASH0052:submission/sp8192-depthrecur-adamwttt

Conversation

@AVINASH0052

No description provided.

AVINASH0052 marked this pull request as ready for review April 14, 2026 14:46
Copilot AI review requested due to automatic review settings April 14, 2026 14:46

Copilot AI left a comment

Pull request overview

Adds new 10min_16mb training/eval scripts and record metadata for two runs, including a new SP8192 + depth recurrence + doc-level LoRA test-time-training (TTT) variant and a prior FullGPTQ/XSA/BigramHash baseline with run scripts.

Changes:

  • Add a new 2026-04-14 SP8192 model/training pipeline including depth recurrence, parallel residual blocks, and document-aware AdamW LoRA TTT evaluation.
  • Add submission metadata and README documentation for the 2026-04-14 record.
  • Add a 2026-04-08 baseline record including train script, submission metadata, README, and runnable shell scripts.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Per-file summary:

  • records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/train_gpt.py: New end-to-end training + eval + GPTQ quantization pipeline with added depth recurrence, parallel residuals, and doc-aware LoRA TTT eval.
  • records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/submission.json: Adds submission metadata for the new SP8192/depth-recur/doc-TTT run.
  • records/track_10min_16mb/2026-04-14_SP8192_DepthRecur_VarLen_DocTTT/README.md: Documents architecture/training/quantization/eval claims for the new run.
  • records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/train_gpt.py: Adds baseline training + GPTQ pipeline script for the 2026-04-08 record.
  • records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/submission.json: Adds baseline submission metadata with results and environment details.
  • records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/run_smoke_1gpu.sh: Adds 1-GPU smoke test runner for the baseline pipeline.
  • records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/run_leaderboard_8xh100.sh: Adds 8×H100 leaderboard runner for the baseline pipeline.
  • records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/README.md: Documents baseline architecture/training/quantization and how to run.


Comment on lines +1 to +6
# SP8192 + Depth Recurrence + VarLen Attention + Doc-LoRA TTT

## Architecture

- **Tokenizer**: SP8192 (SentencePiece BPE 8192 vocab) — 8x larger vocab for better token efficiency
- **Model**: 11 layers, 512d, 8 heads / 4 KV heads (GQA), MLP 3x (LeakyReLU(0.5)^2)

Copilot AI Apr 14, 2026

The README title/architecture mentions “VarLen Attention”, but the training script doesn’t use any varlen attention kernel (and _flash_attn_varlen_func is unused). This makes the described feature set misleading; please either implement varlen attention or update the README to reflect the actual attention implementation.

Suggested change
- # SP8192 + Depth Recurrence + VarLen Attention + Doc-LoRA TTT
- ## Architecture
- - **Tokenizer**: SP8192 (SentencePiece BPE 8192 vocab) — 8x larger vocab for better token efficiency
- - **Model**: 11 layers, 512d, 8 heads / 4 KV heads (GQA), MLP 3x (LeakyReLU(0.5)^2)
+ # SP8192 + Depth Recurrence + Doc-LoRA TTT
+ ## Architecture
+ - **Tokenizer**: SP8192 (SentencePiece BPE 8192 vocab) — 8x larger vocab for better token efficiency
+ - **Model**: 11 layers, 512d, 8 heads / 4 KV heads (GQA), MLP 3x (LeakyReLU(0.5)^2)
+ - **Attention implementation**: Standard causal attention; no variable-length attention kernel is currently enabled in training

Comment on lines +1 to +27
{
  "track": "10min_16mb",
  "tokenizer": "sp8192",
  "architecture": "11L-512d-8H4KV-GQA-MLP3x-UNet-BigramHash3072-XSA11-VE128-DepthRecur-ParResid-DocTTT",
  "features": [
    "SP8192 tokenizer (8192 vocab BPE)",
    "Depth recurrence on layers 3-5 (2 passes, gated blend)",
    "Parallel residuals GPT-J style on layers 7+",
    "QK-Gain 5.25",
    "BigramHash 3072x112 with XOR hash",
    "XSA on all 11 layers",
    "SmearGate temporal smoothing",
    "Value Embedding VE128 on layers 9-10",
    "Partial RoPE 16/64 dims",
    "U-Net encoder-decoder skip connections",
    "Tied embeddings (std=0.005)",
    "Muon optimizer (momentum=0.97, WD=0.095)",
    "EMA decay=0.997 + Tight SWA every 50 steps",
    "Late QAT at LR<15%",
    "Warmdown fraction=0.75",
    "Full Hessian GPTQ int6 (AR calib 64x2048, block=128)",
    "Selective ±1 pruning + LZMA-9",
    "Score-first AdamW Doc-LoRA TTT (chunk=64, rank=8, all layers)"
  ],
  "val_bpb": null,
  "notes": "All code written from scratch. AdamW TTT is novel — nobody else uses AdamW for test-time LoRA adaptation."
}

Copilot AI Apr 14, 2026

submission.json format appears inconsistent with other records/track_10min_16mb/*/submission.json entries (e.g., 2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072/submission.json includes author, github_id, name, date, val_bpb, seed_results, etc.). If this file is consumed by tooling, missing standard fields (and val_bpb: null) may break parsers; consider matching the established schema used by other submissions.
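If this record is meant to stay, a quick pre-push check helps; below is a minimal sketch, assuming only the fields this comment names (author, github_id, name, date, val_bpb, seed_results) plus track, not the leaderboard tooling's actual schema, so compare against an existing record first:

```python
# Hypothetical schema check: EXPECTED_FIELDS mirrors only the fields named in
# this comment, not the real parser's requirements.
import json
import sys

EXPECTED_FIELDS = ["author", "github_id", "name", "date", "track", "val_bpb", "seed_results"]

with open(sys.argv[1]) as f:
    sub = json.load(f)

missing = [k for k in EXPECTED_FIELDS if k not in sub]
if missing:
    sys.exit(f"submission.json missing expected fields: {missing}")
if sub.get("val_bpb") is None:
    sys.exit("val_bpb is null; a leaderboard parser that sorts on it may fail")
print("submission.json looks schema-compatible")
```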

Comment on lines +50 to +54
try:
    from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
    _HAS_FA_VARLEN = True
except ImportError:
    _HAS_FA_VARLEN = False

Copilot AI Apr 14, 2026

flash_attn_varlen_func is imported and _HAS_FA_VARLEN is set, but the file never uses _flash_attn_varlen_func (the only reference is the import). Either remove the unused import/flag (and any related logging/docs), or add the intended varlen attention path so the “VarLen Attention” claim is accurate.
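If the varlen path was the intent, the call is roughly the sketch below, assuming q/k/v are packed as (total_tokens, n_heads, head_dim) and doc_starts holds per-document start offsets beginning with 0; this illustrates the flash-attn varlen API rather than code from train_gpt.py:

```python
import torch

def varlen_causal_attention(q, k, v, doc_starts, total_len):
    # Cumulative sequence lengths [0, start_1, ..., total_len] as int32, which
    # is what the kernel expects; each document attends only within itself.
    cu_seqlens = torch.tensor(list(doc_starts) + [total_len],
                              dtype=torch.int32, device=q.device)
    max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())
    return _flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True,
    )
```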

Comment on lines +979 to +988
"""Find document boundaries based on token value wrapping / BOS detection.
Returns list of (start, end) index pairs."""
# Simple heuristic: treat token 0 or 1 as BOS markers, or detect large
# discontinuities. For FineWeb with SP tokenizer, documents are packed
# contiguously. We split on token_id == 1 (BOS).
boundaries = [0]
t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy()
for i in range(1, len(t)):
if t[i] == 1: # BOS token
boundaries.append(i)

Copilot AI Apr 14, 2026

Document boundary detection is hard-coded to token_id == 1 as BOS. SentencePiece BOS/CLS ids are model-dependent (and some datasets may not insert BOS at all), so this can split documents incorrectly or not at all. Consider using sp.bos_id() (or making BOS id configurable via env var) and/or validating that the packed dataset actually contains boundary markers before enabling doc-level TTT.

Suggested change
-     """Find document boundaries based on token value wrapping / BOS detection.
-     Returns list of (start, end) index pairs."""
-     # Simple heuristic: treat token 0 or 1 as BOS markers, or detect large
-     # discontinuities. For FineWeb with SP tokenizer, documents are packed
-     # contiguously. We split on token_id == 1 (BOS).
-     boundaries = [0]
-     t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy()
-     for i in range(1, len(t)):
-         if t[i] == 1:  # BOS token
-             boundaries.append(i)
+     """Find document boundaries using a configured BOS marker when available.
+     Returns list of (start, end) index pairs."""
+     # Document-level TTT requires reliable boundary markers in the packed token
+     # stream. SentencePiece BOS ids are tokenizer-dependent, and some datasets
+     # may not include BOS markers at all, so do not hard-code a token id here.
+     t = tokens.cpu().numpy() if tokens.is_cuda else tokens.numpy()
+     if len(t) == 0:
+         return []
+     bos_id_env = os.getenv("DOC_TTT_BOS_ID")
+     bos_id = None
+     if bos_id_env is not None and bos_id_env != "":
+         try:
+             bos_id = int(bos_id_env)
+         except ValueError:
+             bos_id = None
+     if bos_id is None:
+         return [(0, len(t))]
+     if vocab_size is not None and not (0 <= bos_id < vocab_size):
+         return [(0, len(t))]
+     boundaries = [0]
+     for i in range(1, len(t)):
+         if int(t[i]) == bos_id:
+             boundaries.append(i)
+     # If the packed dataset does not contain the configured BOS marker, treat
+     # the full span as a single document rather than splitting incorrectly.
+     if len(boundaries) == 1:
+         return [(0, len(t))]

Comment on lines +1019 to +1026
    # Split validation tokens across ranks
    rank_start = (total_tokens * rank) // world_size
    rank_end = (total_tokens * (rank + 1)) // world_size
    my_tokens = val_tokens[rank_start:rank_end + 1]

    # Find document boundaries in our shard
    doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)


Copilot AI Apr 14, 2026

eval_val_doc_ttt shards val_tokens by raw token index (rank_start/rank_end) and then detects document boundaries only inside that shard. If a document crosses a shard boundary, ranks will reset LoRA state mid-document, making distributed results differ from single-process evaluation. To keep eval deterministic/correct, partition work by document ranges (global boundary scan) or run doc-TTT eval on rank 0 only and all-reduce the final metrics.

Suggested change
-     # Split validation tokens across ranks
-     rank_start = (total_tokens * rank) // world_size
-     rank_end = (total_tokens * (rank + 1)) // world_size
-     my_tokens = val_tokens[rank_start:rank_end + 1]
-     # Find document boundaries in our shard
-     doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
+     # Doc-TTT evaluation must preserve full-document state. Sharding by raw
+     # token index can split a document across ranks, causing LoRA state to be
+     # reset mid-document and making distributed results differ from a
+     # single-process run. To keep evaluation deterministic/correct, run the
+     # document scan on rank 0 over the full validation stream and rely on the
+     # existing metric reduction logic after local accumulation.
+     if rank == 0:
+         my_tokens = val_tokens
+         doc_ranges = _find_document_boundaries(my_tokens, args.vocab_size)
+     else:
+         my_tokens = val_tokens[:1]
+         doc_ranges = []
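If the rank-0 route is taken, the metric reduction mentioned above can be a single all-reduce; a minimal sketch, with loss_sum/token_count as illustrative names rather than the script's variables:

```python
import torch
import torch.distributed as dist

def reduce_doc_ttt_metrics(loss_sum: float, token_count: float) -> float:
    # Non-zero ranks pass zeros, so the SUM reduction recovers rank 0's totals
    # on every rank; returns mean loss in nats per token.
    stats = torch.tensor([loss_sum, token_count], dtype=torch.float64, device="cuda")
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return (stats[0] / stats[1]).item()
```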

Comment on lines +1096 to +1106
    for li, adapter in lora_adapters.items():
        adapter.A.requires_grad_(True)
        adapter.B.requires_grad_(True)

    # Forward with LoRA (differentiable)
    # We apply LoRA by modifying the output projection temporarily
    # and running a small forward pass just for the gradient
    for li, adapter in lora_adapters.items():
        delta = adapter.B @ adapter.A
        base_model.qo_bank.data[n + li] = orig_out_weights[li] + delta.to(orig_out_weights[li].dtype)


Copilot AI Apr 14, 2026

In the adaptation phase, adapter.A/B.requires_grad_(True) doesn’t actually connect LoRA params to the loss because delta is written into qo_bank via .data, which breaks autograd tracking. Since gradients are being computed manually from qo_bank.grad, it would be clearer/safer to (1) drop the requires_grad_ toggles, and (2) avoid .data in favor of with torch.no_grad(): base_model.qo_bank[n+li].copy_(...) for the weight injection.
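Concretely, the no_grad injection would look roughly like the sketch below (same names as the hunk above; illustrative, not a tested patch):

```python
# Write the LoRA delta into the bank explicitly instead of through .data;
# qo_bank is still read in the next forward pass, so qo_bank.grad is
# populated by backward exactly as before.
with torch.no_grad():
    for li, adapter in lora_adapters.items():
        delta = adapter.B @ adapter.A
        base_model.qo_bank[n + li].copy_(
            orig_out_weights[li] + delta.to(orig_out_weights[li].dtype)
        )
```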

Comment on lines +1134 to +1139
    # Zero grads
    if base_model.qo_bank.grad is not None:
        base_model.qo_bank.grad = None
    for p in base_model.parameters():
        if p.grad is not None:
            p.grad = None

Copilot AI Apr 14, 2026

Clearing gradients by iterating over base_model.parameters() every chunk is very expensive in doc-TTT eval and scales with full model size. Since only qo_bank grads are used, consider disabling requires_grad for all other parameters during TTT and using torch.autograd.grad(adapt_loss, base_model.qo_bank, retain_graph=False) (or base_model.qo_bank.grad = None + zero_grad(set_to_none=True) on just the needed tensors) to avoid per-parameter loops.
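A sketch of the torch.autograd.grad route, reusing the adapt_loss and qo_bank names from the eval code (illustrative, not a tested patch):

```python
# Pull only the gradient doc-TTT actually consumes; no other parameter
# accumulates .grad, so the per-parameter zeroing loop can be dropped.
(qo_grad,) = torch.autograd.grad(adapt_loss, base_model.qo_bank, retain_graph=False)
# Read qo_grad wherever base_model.qo_bank.grad was read before.
```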

- **EMA**: Exponential moving average with decay=0.997
- **Tight SWA**: Stochastic weight averaging every 50 steps when LR < 20%
- **Late QAT**: Quantization-aware training activated when LR scale < 0.15
- **Warmdown 0.75**: Wall-clock-aware cosine warmdown over 75% of training time

Copilot AI Apr 14, 2026

README claims a “wall-clock-aware cosine warmdown” but lr_mul() in train_gpt.py implements a linear ramp-down based on remaining warmdown time. Please align the documentation with the actual schedule (or implement cosine warmdown if that’s the intent).

Suggested change
- - **Warmdown 0.75**: Wall-clock-aware cosine warmdown over 75% of training time
+ - **Warmdown 0.75**: Wall-clock-aware linear warmdown over 75% of training time
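For reference, a wall-clock-aware linear warmdown of the kind lr_mul() reportedly implements looks like the sketch below; the names and exact shape are assumptions, not the code in train_gpt.py:

```python
def lr_mul(elapsed_s: float, budget_s: float, warmdown_frac: float = 0.75) -> float:
    # Hold the multiplier at 1.0, then ramp linearly to 0 over the last
    # warmdown_frac of the wall-clock budget (0.75 here, per the README).
    warmdown_start = budget_s * (1.0 - warmdown_frac)
    if elapsed_s <= warmdown_start:
        return 1.0
    remaining = max(budget_s - elapsed_s, 0.0)
    return remaining / (budget_s - warmdown_start)
```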

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 14, 2026
…; PRISM + Ouroboros papers; Session 13

- Merged SOTA 1.0810 unchanged (5-day plateau; 16 days to deadline)
- PR openai#1610 (romeerp, 1.0728): VarLenAttn + PhasingTTT — legal, score-first compliant, but low EV (-0.0006 bpb)
- PR openai#1619 flagged likely illegal (AdamW TTT — same pattern as rejected PR openai#771)
- PRISM (arXiv:2602.10796, Feb 2026): Parallel Residual Iterative Sequence Model, 174x throughput — read before next recurrence architecture decision
- Ouroboros (arXiv:2604.02051, Apr 2026): input-conditioned LoRA modulation for recursive transformers — watch
- Session 13 added to CLAUDE.md; no strategy change (PR openai#1586 per-layer GPTQ still #1 priority)
- daily_research.md Apr 14 entry added at top

https://claude.ai/code/session_01GLn4VtS8D1uehRZnfb4dRe
@AVINASH0052 (Author)

Closing in favour of a cleaner replacement PR: new SP8192 + SGD-TTT + SDClip GPTQ + Brotli-11 submission (2026-04-16_SP8192_CleanStack_SGD_TTT). Apr-14 run regressed to 1.1600 BPB due to GPTQ degradation; new PR targets ~1.07-1.08 BPB with the proven SOTA stack.
