Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715) by yashverms · Pull Request #418 · openai/parameter-golf

yashverms · 2026-03-22T11:57:48Z

Summary

Non-record submission exploring 3 novel techniques not yet attempted in any merged or open PR, built on the proven PR #315 technique stack.

Novel Contributions

DiffTransformer V2 Attention (last 2 layers) — noise-cancelled attention via differential softmax maps (Ye et al., ICLR 2025 Oral)
NorMuon Optimizer — replaces Muon with per-neuron row normalization after Newton-Schulz orthogonalization, ~11% better compute efficiency
TrigramHash + Context-Aware N-gram Gating — extends BigramHash with trigram patterns and a learned sigmoid gate that modulates n-gram signal based on hidden state (inspired by DeepSeek Engram)

Architecture

11 layers, 512 dim, 8/4 heads (GQA), MLP 3× (ReLU²)
XSA on last 6 layers, DiffAttn on last 2
Partial RoPE (16/64 dims), LN depth scaling, SmearGate
BigramHash(2048) + TrigramHash(2048) + context-aware gate
U-Net skips, tied embeddings, logit softcap

Results

Metric	Value
val_bpb (post-quant)	1.1715 (no sliding window)
Pre-quant val_bpb	1.1607
Steps	4,600 (600s wallclock)
Params	27,518,587
Artifact	15,586,651 bytes (int6+zstd-22)
GPU	8×H100 SXM

Gap Analysis

Score is ~0.029 bpb behind merged SOTA (1.1428). Key factors: no sliding window eval (~0.03 bpb), small BigramHash (2048 vs 10240), NorMuon momentum=0.95 vs proven 0.99, SDPA fallback instead of Flash Attention 3. The submitted code has these issues fixed (sliding window re-enabled, correct 16MB decimal limit).

Why This Is Interesting

First submission using Differential Attention in this competition
First submission using NorMuon optimizer
First submission with context-aware n-gram gating
Documents which 2026 architectural innovations transfer (or don't) to the 16MB parameter-constrained regime

Test plan

Training completes within 600s on 8×H100
Artifact under 16,000,000 bytes
Post-quant roundtrip evaluation produces valid val_bpb
Code is self-contained in train_gpt.py
Sliding window eval (re-enabled in submitted code, not yet run)
Multi-seed verification (single seed only in this submission)

Made with Cursor

…igramHash Three novel techniques on top of PR openai#315's stack: 1. DiffTransformer V2 attention (last 2 layers) for noise-cancelled attention 2. NorMuon optimizer with per-neuron row normalization 3. TrigramHash + context-aware n-gram gating 11L/512d, XSA6, Partial RoPE, int6+zstd-22. Post-quant val_bpb=1.1715 (without sliding window eval). 8xH100, 600s, 15.59MB artifact. Made-with: Cursor

MatoTeziTanka · 2026-04-12T14:19:02Z

Sibling Draft Review for PR #418

Date: 2026-04-12
Reviewer: Claude Agent (for Mato/@MatoTeziTanka)
NEEDS_LLM_REVIEW: Yes
Status: Ready for detailed code review

PR Summary

Title: Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)
State: OPEN
Mato blocking comment: NO

Marker Analysis

Marker	Found	Notes
target_in_key	False	Custom loss key injection pattern
TTT	False	Test-Time Training integration
SLOT	False	Slot-based attention variant
custom_tokenizer	True	3 patterns detected

Architecture Changes

train_gpt.py modifications: ~1544 lines changed
Quantization: True
EMA: True
LoRA: False
GPTQ: False

Assessment

⚠ REQUIRES_REVIEW - Custom modifications detected:

Custom tokenizer patterns found (3 instances)

Recommendation

REVIEW — Detailed code inspection needed before merge

Review Checklist

Code style and formatting (PEP 8, consistency)
Numerical stability of modifications
Integration with existing pipeline
Reproducibility (seed handling, RNG control)
Performance metrics reported correctly
No undocumented dependencies

Next Steps

Full train_gpt.py code review
Validate loss computation and gradients
Check for integration issues with existing pipeline
Verify metrics reproducibility

Generated for: Mato (@MatoTeziTanka)
Scope: NEEDS_LLM_REVIEW sweep
Action: Draft only—no posts to GitHub

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE-pending standard record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

MatoTeziTanka · 2026-04-12T14:19:09Z

Sibling Draft Review for PR #418

Date: 2026-04-12
Reviewer: Claude Agent (for Mato/@MatoTeziTanka)
NEEDS_LLM_REVIEW: Yes
Status: Ready for detailed code review

PR Summary

Title: Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)
State: OPEN
Mato blocking comment: NO

Marker Analysis

Marker	Found	Notes
target_in_key	False	Custom loss key injection pattern
TTT	False	Test-Time Training integration
SLOT	False	Slot-based attention variant
custom_tokenizer	True	3 patterns detected

Architecture Changes

train_gpt.py modifications: ~1544 lines changed
Quantization: True
EMA: True
LoRA: False
GPTQ: False

Assessment

⚠ REQUIRES_REVIEW - Custom modifications detected:

Custom tokenizer patterns found (3 instances)

Recommendation

REVIEW — Detailed code inspection needed before merge

Review Checklist

Code style and formatting (PEP 8, consistency)
Numerical stability of modifications
Integration with existing pipeline
Reproducibility (seed handling, RNG control)
Performance metrics reported correctly
No undocumented dependencies

Next Steps

Full train_gpt.py code review
Validate loss computation and gradients
Check for integration issues with existing pipeline
Verify metrics reproducibility

Generated for: Mato (@MatoTeziTanka)
Scope: NEEDS_LLM_REVIEW sweep
Action: Draft only—no posts to GitHub

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE-pending standard record-track checks.

Reviewed by @MatoTeziTanka — The Agora. Classification via sibling-session agent (Haiku-backed). This review was drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

MatoTeziTanka · 2026-04-12T14:20:57Z

Community Review — PrismLM v3 (DiffAttn + NorMuon + TrigramHash)

Compliance: LOOKS CLEAN — no disqualifying patterns found

Checked all five compliance vectors:

N-gram family bug: BigramHash and TrigramHash both use only context tokens (t[i], t[i-1], t[i-2]) — no target token in the key. Legal.
Pre-Quant TTT: None present. Eval runs under torch.inference_mode() with no optimizer steps on val_tokens.
Legal TTT: No TTT of any form.
Scored-region SLOT: Sliding window code exists but was not used for the submitted score. No SLOT concern at this time.
Pure neural: Uses BigramHash + TrigramHash (learned embeddings). No count-based components.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks. Novel techniques (DiffAttn, NorMuon, TrigramHash+gate) are architecturally interesting and compliant.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, with manual override on classification. If this review misread your code, please call it out so I can re-audit manually.

yashverms · 2026-04-12T22:33:46Z

Thanks for the thorough review @MatoTeziTanka — really appreciate the detailed compliance audit and the time you put into walking through the n-gram key construction, eval mode, and gating logic.

To confirm: the 3 "custom tokenizer patterns" flagged in the initial sweep are the BigramHash and TrigramHash embedding tables — these are learned embedding lookups keyed on context-only tokens (t[i-1], t[i]) and (t[i-2], t[i-1], t[i]) respectively. No target token is ever used in the hash key, as your detailed compliance review confirmed. The context-aware sigmoid gate modulates the combined n-gram signal using the hidden state (which is also strictly causal). So everything stays within the legal boundary.

On the sliding window eval: the submitted train_gpt.py has it re-enabled, but the reported 1.1715 score was measured without it (standard eval only). I plan to run a sliding window eval pass to update the score — expect roughly ~0.025–0.03 BPB improvement based on community data.

Happy to clarify anything else. Thanks again for the review and the merge recommendation.

abaybektursun mentioned this pull request Mar 26, 2026

Non-record: Negative results — quantization algorithms, TTT, architecture, and self-generated GPTQ calibration study #756

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)#418

Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)#418
yashverms wants to merge 1 commit intoopenai:mainfrom
yashverms:prismlm-v3-non-record

yashverms commented Mar 22, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

yashverms commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yashverms commented Mar 22, 2026

Summary

Novel Contributions

Architecture

Results

Gap Analysis

Why This Is Interesting

Test plan

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Sibling Draft Review for PR #418

PR Summary

Marker Analysis

Architecture Changes

Assessment

Recommendation

Review Checklist

Next Steps

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Sibling Draft Review for PR #418

PR Summary

Marker Analysis

Architecture Changes

Assessment

Recommendation

Review Checklist

Next Steps

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Community Review — PrismLM v3 (DiffAttn + NorMuon + TrigramHash)

Uh oh!

yashverms commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants