feat: recursive weight sharing for 16MB limit#15

Closed
ArthurKaroyan wants to merge 1 commit into openai:main from ArthurKaroyan:feat/recursive-transformer

Conversation

@ArthurKaroyan

No description provided.

Ueaj-Kerman added a commit to Ueaj-Kerman/parameter-golf that referenced this pull request Mar 19, 2026
Add entries openai#15-18 to experiment log covering three worktree experiments:
- GatedCausalConv (ssl): conv replacing first transformer block, best 1.2247 bpb
- NorMuon (normuon): per-row second moment normalization in Muon (code-only)
- SPlus (svdopt): SVD eigenbasis optimizer replacing Muon (code-only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@0hq
Contributor

0hq commented Mar 19, 2026

Not a valid submission, resubmit with training log to prove efficacy.

@0hq 0hq closed this Mar 19, 2026
gb250e referenced this pull request in gb250e/parameter-golf Mar 21, 2026
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…prime stride sampling

Inspired by PR openai#1099/openai#1060/openai#1135 which use TOKEN-level coprime stride. Token-level
needs 60+ LOC rewrite of TokenStream (no random access). Shipping the SHARD-LEVEL
variant: modify _advance_file() to use a coprime stride instead of +1, so nearby
training steps see topically-different shards rather than adjacent similar ones.

Implementation: 13 LOC, two anchors in TokenStream class (none of the existing
24 patches touch TokenStream — verified via grep). Gated by USE_COPRIME_STRIDE=1,
falls back to stride=1 default. Idempotent via COPRIME_STRIDE_MARKER.
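
A minimal sketch of the gating described above, assuming the flag is read from the environment. The helper name `pick_stride` and the rule of searching downward from N/2 are hypothetical illustrations, not the patch's actual code:

```python
import math
import os


def pick_stride(num_shards, env=None):
    """Return the shard stride: 1 by default, a coprime stride when gated on.

    Hypothetical helper; only the USE_COPRIME_STRIDE name comes from the
    commit message.
    """
    env = os.environ if env is None else env
    if env.get("USE_COPRIME_STRIDE", "0") != "1" or num_shards < 3:
        return 1  # default behaviour: advance to the adjacent shard
    # Assumption: start near N/2 and walk down until gcd(s, N) == 1.
    # If nothing above 1 qualifies, this naturally falls back to 1.
    s = num_shards // 2
    while s > 1 and math.gcd(s, num_shards) != 1:
        s -= 1
    return s
```

With the flag unset the function always returns 1, so the default shard order is unchanged.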

Effect: with N shards and gcd(s,N)=1, the sequence 0->s->2s->... (mod N) visits
every shard exactly once before repeating. Consecutive steps land s shards apart,
which is intended to decorrelate adjacent batches and reduce gradient noise.
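
The coverage property can be checked with a short sketch. `shard_order` is a hypothetical stand-in written for illustration, not the `_advance_file()` code from the patch:

```python
import math


def shard_order(num_shards: int, stride: int, start: int = 0) -> list[int]:
    """List the shards visited in one full pass with a fixed stride (mod N)."""
    if math.gcd(stride, num_shards) != 1:
        # A non-coprime stride would cycle through only a subset of shards.
        raise ValueError("stride must be coprime with num_shards")
    return [(start + k * stride) % num_shards for k in range(num_shards)]


order = shard_order(10, 3)
print(order)  # [0, 3, 6, 9, 2, 5, 8, 1, 4, 7]
# Every shard appears exactly once before the sequence repeats.
assert sorted(order) == list(range(10))
```

A stride such as 4 with 10 shards would fail the gcd check, since it would only ever reach the even-numbered shards.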

Smaller benefit than full token-level (~25% per PR openai#1099 logic), but ships TODAY
at near-zero risk vs. 60+ LOC structural rewrite.

4 CS experiments queued: CS0_alone, CS1_seed42, CS2_L4weights, CS3_with_engram.

This is the FIRST data-side patch in our 24-patch stack. Tests a completely new
vector after the "neutrality plateau" of architectural/optimizer/training-time
patches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2 participants