
Non-record: sin² activation + causal screening pipeline#877

Open
clthuang wants to merge 61 commits into openai:main from clthuang:clthuang-dev

Conversation

@clthuang

Non-Record Submission: sin² Activation + Causal Screening Pipeline

Summary

  • sin² periodic activation replacing relu² in MLP layers (1-line change in train_gpt.py)
  • Discovered via an automated causal screening pipeline that tested 24 interventions
  • Novel explorations: adaptive multi-token prediction (variable-depth MTP) and Rho-1 selective loss masking

Key Change

```python
# Before (relu²):
x = torch.relu(self.fc(x))
return self.proj(x.square())

# After (sin²):
return self.proj(torch.sin(self.fc(x)).square())
```

Why sin²?

Motivated by FAN/FANformer (NeurIPS 2025), which reports 31% parameter-efficiency gains from periodic activations at 1B scale. sin² preserves relu²'s non-negative output range and squaring structure while eliminating the hard zero cutoff that zeroes out roughly half of all activations.
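
To make the contrast concrete, here is a minimal sketch (plain PyTorch, mirroring the one-line change above) comparing the two activations on a range of pre-activations:

```python
import torch

x = torch.linspace(-3, 3, 7)

relu_sq = torch.relu(x).square()  # hard zero for every x <= 0
sin_sq = torch.sin(x).square()    # non-negative and periodic, no dead half-line

print(relu_sq)  # zeros across the entire negative half
print(sin_sq)   # nonzero response for almost all inputs
```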

Screening Results (24 interventions, Apple Silicon MLX)

| Intervention | Train Loss Δ | Notes |
|---|---|---|
| softcap=20 | -0.447 | Strongest signal |
| dim=640 | -0.311 | Wider > deeper for short training |
| sin² activation | -0.017 | Competitive with GELU (-0.014) and SiLU (-0.018) |
| adaptive-K MTP | +0.008 | Needs longer training (warmup-dependent) |
| rho1 selective loss | pending | Threshold calibration needed |

Infrastructure

Built a 12-script automated experimentation pipeline (144 tests):

  • Causal DAG extraction from leaderboard ablation data
  • Discovery-adjust cycle with paired-seed experiments (sketched after this list)
  • In-process MLX training with warmup caching (7× faster than subprocess)
  • Per-step loss tracking with curve plotting
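
For context, a minimal sketch of the paired-seed comparison idea (illustrative only; the pipeline's actual implementation lives in scripts/causal/common.py and statistical_analysis.py, whose exact interfaces are not shown here):

```python
from scipy import stats

# Each seed trains one treatment run and one control run under identical
# settings, so differencing within a seed cancels seed-level noise.
# The numbers below are made up for illustration.
treatment = [1.302, 1.298, 1.305]  # final train loss per seed
control = [1.319, 1.317, 1.322]

t_stat, p_value = stats.ttest_rel(treatment, control)
deltas = [t - c for t, c in zip(treatment, control)]
print(f"mean delta = {sum(deltas) / len(deltas):+.4f}, p = {p_value:.3f}")
```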

Status

  • ✅ Local MLX screening complete (24 interventions)
  • ⏳ H100 validation pending compute credits
  • ⏳ 3-seed statistical significance pending

Files

  • records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/train_gpt.py — modified baseline with sin² activation
  • records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/README.md — detailed writeup
  • records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/submission.json — metadata

Terry and others added 30 commits March 24, 2026 21:06
- Install causal-learn, statsmodels, networkx, graphviz, scipy, pytest via uv
- Create scripts/causal/ and tests/causal/ directory structure
- Implement common.py with 9 utility functions (load_model, compute_bpb, paired_ttest, etc.)
- 25 tests passing, 1 skipped (load_model checkpoint integration test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on, quant gap

T14 (statistical_analysis.py): Paired-seed ablation analysis with bootstrap
CI, Holm-Bonferroni correction, decision gate, and platform transfer
coefficient (a Holm-Bonferroni sketch follows at the end of this message).

T15 (token_loss_decompose.py): Per-token loss decomposition with frequency
bucketing (top-100/mid/tail), boundary/mid-sequence classification, and
decomposition verification. Integration requires checkpoint; unit tests use
mock data.

T16 (quant_gap_analysis.py): Pre/post quantization BPB gap computation with
3x threshold check against largest training effect.

All 17 new tests pass. No regressions (102 total pass).
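
A minimal sketch of the Holm-Bonferroni step named in T14 (the textbook step-down procedure, not the pipeline's exact code):

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm correction: compare the k-th smallest p-value
    (0-indexed) against alpha / (m - k); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            rejected[i] = True
        else:
            break  # every larger p-value fails as well
    return rejected
```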

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parallel implementation of 8 scripts covering DAG discovery, experiment
pipeline, and diagnostic probes:

- extract_interventions.py: 3-tier README parser, 0.90 field coverage (23 tests)
- estimate_dag.py: expert DAG + FCI validation + cycle updates (9 tests)
- experiment_runner.py: paired seed ablation with error handling (14 tests)
- statistical_analysis.py: effect estimation with Holm-Bonferroni (8 tests)
- token_loss_decompose.py: per-token loss attribution (4 tests)
- quant_gap_analysis.py: pre/post quantization gap (5 tests)
- influence_proxy.py: gradient inner product shard scoring (7 tests)
- gradient_attribution.py: training loop instrumentation (7 tests)

102 tests passing, 1 skipped (checkpoint integration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- scripts/causal/README.md: cycle protocol, CLI usage for all 9 scripts
- identifiability_check.py: data quality assessment, confounded pairs, unexplored combinations
- 14 new tests, 116 total passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ations

- Fix estimate_dag.py import: common → scripts.causal.common
- Update spec: consolidate shard_variance_check.py into influence_proxy.py
- Mark T19 as deferred (depends on experiment cycle results)
- Implement phase_correlations in gradient_attribution.py (was stub)
- 116 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…utput

- Test now looks for *_mlx_model.npz (actual output format) instead of .safetensors
- Fix docstring in common.py to mention .npz
- 117/117 tests passing, 0 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 18 implementation tasks complete (117/117 tests passing).
T19 (submission assembly) deferred pending experiment cycle results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- submission_assembly.py: builds competition-ready submission directory
- Validates README sections, artifact size, submission.json schema
- --dry-run produces dummy submission for testing the pipeline end-to-end
- 15 new tests, 132 total passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Causal inference pipeline for parameter golf:
- 11 scripts in scripts/causal/ (DAG discovery, experiment runner, diagnostics, submission assembly)
- 132 tests, all passing
- Discovery-adjust cycle: extract → DAG → experiment → analyze → submit
Terry and others added 28 commits March 25, 2026 13:29
- run_pipeline.py creates SharedTrainingContext once, runs experiments
  in-process with warmup caching (skip warmup for same-architecture runs)
- Falls back to subprocess on failure
- Upgraded to Python 3.12, all deps managed via uv add
- requirements.txt exported from uv

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Experiment checkpoints may have non-default Hyperparameters (e.g.,
NUM_KV_HEADS=2). The test now explicitly looks for default_ckpt_mlx_model.npz
first, falling back to any checkpoint only if the default doesn't exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- experiment_runner.py: add --inprocess flag and inprocess_ctx param to
  run_condition() for in-process training with subprocess fallback
- statistical_analysis.py: add skipped_seeds field to comparison output
  (tracks seeds dropped due to errors or None val_bpb)
- extract_interventions.py: refactor 7 sequential regex patterns into
  table-driven _BASE_BPB_PATTERNS list (~50 lines → ~15 lines)
- token_loss_decompose.py: rename aggregate_bpb → aggregate_bpb_approx
  to make the approximation explicit

132 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before: warmup_steps=20 × grad_accum_steps=8 = 160 forward+backward passes
After:  warmup_steps=1  × grad_accum_steps=1 = 1   forward+backward pass

The _WARMED_ARCHITECTURES cache already skips warmup on same-architecture
repeat runs. This change reduces the first-run cost from ~160 passes to ~1.
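
A minimal sketch of that cache logic (every name except _WARMED_ARCHITECTURES is hypothetical):

```python
_WARMED_ARCHITECTURES: set[tuple] = set()

def maybe_warmup(arch_key: tuple, warmup_fn) -> None:
    """Run the single warmup pass only the first time an architecture
    (e.g., (n_layers, dim, activation)) is seen in this process."""
    if arch_key not in _WARMED_ARCHITECTURES:
        warmup_fn()  # one forward+backward pass to trigger compilation
        _WARMED_ARCHITECTURES.add(arch_key)
```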

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: log step progress and val_bpb at each validation
- run_pipeline: set VAL_LOSS_EVERY=0 (validate only at last step)
  The full 62M-token validation set was running every 2-4 steps,
  dominating screening time. Now runs once at the end.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: collect step_losses per training step, add
  screening_mode param that skips 62M-token validation (uses train_loss
  as comparison metric instead — 10x faster for screening)
- run_pipeline: pass screening_mode=True for screening experiments
- plot_losses.py: new script for treatment vs control loss curve plots
  with per-seed lines, mean curves, and shaded std regions
- 8 new tests (5 inprocess + 3 plot), 135 total passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: monkey-patch MLP.__call__ with configurable activation
  (gelu, silu, sin, sin_sq, fan) + auto-restore after each run (sketched below)
- run_pipeline: add "activation" search space with 4 variants
- Treatment uses specified activation; control always uses baseline relu²
- Inspired by FAN/FANformer (NeurIPS 2025): sin-based activations show
  31% parameter efficiency gains at 1B scale
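
A minimal sketch of the patch-and-restore pattern (assuming an MLX MLP class with fc/proj sublayers; those attribute names are assumptions):

```python
import mlx.core as mx

_ACTIVATIONS = {
    "relu_sq": lambda h: mx.square(mx.maximum(h, 0.0)),  # baseline relu²
    "sin_sq": lambda h: mx.square(mx.sin(h)),
}

def patch_activation(mlp_cls, name):
    """Swap mlp_cls.__call__ for a variant using the named activation;
    return an undo function for the auto-restore step."""
    original = mlp_cls.__call__
    act = _ACTIVATIONS[name]

    def patched(self, x):
        return self.proj(act(self.fc(x)))

    mlp_cls.__call__ = patched
    return lambda: setattr(mlp_cls, "__call__", original)
```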

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove FAN variant (dimension mismatch: concat doubles hidden dim)
- Add Priority 3 in select_interventions: sweep search space entries
  not in DAG (e.g., activation variants), so they get tested even though
  "activation" is not a causal DAG node
- Verified: 4 activation variants (gelu, silu, sin, sin_sq) produce
  divergent training dynamics after optimizer steps
- Verified: monkey-patch restore works between treatment/control runs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…variants

Implement two novel loss functions as monkey-patchable variants for the
causal screening pipeline. Rho-1 masks easy tokens by max-logit threshold,
focusing training on hard tokens. Adaptive-K predicts N+2 tokens on
high-confidence positions (high logit margin), with warmup period.

Both are zero-parameter, loss-only changes that compose with activation
variants. Wired into inprocess_trainer and pipeline search space.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --fast preset: 50 iters, 65K batch, 5 layers, 2 seeds (~18× faster)
- --screen-batch and --screen-layers flags for custom reduction
- Balanced reduction: treatment and control use identical reduced settings,
  preserving relative comparison validity
- Control always matches treatment's layer count (fair comparison)
- 144 tests passing

Usage:
  python scripts/causal/run_pipeline.py --fast --max-cycles 4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_logger was defined inside the training loop section but referenced
earlier in the activation patch log line. Moved to top of function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
30-second pause between experiments by default. Prevents Mac from
overheating during long screening runs. Adjustable via --cooldown N.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…results

Training scripts (3 incremental scripts, env-var controlled):
- train_gpt_r1.py: LeakyReLU², BigramHash, XSA, Value residual, U-Net skips
- train_gpt_r2.py: + 7 MLP variants (FAN, DML-Gated, DML-Orth, FAN+DML,
  CausalWide, DML-CausalWide), token dropout, corrupted context, Barlow Twins,
  adversarial embedding masking, per-position loss
- train_gpt_r3.py: + sliding window eval, Legal TTT

Screening infrastructure:
- scripts/run_full_screen.sh: smoke test + R1→R2→R3 sequential runner
- scripts/run_screen.sh: per-round runner with --gpu/--gpus flags
- scripts/run_benchmark.sh: SOTA vs our best configs
- scripts/score_and_reorder_data.py: offline difficulty scoring for curriculum

Key results (8xH100, 1 shard):
- Our corrupted context (val_bpb=1.3009) beats SOTA openai#1 (1.3315) by -0.031
- DML-Gated MLP + corruption (1.3100) also beats both SOTAs
- Data augmentation dominates: corruption > token dropout > graduated variants
- Novel MLP architectures help: DML-Gated (-0.043 vs baseline)

Tests: 116 tests (unit + integration), all passing
Specs: formal math specs with verifiable DoDs for all 20 experiments
Docs: full experiment log with ideas, references, and results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete rewrite of experiments.md with all screening results:
- Master summary table: 28 experiments across 4 hardware configs
- Phase 1 activation screen: 8 experiments (LeakyReLU best, sin² worst)
- R1 technique stack: 4 experiments with 5090 + H100 results
- R2 novel designs: 12 experiments (corrupted context best at 1.3004)
- R3 eval tricks: TTT (-0.095 bpb), sliding window (no help)
- Benchmark vs SOTA: our corrupted context beats both SOTA repros
- Results analysis: technique rankings, gap analysis vs competition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean train_gpt.py (1122 lines) with R1-5 stack + 10% corrupted context.
Defaults hardcoded: 11L/3x, BigramHash 3072, XSA4, Value Residual.
Stripped all experimental MLP variants for submission clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace int8+zlib (24.5MB, over 16MB limit) with int6 per-row
quantization (clip_range=31, best-of-5 percentile search) + lzma
preset=9 compression. Adapted from PR openai#414 GPTQ-lite approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 changes: batch 786K, warmdown 3500, XSA all 11 layers,
EMA(0.997), late QAT (STE at scale<0.15), Muon WD 0.04,
partial RoPE 16/64, momentum 0.99/1500-warmup, grad clip 0.3.
Int6 GPTQ-lite + lzma compression for <16MB artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…16MB

786K batch caused 130ms/step (vs 87ms at 524K), giving only 4616 steps
and worse val_bpb (1.2817 vs 1.2381). Revert to 524K for ~6900 steps.
Reduce bigram_dim 128→112 to save ~2MB in int6+lzma artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-row int6 (17.9MB) with per-group-64 bit-packed int6.
4 values per 3 bytes, best-of-5 percentile search per group.
Empirically validated: 27M params → 8.3MB with lzma preset=9.
QAT updated to per-group-64 to match quantization format.
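
For reference, a minimal numpy sketch of the "4 values per 3 bytes" packing (illustrative; the real serializer also carries per-group scales):

```python
import numpy as np

def pack_int6(q: np.ndarray) -> np.ndarray:
    """Pack values in 0..63 (length a multiple of 4) into 3 bytes per 4 values."""
    q = q.astype(np.uint32).reshape(-1, 4)
    word = q[:, 0] | (q[:, 1] << 6) | (q[:, 2] << 12) | (q[:, 3] << 18)  # 24 bits
    out = np.empty((len(word), 3), dtype=np.uint8)
    out[:, 0] = word & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = (word >> 16) & 0xFF
    return out.ravel()

def unpack_int6(b: np.ndarray) -> np.ndarray:
    b = b.reshape(-1, 3).astype(np.uint32)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    return np.stack([(word >> s) & 0x3F for s in (0, 6, 12, 18)], axis=1).ravel()
```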

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…0.3MB)

With per-group int6 bit-packing, artifact is 8.3MB for 27M params.
Upgrading to 4x MLP (2048 hidden) adds 5.8M params for ~10.3MB total,
still well under 16MB limit with 5.7MB headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated with 4x MLP, per-group int6 bit-packed quantization,
EMA, QAT, partial RoPE, and all SOTA fundamentals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collect per-input-channel activation magnitudes from validation data,
scale weights by importance^alpha before int6 quantization, undo at
dequant. Two independent scales: per-group quant scale + per-column
AWQ scale. Local test shows AWQ needs real trained weights to help
(random weights don't benefit). Needs H100 empirical validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace AWQ/per-group quantization with proven Full Hessian GPTQ:
- Hessian H=X^TX collected from validation data via forward hooks
- Cholesky factorization of H^{-1} for optimal error compensation
- Column reordering by Hessian diagonal magnitude
- Block-wise (128) quantization with cross-block error propagation
- Best-of-5 percentile search with error compensation
- Fallback to percentile-only for 1D/no-Hessian tensors
- Bit-packed int6 serialization + lzma compression

SOTA achieves 0.018 bpb quant loss with this approach.
Local test on random weights shows GPTQ needs real trained
weights to outperform simple percentile (expected behavior).
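
For orientation, a heavily simplified sketch of the GPTQ inner loop described above (blocking, column reordering, and the Cholesky factorization are omitted; quantize_fn stands in for the best-of-5 percentile rounding):

```python
import numpy as np

def gptq_quantize(W: np.ndarray, Hinv: np.ndarray, quantize_fn):
    """W: (rows, cols) weights; Hinv: inverse Hessian of H = X^T X.
    Quantize one column at a time and push each column's rounding error
    onto the not-yet-quantized columns, weighted by Hinv."""
    W = W.copy()
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = quantize_fn(W[:, i])
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```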

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Surgical 2-change fork of 2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233:
1. Add CORRUPT_RATE env var (default 0.1) to Hyperparameters
2. Add 5-line corruption block before model(x,y) in training loop

All other paper settings preserved: FA3, seq_len=2048, batch=786K,
mlp_mult=3, XSA=last4, EMA=0.997, SWA, warmdown=3500.

With CORRUPT_RATE=0 this script is bit-identical to the paper.
With CORRUPT_RATE=0.1 it tests our novel technique on the SOTA baseline.
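
A plausible shape for that block, as a hedged sketch (the exact five lines live in the fork and are not reproduced here; vocab_size is assumed to come from the surrounding script):

```python
import os
import torch

# Sketch of the corruption block placed before model(x, y):
corrupt_rate = float(os.environ.get("CORRUPT_RATE", "0.1"))
if corrupt_rate > 0:
    mask = torch.rand(x.shape, device=x.device) < corrupt_rate
    x = torch.where(mask, torch.randint_like(x, vocab_size), x)
```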

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: sin² activation + causal screening pipeline

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

---

## Analysis

### 1. BigramHash — CLEAN (no target leakage)

BigramHashEmbedding.bigram_hash at lines 662–668:

```python
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
```

The XOR mixes t[i] (current token) with t[i-1] (previous token). No target (next-token) is involved in the hash key. This is a standard causal bigram embedding. The ILLEGAL pattern (target XOR'd into hash key) is NOT present.

### 2. No TTT of any kind

There is no test-time training anywhere in the file. The model is trained once on train tokens and evaluated with eval_val (lines 217–273) under torch.inference_mode() — a read-only forward pass with no gradient updates on val data. No multi-epoch loop over val_tokens, no score-first guard, no is_last_chunk logic. TTT categories do not apply.

### 3. GPTQ calibration on val_tokens — CLEAN

Lines 1229–1232 use val_tokens as calibration data for Hessian collection (collect_hessians). This is a post-training quantization step, run after training completes, under torch.inference_mode() (line 336). The Hessians inform INT6 GPTQ weight rounding only — no gradient updates, no weight optimization targeting val loss. This is standard GPTQ calibration, permitted under competition rules.

### 4. Scored-region SLOT — not applicable

No evidence of scored-region slot injection or look-ahead into the evaluation window.

### 5. Architecture summary

11-layer transformer with BigramHash(3072, 112), XSA on all layers, Value Residual, U-Net skips, Partial RoPE (16 dims), EMA(0.997), 10% corrupted-context training (lines 1157–1162), Late QAT, INT6 GPTQ+lzma. All techniques operate on training data only and are causally legal.

---

## Conclusion

No...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
