
Non-record: sin² activation + causal screening pipeline#877

Open
clthuang wants to merge 61 commits into openai:main from clthuang:clthuang-dev

Conversation

@clthuang

Non-Record Submission: sin² Activation + Causal Screening Pipeline

Summary

  • sin² periodic activation replacing relu² in MLP layers (1-line change in train_gpt.py)
  • Discovered via an automated causal screening pipeline that tested 24 interventions
  • Novel explorations: adaptive multi-token prediction (variable-depth MTP) and Rho-1 selective loss masking

Key Change

```python
# Before (relu²):
x = torch.relu(self.fc(x))
return self.proj(x.square())

# After (sin²):
return self.proj(torch.sin(self.fc(x)).square())
```

Why sin²?

Motivated by FAN/FANformer (NeurIPS 2025), which reports 31% parameter-efficiency gains from periodic activations at 1B scale. sin² preserves relu²'s non-negative output range and squaring structure while eliminating the hard zero cutoff that zeroes out roughly half of all activations.
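
To make the contrast concrete, here is a minimal sketch (plain PyTorch, mirroring the one-line change above) comparing the two activations on a range of pre-activations:

```python
import torch

x = torch.linspace(-3, 3, 7)

relu_sq = torch.relu(x).square()  # hard zero for every x <= 0
sin_sq = torch.sin(x).square()    # non-negative and periodic, no dead half-line

print(relu_sq)  # zeros across the entire negative half
print(sin_sq)   # nonzero response for almost all inputs
```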

Screening Results (24 interventions, Apple Silicon MLX)

| Intervention | Train Loss Δ | Notes |
|---|---|---|
| softcap=20 | -0.447 | Strongest signal |
| dim=640 | -0.311 | Wider > deeper for short training |
| sin² activation | -0.017 | Competitive with GELU (-0.014) and SiLU (-0.018) |
| adaptive-K MTP | +0.008 | Needs longer training (warmup-dependent) |
| rho1 selective loss | pending | Threshold calibration needed |

Infrastructure

Built a 12-script automated experimentation pipeline (144 tests):

  • Causal DAG extraction from leaderboard ablation data
  • Discovery-adjust cycle with paired-seed experiments (sketched after this list)
  • In-process MLX training with warmup caching (7× faster than subprocess)
  • Per-step loss tracking with curve plotting
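
For context, a minimal sketch of the paired-seed comparison idea (illustrative only; the pipeline's actual implementation lives in scripts/causal/common.py and statistical_analysis.py, whose exact interfaces are not shown here):

```python
from scipy import stats

# Each seed trains one treatment run and one control run under identical
# settings, so differencing within a seed cancels seed-level noise.
# The numbers below are made up for illustration.
treatment = [1.302, 1.298, 1.305]  # final train loss per seed
control = [1.319, 1.317, 1.322]

t_stat, p_value = stats.ttest_rel(treatment, control)
deltas = [t - c for t, c in zip(treatment, control)]
print(f"mean delta = {sum(deltas) / len(deltas):+.4f}, p = {p_value:.3f}")
```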

Status

  • ✅ Local MLX screening complete (24 interventions)
  • ⏳ H100 validation pending compute credits
  • ⏳ 3-seed statistical significance pending

Files

  • records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/train_gpt.py — modified baseline with sin² activation
  • records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/README.md — detailed writeup
  • records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/submission.json — metadata

Terry and others added 30 commits March 24, 2026 21:06
- Install causal-learn, statsmodels, networkx, graphviz, scipy, pytest via uv
- Create scripts/causal/ and tests/causal/ directory structure
- Implement common.py with 9 utility functions (load_model, compute_bpb, paired_ttest, etc.)
- 25 tests passing, 1 skipped (load_model checkpoint integration test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on, quant gap

T14 (statistical_analysis.py): Paired-seed ablation analysis with bootstrap
CI, Holm-Bonferroni correction, decision gate, and platform transfer
coefficient (a Holm-Bonferroni sketch follows at the end of this message).

T15 (token_loss_decompose.py): Per-token loss decomposition with frequency
bucketing (top-100/mid/tail), boundary/mid-sequence classification, and
decomposition verification. Integration requires checkpoint; unit tests use
mock data.

T16 (quant_gap_analysis.py): Pre/post quantization BPB gap computation with
3x threshold check against largest training effect.

All 17 new tests pass. No regressions (102 total pass).
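
A minimal sketch of the Holm-Bonferroni step named in T14 (the textbook step-down procedure, not the pipeline's exact code):

```python
def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm correction: compare the k-th smallest p-value
    (0-indexed) against alpha / (m - k); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    rejected = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            rejected[i] = True
        else:
            break  # every larger p-value fails as well
    return rejected
```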

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parallel implementation of 8 scripts covering DAG discovery, experiment
pipeline, and diagnostic probes:

- extract_interventions.py: 3-tier README parser, 0.90 field coverage (23 tests)
- estimate_dag.py: expert DAG + FCI validation + cycle updates (9 tests)
- experiment_runner.py: paired seed ablation with error handling (14 tests)
- statistical_analysis.py: effect estimation with Holm-Bonferroni (8 tests)
- token_loss_decompose.py: per-token loss attribution (4 tests)
- quant_gap_analysis.py: pre/post quantization gap (5 tests)
- influence_proxy.py: gradient inner product shard scoring (7 tests)
- gradient_attribution.py: training loop instrumentation (7 tests)

102 tests passing, 1 skipped (checkpoint integration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- scripts/causal/README.md: cycle protocol, CLI usage for all 9 scripts
- identifiability_check.py: data quality assessment, confounded pairs, unexplored combinations
- 14 new tests, 116 total passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ations

- Fix estimate_dag.py import: common → scripts.causal.common
- Update spec: consolidate shard_variance_check.py into influence_proxy.py
- Mark T19 as deferred (depends on experiment cycle results)
- Implement phase_correlations in gradient_attribution.py (was stub)
- 116 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…utput

- Test now looks for *_mlx_model.npz (actual output format) instead of .safetensors
- Fix docstring in common.py to mention .npz
- 117/117 tests passing, 0 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 18 implementation tasks complete (117/117 tests passing).
T19 (submission assembly) deferred pending experiment cycle results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- submission_assembly.py: builds competition-ready submission directory
- Validates README sections, artifact size, submission.json schema
- --dry-run produces dummy submission for testing the pipeline end-to-end
- 15 new tests, 132 total passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Causal inference pipeline for parameter golf:
- 11 scripts in scripts/causal/ (DAG discovery, experiment runner, diagnostics, submission assembly)
- 132 tests, all passing
- Discovery-adjust cycle: extract → DAG → experiment → analyze → submit
Terry and others added 28 commits March 25, 2026 13:29
- run_pipeline.py creates SharedTrainingContext once, runs experiments
  in-process with warmup caching (skip warmup for same-architecture runs)
- Falls back to subprocess on failure
- Upgraded to Python 3.12, all deps managed via uv add
- requirements.txt exported from uv

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Experiment checkpoints may have non-default Hyperparameters (e.g.,
NUM_KV_HEADS=2). The test now explicitly looks for default_ckpt_mlx_model.npz
first, falling back to any checkpoint only if the default doesn't exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- experiment_runner.py: add --inprocess flag and inprocess_ctx param to
  run_condition() for in-process training with subprocess fallback
- statistical_analysis.py: add skipped_seeds field to comparison output
  (tracks seeds dropped due to errors or None val_bpb)
- extract_interventions.py: refactor 7 sequential regex patterns into
  table-driven _BASE_BPB_PATTERNS list (~50 lines → ~15 lines)
- token_loss_decompose.py: rename aggregate_bpb → aggregate_bpb_approx
  to make the approximation explicit

132 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before: warmup_steps=20 × grad_accum_steps=8 = 160 forward+backward passes
After:  warmup_steps=1  × grad_accum_steps=1 = 1   forward+backward pass

The _WARMED_ARCHITECTURES cache already skips warmup on same-architecture
repeat runs. This change reduces the first-run cost from ~160 passes to ~1.
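
A minimal sketch of that cache logic (every name except _WARMED_ARCHITECTURES is hypothetical):

```python
_WARMED_ARCHITECTURES: set[tuple] = set()

def maybe_warmup(arch_key: tuple, warmup_fn) -> None:
    """Run the single warmup pass only the first time an architecture
    (e.g., (n_layers, dim, activation)) is seen in this process."""
    if arch_key not in _WARMED_ARCHITECTURES:
        warmup_fn()  # one forward+backward pass to trigger compilation
        _WARMED_ARCHITECTURES.add(arch_key)
```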

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: log step progress and val_bpb at each validation
- run_pipeline: set VAL_LOSS_EVERY=0 (validate only at last step)
  The full 62M-token validation set was running every 2-4 steps,
  dominating screening time. Now runs once at the end.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: collect step_losses per training step, add
  screening_mode param that skips 62M-token validation (uses train_loss
  as comparison metric instead — 10x faster for screening)
- run_pipeline: pass screening_mode=True for screening experiments
- plot_losses.py: new script for treatment vs control loss curve plots
  with per-seed lines, mean curves, and shaded std regions
- 8 new tests (5 inprocess + 3 plot), 135 total passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: monkey-patch MLP.__call__ with configurable activation
  (gelu, silu, sin, sin_sq, fan) + auto-restore after each run (sketched below)
- run_pipeline: add "activation" search space with 4 variants
- Treatment uses specified activation; control always uses baseline relu²
- Inspired by FAN/FANformer (NeurIPS 2025): sin-based activations show
  31% parameter efficiency gains at 1B scale
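
A minimal sketch of the patch-and-restore pattern (assuming an MLX MLP class with fc/proj sublayers; those attribute names are assumptions):

```python
import mlx.core as mx

_ACTIVATIONS = {
    "relu_sq": lambda h: mx.square(mx.maximum(h, 0.0)),  # baseline relu²
    "sin_sq": lambda h: mx.square(mx.sin(h)),
}

def patch_activation(mlp_cls, name):
    """Swap mlp_cls.__call__ for a variant using the named activation;
    return an undo function for the auto-restore step."""
    original = mlp_cls.__call__
    act = _ACTIVATIONS[name]

    def patched(self, x):
        return self.proj(act(self.fc(x)))

    mlp_cls.__call__ = patched
    return lambda: setattr(mlp_cls, "__call__", original)
```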

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove FAN variant (dimension mismatch: concat doubles hidden dim)
- Add Priority 3 in select_interventions: sweep search space entries
  not in DAG (e.g., activation variants), so they get tested even though
  "activation" is not a causal DAG node
- Verified: 4 activation variants (gelu, silu, sin, sin_sq) produce
  divergent training dynamics after optimizer steps
- Verified: monkey-patch restore works between treatment/control runs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…variants

Implement two novel loss functions as monkey-patchable variants for the
causal screening pipeline. Rho-1 masks easy tokens by max-logit threshold,
focusing training on hard tokens. Adaptive-K predicts N+2 tokens on
high-confidence positions (high logit margin), with warmup period.

Both are zero-parameter, loss-only changes that compose with activation
variants. Wired into inprocess_trainer and pipeline search space.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --fast preset: 50 iters, 65K batch, 5 layers, 2 seeds (~18× faster)
- --screen-batch and --screen-layers flags for custom reduction
- Balanced reduction: treatment and control use identical reduced settings,
  preserving relative comparison validity
- Control always matches treatment's layer count (fair comparison)
- 144 tests passing

Usage:
  python scripts/causal/run_pipeline.py --fast --max-cycles 4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_logger was defined inside the training loop section but referenced
earlier in the activation patch log line. Moved to top of function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
30-second pause between experiments by default. Prevents Mac from
overheating during long screening runs. Adjustable via --cooldown N.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…results

Training scripts (3 incremental scripts, env-var controlled):
- train_gpt_r1.py: LeakyReLU², BigramHash, XSA, Value residual, U-Net skips
- train_gpt_r2.py: + 7 MLP variants (FAN, DML-Gated, DML-Orth, FAN+DML,
  CausalWide, DML-CausalWide), token dropout, corrupted context, Barlow Twins,
  adversarial embedding masking, per-position loss
- train_gpt_r3.py: + sliding window eval, Legal TTT

Screening infrastructure:
- scripts/run_full_screen.sh: smoke test + R1→R2→R3 sequential runner
- scripts/run_screen.sh: per-round runner with --gpu/--gpus flags
- scripts/run_benchmark.sh: SOTA vs our best configs
- scripts/score_and_reorder_data.py: offline difficulty scoring for curriculum

Key results (8xH100, 1 shard):
- Our corrupted context (val_bpb=1.3009) beats SOTA openai#1 (1.3315) by -0.031
- DML-Gated MLP + corruption (1.3100) also beats both SOTAs
- Data augmentation dominates: corruption > token dropout > graduated variants
- Novel MLP architectures help: DML-Gated (-0.043 vs baseline)

Tests: 116 tests (unit + integration), all passing
Specs: formal math specs with verifiable DoDs for all 20 experiments
Docs: full experiment log with ideas, references, and results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete rewrite of experiments.md with all screening results:
- Master summary table: 28 experiments across 4 hardware configs
- Phase 1 activation screen: 8 experiments (LeakyReLU best, sin² worst)
- R1 technique stack: 4 experiments with 5090 + H100 results
- R2 novel designs: 12 experiments (corrupted context best at 1.3004)
- R3 eval tricks: TTT (-0.095 bpb), sliding window (no help)
- Benchmark vs SOTA: our corrupted context beats both SOTA repros
- Results analysis: technique rankings, gap analysis vs competition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean train_gpt.py (1122 lines) with R1-5 stack + 10% corrupted context.
Defaults hardcoded: 11L/3x, BigramHash 3072, XSA4, Value Residual.
Stripped all experimental MLP variants for submission clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace int8+zlib (24.5MB, over 16MB limit) with int6 per-row
quantization (clip_range=31, best-of-5 percentile search) + lzma
preset=9 compression. Adapted from PR openai#414 GPTQ-lite approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 changes: batch 786K, warmdown 3500, XSA all 11 layers,
EMA(0.997), late QAT (STE at scale<0.15), Muon WD 0.04,
partial RoPE 16/64, momentum 0.99/1500-warmup, grad clip 0.3.
Int6 GPTQ-lite + lzma compression for <16MB artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…16MB

786K batch caused 130ms/step (vs 87ms at 524K), giving only 4616 steps
and worse val_bpb (1.2817 vs 1.2381). Revert to 524K for ~6900 steps.
Reduce bigram_dim 128→112 to save ~2MB in int6+lzma artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-row int6 (17.9MB) with per-group-64 bit-packed int6.
4 values per 3 bytes, best-of-5 percentile search per group.
Empirically validated: 27M params → 8.3MB with lzma preset=9.
QAT updated to per-group-64 to match quantization format.
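
For reference, a minimal numpy sketch of the "4 values per 3 bytes" packing (illustrative; the real serializer also carries per-group scales):

```python
import numpy as np

def pack_int6(q: np.ndarray) -> np.ndarray:
    """Pack values in 0..63 (length a multiple of 4) into 3 bytes per 4 values."""
    q = q.astype(np.uint32).reshape(-1, 4)
    word = q[:, 0] | (q[:, 1] << 6) | (q[:, 2] << 12) | (q[:, 3] << 18)  # 24 bits
    out = np.empty((len(word), 3), dtype=np.uint8)
    out[:, 0] = word & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = (word >> 16) & 0xFF
    return out.ravel()

def unpack_int6(b: np.ndarray) -> np.ndarray:
    b = b.reshape(-1, 3).astype(np.uint32)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    return np.stack([(word >> s) & 0x3F for s in (0, 6, 12, 18)], axis=1).ravel()
```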

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…0.3MB)

With per-group int6 bit-packing, artifact is 8.3MB for 27M params.
Upgrading to 4x MLP (2048 hidden) adds 5.8M params for ~10.3MB total,
still well under 16MB limit with 5.7MB headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated with 4x MLP, per-group int6 bit-packed quantization,
EMA, QAT, partial RoPE, and all SOTA fundamentals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collect per-input-channel activation magnitudes from validation data,
scale weights by importance^alpha before int6 quantization, undo at
dequant. Two independent scales: per-group quant scale + per-column
AWQ scale. Local test shows AWQ needs real trained weights to help
(random weights don't benefit). Needs H100 empirical validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace AWQ/per-group quantization with proven Full Hessian GPTQ:
- Hessian H=X^TX collected from validation data via forward hooks
- Cholesky factorization of H^{-1} for optimal error compensation
- Column reordering by Hessian diagonal magnitude
- Block-wise (128) quantization with cross-block error propagation
- Best-of-5 percentile search with error compensation
- Fallback to percentile-only for 1D/no-Hessian tensors
- Bit-packed int6 serialization + lzma compression

SOTA achieves 0.018 bpb quant loss with this approach.
Local test on random weights shows GPTQ needs real trained
weights to outperform simple percentile (expected behavior).
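
For orientation, a heavily simplified sketch of the GPTQ inner loop described above (blocking, column reordering, and the Cholesky factorization are omitted; quantize_fn stands in for the best-of-5 percentile rounding):

```python
import numpy as np

def gptq_quantize(W: np.ndarray, Hinv: np.ndarray, quantize_fn):
    """W: (rows, cols) weights; Hinv: inverse Hessian of H = X^T X.
    Quantize one column at a time and push each column's rounding error
    onto the not-yet-quantized columns, weighted by Hinv."""
    W = W.copy()
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = quantize_fn(W[:, i])
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q
```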

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Surgical 2-change fork of 2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233:
1. Add CORRUPT_RATE env var (default 0.1) to Hyperparameters
2. Add 5-line corruption block before model(x,y) in training loop

All other paper settings preserved: FA3, seq_len=2048, batch=786K,
mlp_mult=3, XSA=last4, EMA=0.997, SWA, warmdown=3500.

With CORRUPT_RATE=0 this script is bit-identical to the paper.
With CORRUPT_RATE=0.1 it tests our novel technique on the SOTA baseline.
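
A plausible shape for that block, as a hedged sketch (the exact five lines live in the fork and are not reproduced here; vocab_size is assumed to come from the surrounding script):

```python
import os
import torch

# Sketch of the corruption block placed before model(x, y):
corrupt_rate = float(os.environ.get("CORRUPT_RATE", "0.1"))
if corrupt_rate > 0:
    mask = torch.rand(x.shape, device=x.device) < corrupt_rate
    x = torch.where(mask, torch.randint_like(x, vocab_size), x)
```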

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: sin² activation + causal screening pipeline

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

---

## Analysis

### 1. BigramHash — CLEAN (no target leakage)

BigramHashEmbedding.bigram_hash at lines 662–668:

```python
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
```

The XOR mixes t[i] (current token) with t[i-1] (previous token). No target (next-token) is involved in the hash key. This is a standard causal bigram embedding. The ILLEGAL pattern (target XOR'd into hash key) is NOT present.

### 2. No TTT of any kind

There is no test-time training anywhere in the file. The model is trained once on train tokens and evaluated with eval_val (lines 217–273) under torch.inference_mode() — a read-only forward pass with no gradient updates on val data. No multi-epoch loop over val_tokens, no score-first guard, no is_last_chunk logic. TTT categories do not apply.

### 3. GPTQ calibration on val_tokens — CLEAN

Lines 1229–1232 use val_tokens as calibration data for Hessian collection (collect_hessians). This is a post-training quantization step, run after training completes, under torch.inference_mode() (line 336). The Hessians inform INT6 GPTQ weight rounding only — no gradient updates, no weight optimization targeting val loss. This is standard GPTQ calibration, permitted under competition rules.

### 4. Scored-region SLOT — not applicable

No evidence of scored-region slot injection or look-ahead into the evaluation window.

### 5. Architecture summary

11-layer transformer with BigramHash(3072, 112), XSA on all layers, Value Residual, U-Net skips, Partial RoPE (16 dims), EMA(0.997), 10% corrupted-context training (lines 1157–1162), Late QAT, INT6 GPTQ+lzma. All techniques operate on training data only and are causally legal.

---

## Conclusion

No...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
