Non-record: sin² activation + causal screening pipeline #877
clthuang wants to merge 61 commits into openai:main from
Conversation
… feedback mechanism, disambiguation
…d mode, C10 sentinel, diagnostic paths
…p_max_cycles, shard_variance note
…ausal/requirements.txt
…error handling, complexity labels
…kpoint prereqs, FCI degenerate test
…ghten descriptions
…st, parsing, dep check
- Install causal-learn, statsmodels, networkx, graphviz, scipy, pytest via uv - Create scripts/causal/ and tests/causal/ directory structure - Implement common.py with 9 utility functions (load_model, compute_bpb, paired_ttest, etc.) - 25 tests passing, 1 skipped (load_model checkpoint integration test) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
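For reference, a minimal sketch of two of the `common.py` helpers named above (`compute_bpb`, `paired_ttest`); the signatures and defaults here are assumptions, and the shipped versions may differ:

```python
import math

import numpy as np
from scipy import stats


def compute_bpb(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats into bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


def paired_ttest(treatment: list[float], control: list[float]) -> tuple[float, float]:
    """Paired t-test over per-seed metrics; returns (mean difference, p-value)."""
    t, c = np.asarray(treatment, dtype=float), np.asarray(control, dtype=float)
    assert t.shape == c.shape, "treatment and control must be paired by seed"
    t_stat, p_value = stats.ttest_rel(t, c)
    return float((t - c).mean()), float(p_value)
```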
…on, quant gap T14 (statistical_analysis.py): Paired-seed ablation analysis with bootstrap CI, Holm-Bonferroni correction, decision gate, and platform transfer coefficient. T15 (token_loss_decompose.py): Per-token loss decomposition with frequency bucketing (top-100/mid/tail), boundary/mid-sequence classification, and decomposition verification. Integration requires checkpoint; unit tests use mock data. T16 (quant_gap_analysis.py): Pre/post quantization BPB gap computation with 3x threshold check against largest training effect. All 17 new tests pass. No regressions (102 total pass). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
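A compact sketch of the two statistical pieces mentioned for `statistical_analysis.py`, a Holm-Bonferroni step-down and a percentile bootstrap CI; function names and defaults are illustrative rather than the script's actual API:

```python
import numpy as np


def holm_bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Step-down Holm-Bonferroni: reject hypotheses in order of increasing p-value
    while p_(i) <= alpha / (m - i); stop rejecting at the first failure."""
    items = sorted(p_values.items(), key=lambda kv: kv[1])
    m, rejected, still_rejecting = len(items), {}, True
    for i, (name, p) in enumerate(items):
        still_rejecting = still_rejecting and p <= alpha / (m - i)
        rejected[name] = still_rejecting
    return rejected


def bootstrap_ci(diffs, n_boot: int = 10_000, level: float = 0.95, seed: int = 0):
    """Percentile bootstrap CI for the mean paired (treatment - control) difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    means = rng.choice(diffs, size=(n_boot, diffs.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(lo), float(hi)
```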
Parallel implementation of 8 scripts covering DAG discovery, experiment pipeline, and diagnostic probes: - extract_interventions.py: 3-tier README parser, 0.90 field coverage (23 tests) - estimate_dag.py: expert DAG + FCI validation + cycle updates (9 tests) - experiment_runner.py: paired seed ablation with error handling (14 tests) - statistical_analysis.py: effect estimation with Holm-Bonferroni (8 tests) - token_loss_decompose.py: per-token loss attribution (4 tests) - quant_gap_analysis.py: pre/post quantization gap (5 tests) - influence_proxy.py: gradient inner product shard scoring (7 tests) - gradient_attribution.py: training loop instrumentation (7 tests) 102 tests passing, 1 skipped (checkpoint integration). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
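Of the scripts above, the shard scoring in `influence_proxy.py` is the least standard piece; a torch-flavoured sketch of the gradient-inner-product idea (the real script's API and framework are assumptions here):

```python
import torch


def shard_influence_scores(model, loss_fn, val_batch, shard_batches):
    """Score each training shard by the inner product between its gradient and the
    validation gradient; a large positive score suggests the shard pushes parameters
    in a direction that also lowers validation loss."""

    def flat_grad(batch):
        model.zero_grad(set_to_none=True)
        loss_fn(model, batch).backward()
        return torch.cat([p.grad.flatten() for p in model.parameters()
                          if p.grad is not None]).detach()

    g_val = flat_grad(val_batch)
    return {name: float(torch.dot(flat_grad(batch), g_val))
            for name, batch in shard_batches.items()}
```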
- scripts/causal/README.md: cycle protocol, CLI usage for all 9 scripts - identifiability_check.py: data quality assessment, confounded pairs, unexplored combinations - 14 new tests, 116 total passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ations - Fix estimate_dag.py import: common → scripts.causal.common - Update spec: consolidate shard_variance_check.py into influence_proxy.py - Mark T19 as deferred (depends on experiment cycle results) - Implement phase_correlations in gradient_attribution.py (was stub) - 116 tests passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…utput - Test now looks for *_mlx_model.npz (actual output format) instead of .safetensors - Fix docstring in common.py to mention .npz - 117/117 tests passing, 0 skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 18 implementation tasks complete (117/117 tests passing). T19 (submission assembly) deferred pending experiment cycle results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- submission_assembly.py: builds competition-ready submission directory - Validates README sections, artifact size, submission.json schema - --dry-run produces dummy submission for testing the pipeline end-to-end - 15 new tests, 132 total passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
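A rough sketch of the validation step described above; the required README sections, `submission.json` keys, and artifact glob are placeholders, not the actual schema:

```python
import json
from pathlib import Path

REQUIRED_JSON_KEYS = {"name", "description", "val_bpb"}    # placeholder schema
REQUIRED_README_SECTIONS = ("## Summary", "## Results")    # placeholder sections
MAX_ARTIFACT_BYTES = 16 * 1024 * 1024                      # 16MB track limit


def validate_submission(sub_dir: Path) -> list[str]:
    """Return a list of human-readable validation errors; an empty list means valid."""
    errors = []
    meta = sub_dir / "submission.json"
    if meta.exists():
        missing = REQUIRED_JSON_KEYS - json.loads(meta.read_text()).keys()
        if missing:
            errors.append(f"submission.json missing keys: {sorted(missing)}")
    else:
        errors.append("missing submission.json")
    readme = sub_dir / "README.md"
    if readme.exists():
        text = readme.read_text()
        errors += [f"README missing section: {s}" for s in REQUIRED_README_SECTIONS if s not in text]
    else:
        errors.append("missing README.md")
    for artifact in sub_dir.glob("*.bin"):                  # artifact extension is an assumption
        if artifact.stat().st_size > MAX_ARTIFACT_BYTES:
            errors.append(f"{artifact.name} exceeds the 16MB artifact limit")
    return errors
```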
Causal inference pipeline for parameter golf: - 11 scripts in scripts/causal/ (DAG discovery, experiment runner, diagnostics, submission assembly) - 132 tests, all passing - Discovery-adjust cycle: extract → DAG → experiment → analyze → submit
- run_pipeline.py creates SharedTrainingContext once, runs experiments in-process with warmup caching (skip warmup for same-architecture runs) - Falls back to subprocess on failure - Upgraded to Python 3.12, all deps managed via uv add - requirements.txt exported from uv Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
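The in-process-first-then-subprocess logic reads roughly like the sketch below; `SharedTrainingContext.train` and `parse_metrics` are assumed names standing in for the actual pipeline API:

```python
import os
import subprocess
import sys


def run_condition(config: dict, inprocess_ctx=None) -> dict:
    """Prefer an in-process run on the shared, already-warmed-up training context;
    fall back to launching the training script in a fresh subprocess on any failure."""
    if inprocess_ctx is not None:
        try:
            return inprocess_ctx.train(config)               # assumed SharedTrainingContext API
        except Exception as exc:
            print(f"in-process run failed ({exc}); falling back to subprocess", file=sys.stderr)
    env = {**os.environ, **{k.upper(): str(v) for k, v in config.items()}}
    proc = subprocess.run([sys.executable, "train_gpt.py"], env=env,
                          capture_output=True, text=True, check=True)
    return parse_metrics(proc.stdout)                        # assumed log parser
```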
Experiment checkpoints may have non-default Hyperparameters (e.g., NUM_KV_HEADS=2). The test now explicitly looks for default_ckpt_mlx_model.npz first, falling back to any checkpoint only if the default doesn't exist. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- experiment_runner.py: add --inprocess flag and inprocess_ctx param to run_condition() for in-process training with subprocess fallback - statistical_analysis.py: add skipped_seeds field to comparison output (tracks seeds dropped due to errors or None val_bpb) - extract_interventions.py: refactor 7 sequential regex patterns into table-driven _BASE_BPB_PATTERNS list (~50 lines → ~15 lines) - token_loss_decompose.py: rename aggregate_bpb → aggregate_bpb_approx to make the approximation explicit 132 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
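The table-driven parser refactor looks roughly like this (illustrative patterns only; the actual `_BASE_BPB_PATTERNS` entries in `extract_interventions.py` differ):

```python
import re

# Each entry is a compiled pattern whose "bpb" group captures the baseline value.
_BASE_BPB_PATTERNS = [
    re.compile(r"baseline[^0-9]*(?P<bpb>\d+\.\d+)\s*bpb", re.IGNORECASE),
    re.compile(r"val_bpb[=:\s]+(?P<bpb>\d+\.\d+)", re.IGNORECASE),
    re.compile(r"bits per byte[^0-9]*(?P<bpb>\d+\.\d+)", re.IGNORECASE),
]


def parse_base_bpb(readme_text: str) -> float | None:
    """Return the first baseline BPB matched by any pattern, or None if none match."""
    for pattern in _BASE_BPB_PATTERNS:
        match = pattern.search(readme_text)
        if match:
            return float(match.group("bpb"))
    return None
```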
Before: warmup_steps=20 × grad_accum_steps=8 = 160 forward+backward passes After: warmup_steps=1 × grad_accum_steps=1 = 1 forward+backward pass The _WARMED_ARCHITECTURES cache already skips warmup on same-architecture repeat runs. This change reduces the first-run cost from ~160 passes to ~1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: log step progress and val_bpb at each validation - run_pipeline: set VAL_LOSS_EVERY=0 (validate only at last step) The full 62M-token validation set was running every 2-4 steps, dominating screening time. Now runs once at the end. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- inprocess_trainer: collect step_losses per training step, add screening_mode param that skips 62M-token validation (uses train_loss as comparison metric instead — 10x faster for screening) - run_pipeline: pass screening_mode=True for screening experiments - plot_losses.py: new script for treatment vs control loss curve plots with per-seed lines, mean curves, and shaded std regions - 8 new tests (5 inprocess + 3 plot), 135 total passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
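The plotting described for `plot_losses.py` amounts to something like this matplotlib sketch (the array layout is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_treatment_vs_control(step_losses: dict[str, np.ndarray], out_path: str) -> None:
    """step_losses maps condition name -> per-step train losses, shape (n_seeds, n_steps)."""
    fig, ax = plt.subplots()
    for name, curves in step_losses.items():
        steps = np.arange(curves.shape[1])
        for seed_curve in curves:                            # faint per-seed lines
            ax.plot(steps, seed_curve, alpha=0.25, linewidth=0.8)
        mean, std = curves.mean(axis=0), curves.std(axis=0)
        ax.plot(steps, mean, label=name, linewidth=2)        # mean curve
        ax.fill_between(steps, mean - std, mean + std, alpha=0.2)  # shaded ±1 std region
    ax.set_xlabel("training step")
    ax.set_ylabel("train loss")
    ax.legend()
    fig.savefig(out_path, dpi=150)
```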
- inprocess_trainer: monkey-patch MLP.__call__ with configurable activation (gelu, silu, sin, sin_sq, fan) + auto-restore after each run - run_pipeline: add "activation" search space with 4 variants - Treatment uses specified activation; control always uses baseline relu² - Inspired by FAN/FANformer (NeurIPS 2025): sin-based activations show 31% parameter efficiency gains at 1B scale Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
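The patch-and-restore pattern, sketched in torch-flavoured Python (the actual trainer patches the MLX MLP class, and the `c_fc`/`c_proj` attribute names are assumptions):

```python
import contextlib

import torch
import torch.nn.functional as F

_ACTIVATIONS = {
    "gelu":   F.gelu,
    "silu":   F.silu,
    "sin":    torch.sin,
    "sin_sq": lambda x: torch.sin(x) ** 2,
}


@contextlib.contextmanager
def patched_mlp_activation(mlp_cls, name: str):
    """Temporarily replace mlp_cls.__call__ so the hidden activation is `name`,
    restoring the original method afterwards even if the run raises."""
    original_call = mlp_cls.__call__
    act = _ACTIVATIONS[name]

    def patched_call(self, x):
        # assumed two-projection MLP: up-projection, activation, down-projection
        return self.c_proj(act(self.c_fc(x)))

    mlp_cls.__call__ = patched_call
    try:
        yield
    finally:
        mlp_cls.__call__ = original_call
```

Treatment runs execute inside the context manager; the control run outside it keeps the baseline relu².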
- Remove FAN variant (dimension mismatch: concat doubles hidden dim) - Add Priority 3 in select_interventions: sweep search space entries not in DAG (e.g., activation variants), so they get tested even though "activation" is not a causal DAG node - Verified: 4 activation variants (gelu, silu, sin, sin_sq) produce divergent training dynamics after optimizer steps - Verified: monkey-patch restore works between treatment/control runs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…variants Implement two novel loss functions as monkey-patchable variants for the causal screening pipeline. Rho-1 masks easy tokens by max-logit threshold, focusing training on hard tokens. Adaptive-K predicts N+2 tokens on high-confidence positions (high logit margin), with warmup period. Both are zero-parameter, loss-only changes that compose with activation variants. Wired into inprocess_trainer and pipeline search space. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
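A torch-flavoured sketch of the Rho-1-style selective loss; the commit describes masking by a max-logit threshold, and the threshold value and averaging choice below are assumptions:

```python
import torch
import torch.nn.functional as F


def rho1_masked_loss(logits: torch.Tensor, targets: torch.Tensor,
                     easy_logit_threshold: float = 5.0) -> torch.Tensor:
    """Rho-1-style selective loss: tokens whose max logit already exceeds the
    threshold are treated as 'easy' and dropped from the loss; the remaining
    hard tokens are averaged. logits: (B, S, V), targets: (B, S)."""
    per_token = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                                reduction="none").view(targets.shape)
    with torch.no_grad():
        hard = logits.amax(dim=-1) < easy_logit_threshold   # keep only hard tokens
    return (per_token * hard).sum() / hard.sum().clamp(min=1)
```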
- --fast preset: 50 iters, 65K batch, 5 layers, 2 seeds (~18× faster) - --screen-batch and --screen-layers flags for custom reduction - Balanced reduction: treatment and control use identical reduced settings, preserving relative comparison validity - Control always matches treatment's layer count (fair comparison) - 144 tests passing Usage: python scripts/causal/run_pipeline.py --fast --max-cycles 4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_logger was defined inside the training loop section but referenced earlier in the activation patch log line. Moved to top of function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
30-second pause between experiments by default. Prevents Mac from overheating during long screening runs. Adjustable via --cooldown N. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…results Training scripts (3 incremental scripts, env-var controlled): - train_gpt_r1.py: LeakyReLU², BigramHash, XSA, Value residual, U-Net skips - train_gpt_r2.py: + 7 MLP variants (FAN, DML-Gated, DML-Orth, FAN+DML, CausalWide, DML-CausalWide), token dropout, corrupted context, Barlow Twins, adversarial embedding masking, per-position loss - train_gpt_r3.py: + sliding window eval, Legal TTT Screening infrastructure: - scripts/run_full_screen.sh: smoke test + R1→R2→R3 sequential runner - scripts/run_screen.sh: per-round runner with --gpu/--gpus flags - scripts/run_benchmark.sh: SOTA vs our best configs - scripts/score_and_reorder_data.py: offline difficulty scoring for curriculum Key results (8xH100, 1 shard): - Our corrupted context (val_bpb=1.3009) beats SOTA openai#1 (1.3315) by -0.031 - DML-Gated MLP + corruption (1.3100) also beats both SOTAs - Data augmentation dominates: corruption > token dropout > graduated variants - Novel MLP architectures help: DML-Gated (-0.043 vs baseline) Tests: 116 tests (unit + integration), all passing Specs: formal math specs with verifiable DoDs for all 20 experiments Docs: full experiment log with ideas, references, and results Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete rewrite of experiments.md with all screening results: - Master summary table: 28 experiments across 4 hardware configs - Phase 1 activation screen: 8 experiments (LeakyReLU best, sin² worst) - R1 technique stack: 4 experiments with 5090 + H100 results - R2 novel designs: 12 experiments (corrupted context best at 1.3004) - R3 eval tricks: TTT (-0.095 bpb), sliding window (no help) - Benchmark vs SOTA: our corrupted context beats both SOTA repros - Results analysis: technique rankings, gap analysis vs competition Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean train_gpt.py (1122 lines) with R1-5 stack + 10% corrupted context. Defaults hardcoded: 11L/3x, BigramHash 3072, XSA4, Value Residual. Stripped all experimental MLP variants for submission clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace int8+zlib (24.5MB, over 16MB limit) with int6 per-row quantization (clip_range=31, best-of-5 percentile search) + lzma preset=9 compression. Adapted from PR openai#414 GPTQ-lite approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9 changes: batch 786K, warmdown 3500, XSA all 11 layers, EMA(0.997), late QAT (STE at scale<0.15), Muon WD 0.04, partial RoPE 16/64, momentum 0.99/1500-warmup, grad clip 0.3. Int6 GPTQ-lite + lzma compression for <16MB artifact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…16MB 786K batch caused 130ms/step (vs 87ms at 524K), giving only 4616 steps and worse val_bpb (1.2817 vs 1.2381). Revert to 524K for ~6900 steps. Reduce bigram_dim 128→112 to save ~2MB in int6+lzma artifact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-row int6 (17.9MB) with per-group-64 bit-packed int6. 4 values per 3 bytes, best-of-5 percentile search per group. Empirically validated: 27M params → 8.3MB with lzma preset=9. QAT updated to per-group-64 to match quantization format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
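The bit-packing works out as in the numpy sketch below: after per-group-64 quantization the signed codes are offset into 0..63 and every 4 codes are packed into 3 bytes; the per-group scales are stored separately and omitted here.

```python
import numpy as np


def pack_int6(codes: np.ndarray) -> bytes:
    """Pack unsigned 6-bit codes (0..63) as 4 codes per 3 bytes.
    The caller pads `codes` to a multiple of 4 beforehand."""
    q = codes.astype(np.uint32).reshape(-1, 4)
    word = q[:, 0] | (q[:, 1] << 6) | (q[:, 2] << 12) | (q[:, 3] << 18)   # one 24-bit word
    packed = np.stack([word & 0xFF, (word >> 8) & 0xFF, (word >> 16) & 0xFF], axis=1)
    return packed.astype(np.uint8).tobytes()


def unpack_int6(buf: bytes, n: int) -> np.ndarray:
    """Inverse of pack_int6: recover the first n 6-bit codes."""
    b = np.frombuffer(buf, dtype=np.uint8).astype(np.uint32).reshape(-1, 3)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    q = np.stack([word & 63, (word >> 6) & 63, (word >> 12) & 63, (word >> 18) & 63], axis=1)
    return q.reshape(-1)[:n]
```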
…0.3MB) With per-group int6 bit-packing, artifact is 8.3MB for 27M params. Upgrading to 4x MLP (2048 hidden) adds 5.8M params for ~10.3MB total, still well under 16MB limit with 5.7MB headroom. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated with 4x MLP, per-group int6 bit-packed quantization, EMA, QAT, partial RoPE, and all SOTA fundamentals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collect per-input-channel activation magnitudes from validation data, scale weights by importance^alpha before int6 quantization, undo at dequant. Two independent scales: per-group quant scale + per-column AWQ scale. Local test shows AWQ needs real trained weights to help (random weights don't benefit). Needs H100 empirical validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
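The scale/undo step can be sketched as follows (numpy; the default alpha and the clipping floor are assumptions):

```python
import numpy as np


def awq_scale(W: np.ndarray, act_magnitude: np.ndarray, alpha: float = 0.5):
    """Amplify important input channels before int6 quantization so they take less
    rounding error. W: (out_features, in_features); act_magnitude: mean |activation|
    per input channel, collected from validation data."""
    scale = np.clip(act_magnitude, 1e-5, None) ** alpha      # per-column importance^alpha
    return W * scale[None, :], scale


def awq_unscale(W_dequant: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Undo the per-column AWQ scale after dequantizing the int6 codes."""
    return W_dequant / scale[None, :]
```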
Replace AWQ/per-group quantization with proven Full Hessian GPTQ:
- Hessian H=X^TX collected from validation data via forward hooks
- Cholesky factorization of H^{-1} for optimal error compensation
- Column reordering by Hessian diagonal magnitude
- Block-wise (128) quantization with cross-block error propagation
- Best-of-5 percentile search with error compensation
- Fallback to percentile-only for 1D/no-Hessian tensors
- Bit-packed int6 serialization + lzma compression
SOTA achieves 0.018 bpb quant loss with this approach.
Local test on random weights shows GPTQ needs real trained
weights to outperform simple percentile (expected behavior).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
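The core column-by-column loop, as a numpy sketch. Block-wise (128) processing, column reordering, the best-of-5 percentile search, and the 1D/no-Hessian fallback are omitted; `quant_fn` stands in for the int6 round-trip:

```python
import numpy as np


def gptq_quantize(W: np.ndarray, H: np.ndarray, quant_fn, damp: float = 0.01) -> np.ndarray:
    """Quantize columns of W one at a time, propagating each column's quantization
    error onto the not-yet-quantized columns via the Cholesky factor of the damped
    inverse Hessian H^{-1}, where H = X^T X from calibration activations."""
    cols = W.shape[1]
    H = H + damp * np.mean(np.diag(H)) * np.eye(cols)        # damping for numerical stability
    Hinv_u = np.linalg.cholesky(np.linalg.inv(H)).T          # upper-triangular factor of H^{-1}
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(cols):
        Q[:, j] = quant_fn(W[:, j])                          # round column j to the int6 grid
        err = (W[:, j] - Q[:, j]) / Hinv_u[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv_u[j, j + 1:])     # error compensation on later columns
    return Q
```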
Surgical 2-change fork of 2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233: 1. Add CORRUPT_RATE env var (default 0.1) to Hyperparameters 2. Add 5-line corruption block before model(x,y) in training loop All other paper settings preserved: FA3, seq_len=2048, batch=786K, mlp_mult=3, XSA=last4, EMA=0.997, SWA, warmdown=3500. With CORRUPT_RATE=0 this script is bit-identical to the paper. With CORRUPT_RATE=0.1 it tests our novel technique on the SOTA baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
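The 5-line corruption block amounts to the following (torch sketch; the names `x`, `y`, `vocab_size`, and the `args.corrupt_rate` plumbing are assumptions):

```python
import torch


def corrupt_inputs(x: torch.Tensor, vocab_size: int, corrupt_rate: float = 0.1) -> torch.Tensor:
    """Replace each input token with a uniformly random vocabulary id with
    probability corrupt_rate; the targets y are left untouched by the caller."""
    if corrupt_rate <= 0:
        return x
    corrupt_mask = torch.rand(x.shape, device=x.device) < corrupt_rate
    return torch.where(corrupt_mask, torch.randint_like(x, high=vocab_size), x)


# in the training loop, immediately before the forward pass:
#     x = corrupt_inputs(x, vocab_size, corrupt_rate=args.corrupt_rate)
#     loss = model(x, y)
```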
Community Review — Non-record: sin² activation + causal screening pipeline

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

---

## Analysis

### 1. BigramHash — CLEAN (no target leakage)

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
Non-Record Submission: sin² Activation + Causal Screening Pipeline
Summary
train_gpt.py)
Key Change
Why sin²?
Motivated by FAN/FANformer (NeurIPS 2025), which reports 31% parameter-efficiency gains from periodic activations at 1B scale. sin² preserves relu²'s non-negative output range and squaring structure while eliminating the hard zero cutoff that zeroes out roughly half of the activations.
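As a concrete reference (torch-flavoured sketch; the submission's train_gpt.py applies the activation inline inside the MLP):

```python
import torch


def relu_sq(x: torch.Tensor) -> torch.Tensor:   # baseline: hard zero for x < 0
    return torch.relu(x) ** 2


def sin_sq(x: torch.Tensor) -> torch.Tensor:    # proposed: non-negative, periodic, no dead region
    return torch.sin(x) ** 2
```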
Screening Results (24 interventions, Apple Silicon MLX)
Infrastructure
Built a 12-script automated experimentation pipeline (144 tests):
Status
Files
records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/train_gpt.py — modified baseline with sin² activation
records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/README.md — detailed writeup
records/track_non_record_16mb/2026-03-27_CausalScreening_SinSq_AdaptiveK/submission.json — metadata