[non-record track] Asymmetric Squared Unit (ASQU): learning per-channel asymmetric activations #1035
andrewmouldon wants to merge 8 commits into openai:main from
Conversation
Community Review — [non-record track] Asymmetric Squared Unit (ASQU): learning per-channel asymmetric activations

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with ModuleNotFoundError: No module named 'tkinter'. A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

Recommendation: Could you run

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'tkinter'.
Summary
This PR introduces ASQU (Asymmetric Squared Unit), a per-channel activation that combines ReLU² with a learned PReLU-style negative branch.
Standard ReLU² suppresses negative inputs entirely, while fixed-slope LeakyReLU² uses the same negative-branch scale for every channel. ASQU instead learns a separate negative-branch scale for each feature dimension.
This gives each channel a small amount of activation-level flexibility with minimal parameter overhead.
ASQU outperforms the current strong 10-minute-track activation baseline, fixed-slope LeakyReLU², across all three seeds in the fixed-step evaluation.
It is not used in the timed-track stack because the learned β_i gradient adds an extra kernel launch, and the resulting throughput cost was not justified under the 10-minute constraint.

Motivation
Activation functions often apply the same nonlinear behavior across all channels.
In this setting, the strongest activation baseline was fixed-slope LeakyReLU², which improves over ReLU² by allowing negative inputs to contribute through a shared slope. However, that slope is still hard-coded and shared across all feature dimensions.
This assumes that all channels benefit from the same asymmetric response.
ASQU relaxes this assumption by allowing each channel to specialize its negative-branch behavior. Some channels may benefit from suppressing negative inputs, while others may benefit from responding to large inputs regardless of sign, or from allowing negative inputs to contribute with a different sign or magnitude.
Method
ASQU builds on ReLU² by adding a learned per-channel scaling parameter for the negative branch, similar in spirit to PReLU.
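Written per channel, one form consistent with the behaviors listed below is the following; this is my reading of the method, and the PR's exact parameterization may differ slightly:

ASQU(x_i) = ReLU(x_i)² + β_i · min(x_i, 0)²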
where:

- β_i is a learned parameter for channel i

This gives ASQU a continuum of activation behaviors:

- β_i ≈ 0: ReLU²-like behavior, suppressing negative inputs
- β_i > 0: magnitude-sensitive behavior, where large negative inputs can activate positively
- β_i < 0: negative inputs produce modulated negative outputs

ASQU can be viewed as a squared PReLU-style activation: ReLU² provides the squared positive branch, while the learned β_i gives each channel control over its negative response.

Pseudocode
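A minimal PyTorch sketch of a per-channel ASQU module, assuming the form above; the class name, the beta_init value of 0.5, and the channel-last layout are illustrative choices rather than the PR's exact code.

```python
import torch
import torch.nn as nn

class ASQU(nn.Module):
    """Asymmetric Squared Unit: a ReLU^2 positive branch plus a learned
    per-channel scale on the squared negative branch (PReLU-style)."""

    def __init__(self, num_channels: int, beta_init: float = 0.5):
        super().__init__()
        # One learned beta_i per feature channel.
        self.beta = nn.Parameter(torch.full((num_channels,), beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pos = torch.relu(x)   # positive part of the input
        neg = torch.relu(-x)  # magnitude of the negative part
        # beta_i ≈ 0 -> ReLU^2; beta_i > 0 -> large negative inputs activate
        # positively; beta_i < 0 -> negative inputs give negative outputs.
        return pos.square() + self.beta * neg.square()
```

With channel-last activations (e.g. batch × sequence × d_model inside an MLP block), the (num_channels,)-shaped β broadcasts over all leading dimensions.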
Setup

- One learned β_i per channel

Results
All runs use identical settings across three seeds, building on the original naive baseline.
ASQU provides a consistent improvement over both ReLU² and fixed-slope LeakyReLU².
Additional Experiments
Beta Analysis
The learned β_i values typically have a mean around 0.5, though this depends on initialization. This helps explain why fixed-slope asymmetric activations such as LeakyReLU² are already strong baselines.

However, there is substantial variation across channels. Some β_i values become moderately negative, while others grow larger than 1. This suggests that different features benefit from distinct activation behavior that a single shared slope cannot capture.
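To illustrate the kind of inspection behind this analysis, here is a short snippet for summarizing learned β values; it assumes the ASQU module sketched above and is not the analysis script used for the PR.

```python
import torch

@torch.no_grad()
def summarize_betas(model: torch.nn.Module) -> None:
    # Report per-channel beta statistics for every ASQU module in the model.
    for name, module in model.named_modules():
        beta = getattr(module, "beta", None)
        if isinstance(beta, torch.nn.Parameter):
            frac_neg = (beta < 0).float().mean().item()
            frac_gt1 = (beta > 1).float().mean().item()
            print(f"{name}: mean={beta.mean().item():.3f} "
                  f"std={beta.std().item():.3f} "
                  f"negative={frac_neg:.1%} >1={frac_gt1:.1%}")
```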
Learned Exponent

I also explored learning the activation exponent instead of fixing it to 2. This did not consistently improve final performance and introduced additional overhead, but it showed a consistent depth-dependent pattern:
This suggests that different layers may benefit from different degrees of nonlinearity, with deeper layers favoring sharper activations.
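For concreteness, one way such a learned-exponent variant could look is sketched below; the per-layer scalar exponent, the softplus parameterization that keeps it positive, and the initialization at 2.0 are my assumptions, not the exact setup explored here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedExponentASQU(nn.Module):
    """ASQU variant with a per-layer learned exponent instead of a fixed 2."""

    def __init__(self, num_channels: int, beta_init: float = 0.5, p_init: float = 2.0):
        super().__init__()
        self.beta = nn.Parameter(torch.full((num_channels,), beta_init))
        # Parameterize the exponent through softplus so it stays positive;
        # raw_p is chosen so that softplus(raw_p) == p_init at initialization.
        self.raw_p = nn.Parameter(torch.tensor(float(p_init)).expm1().log())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = F.softplus(self.raw_p)
        pos = torch.relu(x)
        neg = torch.relu(-x)
        return pos.pow(p) + self.beta * neg.pow(p)
```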
Notes on Evaluation Setting
This PR evaluates ASQU under a fixed 10k-step budget to isolate architectural effects from potential slight differences in data exposure. This gives a cleaner comparison when studying small changes such as activation functions.