[non-record track] Asymmetric Squared Unit (ASQU): learning per-channel asymmetric activations #1035
andrewmouldon wants to merge 8 commits into openai:main from
Conversation
Community Review — [non-record track] Asymmetric Squared Unit (ASQU): learning per-channel asymmetric activations

Compliance: NEEDS AUTHOR ACTION

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with ModuleNotFoundError: No module named 'tkinter'. A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

Recommendation: Could you run

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka — The Agora.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'tkinter'.
Summary
This PR introduces ASQU (Asymmetric Squared Unit), a per-channel activation that combines ReLU² with a learned PReLU-style negative branch.
Standard ReLU² suppresses negative inputs entirely, while fixed-slope LeakyReLU² uses the same negative-branch scale for every channel. ASQU instead learns a separate negative-branch scale for each feature dimension.
This gives each channel a small amount of activation-level flexibility with minimal parameter overhead.
ASQU outperforms the current strong 10-minute-track activation baseline, fixed-slope LeakyReLU², across all three seeds in the fixed-step evaluation.
It is not used in the timed-track stack because the learned β_i gradient adds an extra kernel launch, and the resulting throughput cost was not justified under the 10-minute constraint.

Motivation
Activation functions often apply the same nonlinear behavior across all channels.
In this setting, the strongest activation baseline was fixed-slope LeakyReLU², which improves over ReLU² by allowing negative inputs to contribute through a shared slope. However, that slope is still hard-coded and shared across all feature dimensions.
This assumes that all channels benefit from the same asymmetric response.
ASQU relaxes this assumption by allowing each channel to specialize its negative-branch behavior. Some channels may benefit from suppressing negative inputs, while others may benefit from responding to large inputs regardless of sign, or from allowing negative inputs to contribute with a different sign or magnitude.
Method
ASQU builds on ReLU² by adding a learned per-channel scaling parameter for the negative branch, similar in spirit to PReLU.
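Written per channel, one form consistent with the behaviors listed below is the following; this is my reading of the method, and the PR's exact parameterization may differ slightly:

ASQU(x_i) = ReLU(x_i)² + β_i · min(x_i, 0)²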
where:

- β_i is a learned parameter for channel i

This gives ASQU a continuum of activation behaviors:

- β_i ≈ 0: ReLU²-like behavior, suppressing negative inputs
- β_i > 0: magnitude-sensitive behavior, where large negative inputs can activate positively
- β_i < 0: negative inputs produce modulated negative outputs

ASQU can be viewed as a squared PReLU-style activation: ReLU² provides the squared positive branch, while the learned β_i gives each channel control over its negative response.

Pseudocode
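A minimal PyTorch sketch of a per-channel ASQU module, assuming the form above; the class name, the beta_init value of 0.5, and the channel-last layout are illustrative choices rather than the PR's exact code.

```python
import torch
import torch.nn as nn

class ASQU(nn.Module):
    """Asymmetric Squared Unit: a ReLU^2 positive branch plus a learned
    per-channel scale on the squared negative branch (PReLU-style)."""

    def __init__(self, num_channels: int, beta_init: float = 0.5):
        super().__init__()
        # One learned beta_i per feature channel.
        self.beta = nn.Parameter(torch.full((num_channels,), beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pos = torch.relu(x)   # positive part of the input
        neg = torch.relu(-x)  # magnitude of the negative part
        # beta_i ≈ 0 -> ReLU^2; beta_i > 0 -> large negative inputs activate
        # positively; beta_i < 0 -> negative inputs give negative outputs.
        return pos.square() + self.beta * neg.square()
```

With channel-last activations (e.g. batch × sequence × d_model inside an MLP block), the (num_channels,)-shaped β broadcasts over all leading dimensions.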
Setup

- One learned β_i per channel

Results
All runs use identical settings across three seeds, building on the original naive baseline.
ASQU provides a consistent improvement over both ReLU² and fixed-slope LeakyReLU².
Additional Experiments
Beta Analysis
The learned β_i values typically have a mean around 0.5, though this depends on initialization. This helps explain why fixed-slope asymmetric activations such as LeakyReLU² are already strong baselines.

However, there is substantial variation across channels. Some β_i values become moderately negative, while others grow larger than 1. This suggests that different features benefit from distinct activation behavior that a single shared slope cannot capture.
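To illustrate the kind of inspection behind this analysis, here is a short snippet for summarizing learned β values; it assumes the ASQU module sketched above and is not the analysis script used for the PR.

```python
import torch

@torch.no_grad()
def summarize_betas(model: torch.nn.Module) -> None:
    # Report per-channel beta statistics for every ASQU module in the model.
    for name, module in model.named_modules():
        beta = getattr(module, "beta", None)
        if isinstance(beta, torch.nn.Parameter):
            frac_neg = (beta < 0).float().mean().item()
            frac_gt1 = (beta > 1).float().mean().item()
            print(f"{name}: mean={beta.mean().item():.3f} "
                  f"std={beta.std().item():.3f} "
                  f"negative={frac_neg:.1%} >1={frac_gt1:.1%}")
```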
Learned Exponent

I also explored learning the activation exponent instead of fixing it to 2. This did not consistently improve final performance and introduced additional overhead, but it showed a consistent depth-dependent pattern:
This suggests that different layers may benefit from different degrees of nonlinearity, with deeper layers favoring sharper activations.
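For concreteness, one way such a learned-exponent variant could look is sketched below; the per-layer scalar exponent, the softplus parameterization that keeps it positive, and the initialization at 2.0 are my assumptions, not the exact setup explored here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedExponentASQU(nn.Module):
    """ASQU variant with a per-layer learned exponent instead of a fixed 2."""

    def __init__(self, num_channels: int, beta_init: float = 0.5, p_init: float = 2.0):
        super().__init__()
        self.beta = nn.Parameter(torch.full((num_channels,), beta_init))
        # Parameterize the exponent through softplus so it stays positive;
        # raw_p is chosen so that softplus(raw_p) == p_init at initialization.
        self.raw_p = nn.Parameter(torch.tensor(float(p_init)).expm1().log())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = F.softplus(self.raw_p)
        pos = torch.relu(x)
        neg = torch.relu(-x)
        return pos.pow(p) + self.beta * neg.pow(p)
```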
Notes on Evaluation Setting
This PR evaluates ASQU under a fixed 10k-step budget to isolate architectural effects from potential slight differences in data exposure. This gives a cleaner comparison when studying small changes such as activation functions.