Non-record: Mini-Hymba hybrid attention+SSM heads (arXiv:2411.13676) #1961
Open
aparna-1407 wants to merge 5 commits into openai:main from
Conversation
- Integrate configurable Hymba layer with chunked SSM scan
- Document 1-layer 800-step result and artifact metrics
- Update submission metadata and training log for non-record 16MB run
Mini-Hymba: Hybrid Attention + SSM Architecture
Track: Non-record
Base: PR #1493 (SP8192 + 3-layer recurrence + parallel residuals + QK-Gain + legal TTT)
Status: Complete (val_bpb will be updated when the full run completes)
What this does
Replaces one CausalSelfAttention module, layer 4, with a Mini-Hymba hybrid block that runs transformer attention heads and Mamba-lite SSM heads in parallel on the same input, then concatenates their outputs.
The motivation comes from Hymba: attention heads provide high-resolution token recall, while SSM heads provide efficient context summarization through a recurrent state. In this miniature version, only one layer is hybridized because the 1-layer ablation trained faster and reached much better BPB than the earlier 3-layer variant.
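For readers who want the shape of the block without opening the patch, here is a minimal, self-contained sketch of the idea rather than the actual hymba_layer.py code: the module names (SSMHeads, HybridBlock), the head counts, and the toy per-channel recurrence are assumptions for illustration, and the real patch reuses the host script's CastedLinear and Rotary modules instead of plain nn.Linear.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSMHeads(nn.Module):
    """Toy Mamba-lite heads: per-channel gated recurrence h_t = a_t * h_{t-1} + b_t * v_t.
    Sequential scan here for clarity; the chunk-parallel variant is described under Implementation."""
    def __init__(self, dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(dim, 3 * out_dim, bias=False)  # per-channel gates a, b and value v

    def forward(self, x):                                     # x: (B, T, dim)
        a, b, v = self.proj(x).chunk(3, dim=-1)
        a, b = torch.sigmoid(a), torch.sigmoid(b)
        h, outs = torch.zeros_like(v[:, 0]), []
        for t in range(x.size(1)):                            # recurrent state carries a context summary
            h = a[:, t] * h + b[:, t] * v[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)                       # (B, T, out_dim)

class HybridBlock(nn.Module):
    """Attention heads and SSM heads read the same input in parallel; outputs are concatenated."""
    def __init__(self, dim, n_attn_heads=4, n_ssm_heads=4):
        super().__init__()
        head_dim = dim // (n_attn_heads + n_ssm_heads)        # dim must divide evenly across all heads
        attn_dim, ssm_dim = n_attn_heads * head_dim, n_ssm_heads * head_dim
        self.n_attn_heads, self.head_dim = n_attn_heads, head_dim
        self.qkv = nn.Linear(dim, 3 * attn_dim, bias=False)   # the real patch uses CastedLinear + RoPE here
        self.ssm = SSMHeads(dim, ssm_dim)
        self.out_proj = nn.Linear(attn_dim + ssm_dim, dim, bias=False)

    def forward(self, x):                                     # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_attn_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # high-resolution token recall
        attn_out = attn_out.transpose(1, 2).reshape(B, T, -1)
        ssm_out = self.ssm(x)                                                # cheap recurrent context summary
        return self.out_proj(torch.cat([attn_out, ssm_out], dim=-1))         # concatenate the head groups
```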
Architecture changes
- The hybrid layer reduces the forced-to-attend burden (Hymba paper §3.2)
- The replaced attention module's redundant parameters are dropped from the artifact entirely (no dead weights in state_dict)
- All other layers run exactly as in the host script (a minimal replacement sketch follows this list)
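As an illustration of how the swap could be wired up, the sketch below assumes a GPT-style `model.transformer.h` block list and the `HYMBA_LAYERS` environment variable mentioned in the notes; `apply_hymba_patch` is a made-up helper, not necessarily how hymba_layer.py does it, but it shows why the replaced attention weights never reach the state_dict.

```python
import os
import torch.nn as nn

def apply_hymba_patch(model: nn.Module, dim: int) -> nn.Module:
    """Replace the attention module of the layers listed in HYMBA_LAYERS (default "4")
    with HybridBlock from the sketch above; every other layer is left untouched.
    Swapping before training/saving means no dead CausalSelfAttention weights persist."""
    layer_ids = [int(i) for i in os.environ.get("HYMBA_LAYERS", "4").split(",")]
    for idx in layer_ids:
        model.transformer.h[idx].attn = HybridBlock(dim)  # assumed attribute path; adjust to the host model
    return model
```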
Implementation
hymba_layer.py is a self-contained drop-in patch. It reuses CastedLinear, Rotary, and apply_rotary_emb from the host train_gpt.py script, so it avoids reimplementing RoPE or changing the compression/scoring path.

The original sequential SSM scan was stable but slow. The current implementation uses a chunk-parallel scan that reduces Python loop overhead from 1024 per-token steps to 16 chunk steps at sequence length 1024.
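A stripped-down version of that chunking trick, assuming the per-channel linear recurrence h_t = a_t * h_{t-1} + b_t * x_t from the sketch above (the function name and exact algebra are illustrative, not copied from hymba_layer.py): within each chunk the recurrence is unrolled with cumulative products, so only the chunk-boundary state is carried sequentially, i.e. 16 outer steps for a 1024-token sequence at chunk size 64.

```python
import torch

def chunked_scan(a, b, x, chunk=64):
    """Chunk-parallel evaluation of h_t = a_t * h_{t-1} + b_t * x_t.
    a, b, x: (B, T, D), with T divisible by `chunk`. Inside each chunk the scan is
    vectorized with cumulative products; only the carry state crosses chunk boundaries.
    Numerically naive (divides by cumulative decays); real kernels work in log space
    or use an associative scan."""
    B, T, D = x.shape
    a = a.view(B, T // chunk, chunk, D)
    bx = (b * x).view(B, T // chunk, chunk, D)
    A = torch.cumprod(a, dim=2)                       # A[..., t, :] = a_1 * ... * a_t within the chunk
    out = torch.empty_like(bx)
    h = torch.zeros(B, D, device=x.device, dtype=x.dtype)
    for c in range(T // chunk):                       # 16 iterations for T=1024, chunk=64
        # inside-chunk term: sum_{s<=t} (prod_{s<r<=t} a_r) * b_s x_s  =  A_t * cumsum(b_s x_s / A_s)
        inner = A[:, c] * torch.cumsum(bx[:, c] / A[:, c], dim=1)
        out[:, c] = inner + A[:, c] * h.unsqueeze(1)  # plus the decayed carry from earlier chunks
        h = out[:, c, -1]                             # last state of this chunk seeds the next one
    return out.view(B, T, D)
```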
Results
Training steps: 800
Training time: 877.4 seconds
Hardware: 1x NVIDIA RTX PRO 6000 Blackwell
Final unquantized val_loss: 2.4501
Final unquantized val_bpb: 1.4511
Final int8+zlib roundtrip val_loss: 2.49096816
Final int8+zlib roundtrip val_bpb: 1.47529166
Total int8+zlib submission size: 9,234,838 bytes
Peak allocated memory: 14,145 MiB
Artifact size:
Validation trajectory
step 0: val_bpb 4.1077
step 200: val_bpb 1.9420
step 400: val_bpb 1.6016
step 600: val_bpb 1.5044
step 800: val_bpb 1.4511
roundtrip: val_bpb 1.4753
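For context on the roundtrip rows above, the int8+zlib numbers imply a quantize-compress-dequantize loop roughly like the sketch below. This is an assumed reconstruction (per-tensor symmetric scales, compressed payload size only, helper name made up), not the repo's actual scoring code.

```python
import zlib
import torch

def int8_zlib_roundtrip(state_dict):
    """Assumed measurement: quantize each float tensor to int8 with a per-tensor
    symmetric scale, sum the zlib-compressed byte sizes (scales/metadata ignored
    here), and return dequantized weights for re-evaluating val_loss / val_bpb."""
    payload_bytes, restored = 0, {}
    for name, w in state_dict.items():
        if not torch.is_floating_point(w):            # leave integer buffers untouched
            restored[name] = w
            continue
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
        payload_bytes += len(zlib.compress(q.cpu().numpy().tobytes(), level=9))
        restored[name] = q.to(w.dtype) * scale        # weights the roundtrip metrics are computed on
    return payload_bytes, restored
```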
Notes
The earlier 3-layer version trained stably but was slower and reached worse BPB at equal wallclock. The 1-layer chunk-parallel version is the best current configuration. 3-layer results:
- model_params: 16,984,514 (fits within the 16MB artifact budget)
- Total submission size int8+zlib: 6,838,905 bytes (6.8MB, well under the limit). The smaller payload comes from fewer iterations (only 200) and from the 3-layer config's KV sharing: layers 4 and 5 omitted/reused K/V projections, which reduced the raw quantized payload a lot. The 1-layer config has no sharing opportunity, so it keeps the normal layer's K/V projections and ends up with a larger compressed payload.
- val_bpb: 3.81 post-quantization roundtrip (untrained baseline)

Layer-position ablation at 800 steps:
Layer 4 produced both the best validation BPB and the smallest quantization penalty, so all longer runs use HYMBA_LAYERS=4.

References

- Hymba: A Hybrid-head Architecture for Small Language Models (arXiv:2411.13676)