
Commit d1a1c79

CiprianFlorin-Ifrim and Ciprian-Florin Ifrim authored
Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary Asymmetric U-Net + NeoMuon + 4xrelu²MLP + Smear + Fact Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8 + Bit-packing LZMA + Stride-16 Eval - 2h (openai#641)
* Notable Non-Record Submission: 1.1239 BPB - 106.2M Binary U-Net (15L 768d 8192BPE relu² 4xMLP FP8 SmearGate, 50k steps)
* Updated README.md for Non-record submission.

Co-authored-by: Ciprian-Florin Ifrim <ciprian-florin.ifrim@Ciprians-Mac-Studio-M1-Max.local>
1 parent ba48bf3 commit d1a1c79

10 files changed: 12,694 additions & 0 deletions (README shown below: 162 lines)
# Notable Non-Record Submission: 1.1239 BPB — 106.2M Asymmetric Binary U-Net Transformer

**1-bit Quantisation + 15L (7 Encoder - 8 Decoder) + NeoMuon + 4x relu² MLP + SmearGate + Factored Tied Embedding + Poly5 Softcap + YaRN 2048 + 8192 BPE + FP8 QAT + LZMA + Stride-16 Sliding Eval**

**val_bpb: 1.1239** (sliding, seed=42) | **15.67 MB** artifact | 8×H100 SXM, 50k steps (~2.15h)

> **This is a non-record submission** — training exceeds the 10-minute wallclock constraint (50,000 steps / ~2.15 hours). Submitted to demonstrate the compression frontier: 106.2M parameters in 15.67 MB via 1-bit quantisation. Over 120M parameters would be possible with FP4 (implemented), at a worse bpb. Full experiment log: [RESULTS.md](RESULTS.md). Complete training logs: [logs/](https://github.com/CiprianFlorin-Ifrim/openai-parameter-golf-submission/tree/main/logs/cuda).

## Results (seed=42, 8×H100 SXM)

| Metric | Value |
|--------|-------|
| Sliding BPB (stride=16) | **1.1239** |
| val_bpb | 1.1497 |
| RT bpb | 1.1516 |
| Steps | 50,000 |
| ms/step | 155.3 |
| Training time | 7,763s (~2.15h) |
| optimal_T | 0.90 |
| Artifact | 15,670,651 bytes (15.67MB) |
| Parameters | 106,154,616 |

### Comparison to Ternary Submission

Binary reaches better absolute quality but requires roughly 13x more training time. Within the 10-minute budget, binary's best run that fits (14L, 4,820 steps) scores 1.1824 sliding — 0.025 bpb worse than ternary (my previous record PR). Within that budget, ternary's zero state is worth more than binary's 60% parameter-density advantage.

The results document linked above (and in my repo) covers all methods and sweeps applied to both the Binary and Ternary BitNets, which are unfortunately incompatible with many techniques, such as Tversky layers, EMA, Muon weight decay, LM logit-head ranking, and more.

## Architecture

- 15 transformer layers, dim=768, 8 heads, 4 KV heads (GQA), head_dim=96
- Binary quantisation: weights {-1, +1}, 1 bit/param, per-group (128) absmean scaling (see the sketch after this list)
- 4x MLP expansion (hidden=3072) with **relu²** activation, fused gate+up projection
- U-Net encoder/decoder with learned skip weights (ones-init) and per-block residual mix from input embedding
- **SmearGate:** causal cumulative mean blending with learned tanh gate, zero-init for safe residual start
- Factored tied embedding: 8192×254 bottleneck with learned projections
- Polynomial softcap (degree 5, cap=10) with Z-loss regularisation (1e-4)
- YaRN positional encoding (max_len=2048, ROPE_BASE=5000)
- Fused QKV projection
- FlashAttention-3 (Hopper native kernels)
- 106.2M parameters, 15.67MB artifact (97.3M binary + 2.5M fp8 + 70KB code)

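For reference, a minimal PyTorch sketch of the per-group absmean binarisation described in the quantisation bullet above. The function name and exact details are illustrative assumptions, not code from `train_gpt_cuda_binary.py`:

```python
import torch

def binary_quantize_ste(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Per-group absmean binarisation with a straight-through estimator (sketch).

    Each group of 128 weights is mapped to sign(w) * mean(|w|), so only
    1 bit/param plus one scale per group needs to be stored. Assumes the
    tensor size is a multiple of group_size (padding omitted).
    """
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                    # (num_groups, 128)
    scale = g.abs().mean(dim=1, keepdim=True)        # per-group absmean scale
    q = torch.sign(g)
    q = torch.where(q == 0, torch.ones_like(q), q)   # no zero state: {-1, +1} only
    w_q = (q * scale).reshape(orig_shape)
    # Straight-through estimator: forward uses w_q, gradients flow to the latent w.
    return w + (w_q - w).detach()
```

At save time only the sign bits (1 bit/param) and the per-group scales need to be serialised, which is what lets a 106.2M-parameter model fit in a 15.67 MB artifact.
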
## Key Techniques

### Architecture
- **Binary quantisation:** 1 bit/param packs 60% more parameters per MB than ternary (1.6 bits/param), allowing 15 layers vs 10 within a similar budget
- **4x relu² MLP:** relu² strictly dominates relu; 4x width outperforms 3x even with fewer layers at a matched budget
- **SmearGate:** blends each position with the causal cumulative mean; adds 22ms/step overhead but provides -0.007 bpb at scale. Viable here because the run is not wallclock-constrained (sketched after this list)

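A minimal sketch of the SmearGate idea, assuming the straightforward reading of the bullet above: each position is blended toward the causal cumulative mean through a learned, zero-initialised tanh gate. The module name and exact formulation are illustrative, not the repo's implementation:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token with the causal cumulative mean of the sequence (sketch).

    Zero-initialised gate => tanh(0) = 0, so the module starts as an identity
    mapping and is safe to insert into a residual stream.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # zero-init: starts as identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        counts = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype).view(1, -1, 1)
        causal_mean = x.cumsum(dim=1) / counts       # mean over positions <= t
        g = torch.tanh(self.gate)                    # learned per-channel gate in (-1, 1)
        return x + g * (causal_mean - x)             # blend toward the "smear"
```
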
### Training
- **NeoMuon** optimizer with 3 Newton-Schulz steps (see the sketch after this list)
- **50,000 steps unconstrained:** binary converges more slowly than ternary (my other submission, #640); at 4,000 steps (the 10-minute equivalent) binary lags by 0.025 bpb. Extended training closes the gap and surpasses ternary, showing that with "unlimited compute" these models can be quite powerful
- **524k batch tokens:** TRAIN_BATCH_TOKENS=524288 per step (see the full run command below)

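For context, the quintic Newton-Schulz iteration used by the public Muon optimizer to approximately orthogonalise each 2D update is sketched below; `MUON_BACKEND_STEPS=3` in the run command corresponds to `steps=3`. Whether NeoMuon uses exactly these coefficients is an assumption:

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 3) -> torch.Tensor:
    """Approximately orthogonalise a 2D update matrix, Muon-style (sketch).

    Quintic Newton-Schulz iteration with the coefficients from the public
    Muon implementation. Input is normalised so its spectral radius is <= 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # Frobenius norm bounds the spectral norm
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                          # work with the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

In Muon-style optimizers this orthogonalisation is applied to the momentum buffer of each 2D weight before the update step.
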
### Evaluation
- **Temperature scaling (T=0.90):** temperature auto-calibrated via a grid search
- **Sliding window (stride=16):** evaluation protocol (sketched after this list)

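A minimal sketch of a stride-16 sliding-window bits-per-byte evaluation with temperature scaling, assuming the usual protocol: each window conditions on the full left context, only the newest `stride` tokens are scored, and the summed negative log-likelihood is converted from nats to bits per byte. Names, batching (the real run uses SLIDING_BATCH_SIZE=256) and first-window edge handling are simplified assumptions:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_bpb(model, tokens: torch.Tensor, n_bytes: int,
                seq_len: int = 1024, stride: int = 16,
                temperature: float = 0.90) -> float:
    """Stride-16 sliding-window bits-per-byte (illustrative sketch).

    tokens:  1D tensor of BPE token ids for the validation text
    n_bytes: number of UTF-8 bytes of that text (bpb denominator)
    """
    total_nll = 0.0                                            # summed NLL in nats
    for end in range(seq_len, tokens.numel() + 1, stride):
        window = tokens[end - seq_len:end]                     # full left context
        logits = model(window[:-1].unsqueeze(0)) / temperature # (1, seq_len-1, vocab)
        targets = window[1:].unsqueeze(0)
        nll = F.cross_entropy(logits[0], targets[0], reduction="none")
        total_nll += nll[-stride:].sum().item()                # score only the newest tokens
    return total_nll / (math.log(2) * n_bytes)                 # nats -> bits, per byte
```
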
### Compression
- **Bit-packing + LZMA (preset=9):** binary weights pack at exactly 1 bit/param before LZMA entropy coding (see the sketch after this list)
- **FP8 QAT (e4m3):** for non-binary parameters. The roundtrip is clean: binary has no zero state, so `mean(|Q|)=1.0` always and no shrinkage correction is needed
- **No EMA:** despite the clean binary roundtrip math, EMA still hurts quality by 0.03 bpb in practice

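A minimal sketch of the bit-packing plus LZMA step, using NumPy and Python's standard `lzma` module; the function name and serialisation layout are illustrative assumptions:

```python
import lzma
import numpy as np

def pack_binary_weights(w: np.ndarray) -> bytes:
    """Pack {-1, +1} weights at 1 bit/param, then LZMA-compress (preset=9).

    Maps -1 -> 0 and +1 -> 1, packs 8 weights per byte with np.packbits,
    and applies LZMA entropy coding on top. Per-group scales and the
    non-binary FP8 tensors would be serialised separately.
    """
    bits = (w.reshape(-1) > 0).astype(np.uint8)   # {-1, +1} -> {0, 1}
    packed = np.packbits(bits)                    # 8 params per byte
    return lzma.compress(packed.tobytes(), preset=9)

# Illustrative roundtrip:
# raw = np.unpackbits(np.frombuffer(lzma.decompress(blob), dtype=np.uint8))
# w = raw.astype(np.float32) * 2 - 1
```
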
## Setup and Run

```bash
# Environment setup (conda + Python 3.13 + PyTorch + FlashAttention-3 + Triton + dataset)
bash setup.sh

# Activate and run
conda activate golf
SEED=42 bash run_cuda_binary.sh
```

<details>
<summary>Full run command</summary>

```bash
RUN_ID=binary_run \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
ATTN_PROJ_TYPE=standard \
LOGIT_HEAD_TYPE=standard \
TVERSKY_MEMBERSHIP=sigmoid \
TVERSKY_NUM_FEATURES=0 \
TVERSKY_FEATURE_POOLS=0 \
VOCAB_SIZE=8192 \
BITNET_GROUP_SIZE=128 \
BIGRAM_HASH=0 \
EMBED_DIM=254 \
TRAINING_DEPTH_RECURRENCE=0 \
EVAL_DEPTH_RECURRENCE=0 \
NUM_LAYERS=15 \
MODEL_DIM=768 \
NUM_KV_HEADS=4 \
NUM_HEADS=8 \
DIFF_ATTN=0 \
MLP_MULT=4 \
MLP_GROUPS=0 \
MATRIX_OPTIMIZER=muon \
ADAM_LR=0.05 \
ADAM_WD=0.05 \
MUON_BACKEND_STEPS=3 \
MUON_MOMENTUM=0.95 \
MUON_MOMENTUM_WARMUP_START=0.85 \
MUON_MOMENTUM_WARMUP_STEPS=500 \
MUON_WD=0.0 \
MATRIX_LR=0.04 \
SCALAR_LR=0.02 \
TIED_EMBED_LR=0.02 \
WARMDOWN_FRACTION=0.2 \
LOGIT_SOFTCAP=10 \
QK_GAIN_INIT=2.25 \
ROPE_TYPE=yarn \
YARN_MAX_LEN=2048 \
ROPE_BASE=5000 \
BATCH_TOKENS_START=0 \
BATCH_SCHEDULE_FRACTION=0.33 \
TRAIN_BATCH_TOKENS=524288 \
SEQ_LEN_START=0 \
SEQ_SCHEDULE_FRACTION=0.0 \
TRAIN_SEQ_LEN=1024 \
SMEAR=1 \
ITERATIONS=50000 \
WARMUP_STEPS=5 \
MAX_WALLCLOCK_SECONDS=0 \
VAL_LOSS_EVERY=0 \
TRAIN_LOG_EVERY=500 \
CHURN_LOG_EVERY=1000 \
VAL_MAX_TOKENS=0 \
TIE_EMBEDDINGS=1 \
UNTIE_AT_FRACTION=0.00 \
HEAD_LR=0.02 \
CORR_WEIGHT_LR=0.02 \
ACTIVATION=relu2 \
SOFTCAP_TYPE=poly \
MTP_HEADS=0 \
REFINER=0 \
REFINER_KERNEL=3 \
SLIDING_EVAL=1 \
SLIDING_EVAL_STRIDE=16 \
SLIDING_BATCH_SIZE=256 \
TEMP_SCALING=1 \
FP_STORAGE=FP8 \
EMA=0 \
EMA_DECAY=0.995 \
EMA_START_FRACTION=0.5 \
SEED=42 \
COMPILE_MODE=default \
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 train_gpt_cuda_binary.py
```

</details>

## Compliance

- [x] Artifact <=16,000,000 bytes (15,670,651)
- [x] Sliding window eval stride=16
- [x] No test-time training on validation data
- [x] No network calls during evaluation
- [x] No external compute
- [x] Train time: **non-record submission** (7,763s / ~2.2h / 50,000 steps)
