
Commit 20a38ef: Merge branch 'openai:main' into main

2 parents 8dd9089 + 75700cb · 70 files changed · 23,204 additions, 0 deletions

README.md

Lines changed: 10 additions & 0 deletions

| Run | Score | Author | Summary | Date | Info |
|-----|------:|--------|---------|------|------|
| SP8192 + 3-Layer Recurrence + Parallel Residuals + Legal TTT | 1.0810 | bigbag | On PR #1493: 3-layer recurrence, parallel residuals, QK-Gain 5.25, and legal score-first TTT on the PR #1394 stack | 2026-04-09 | [info](records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/README.md) |
| SP8192 + Parallel Residuals + Score-First TTT | 1.0822 | aryanbhosale | On PR #1477: parallel residuals on the PR #1413 SP8192 + legal score-first TTT stack | 2026-04-08 | [info](records/track_10min_16mb/2026-04-08_SP8192_ParallelResid_ScoreFirstTTT/README.md) |
| SP8192 + QK-Gain 5 + Legal Score-First TTT | 1.0828 | dexhunter | On PR #1413: QK-Gain 5.0 + legal score-first TTT on the PR #1394 SP8192 stack | 2026-04-06 | [info](records/track_10min_16mb/2026-04-06_SP8192_QK5_LegalTTT_1.0828/README.md) |
| SP8192 + Parallel Residuals + Hessian-Aware SDClip | 1.0835 | Robby Sneiderman | On PR #1412: parallel residuals, Hessian-aware SDClip, and progressive recurrence on the PR #1394 stack | 2026-04-06 | [info](records/track_10min_16mb/2026-04-06_SP8192_HessianSDClip_ProgressiveRecurrence/README.md) |
| SP8192 + GPTQ Embeddings + Depth Recurrence + SDClip | 1.0856 | Kevin Clark | On PR #1394: SP8192, GPTQ embeddings, looped layers 4-5, MuonEq-R, and std-based GPTQ clipping | 2026-04-05 | [info](records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/README.md) |
| SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R | 1.0897 | aryanbhosale | On PR #1334: SP4096 + depth recurrence + parallel residuals + MuonEq-R + QK-Gain 5.0 | 2026-04-04 | [info](records/track_10min_16mb/2026-04-04_SP4096_DepthRecurrence_ParallelResid_MuonEqR/README.md) |
| MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ | 1.0912 | dexhunter | On PR #1285: MuonEq-R + layers 4-5 recurrence + higher weight decay + all-int6 GPTQ | 2026-04-03 | [info](records/track_10min_16mb/2026-04-03_MuonEqR_DepthRecurrence_WD090_AllInt6/README.md) |
| 4096-Vocab + Larger Model + High WD + Simplifications | 1.0979 | Kevin Clark | On PR #1218: SP4096 + 4x MLP + high weight decay, with TTT, hash embeddings, SmearGate, and value residuals removed | 2026-04-01 | [info](records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/README.md) |
| Parallel Residuals + Mini Depth Recurrence | 1.1063 | Marko Sisovic | On PR #1204: mini recurrence on layers 4-5 + parallel attention/MLP residual lanes + AR self-generated GPTQ calibration | 2026-03-31 | [info](records/track_10min_16mb/2026-03-31_ParallelResiduals_MiniDepthRecurrence/README.md) |
| 11L AR Self-Gen GPTQ + XSA | 1.1147 | abaybektursun | On PR #1019: Self-Generated GPTQ Calibration Data + all-layer XSA on the PR #549 stack | 2026-03-25 | [info](records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/README.md) |
| LeakyReLU² + Legal Score-First TTT + Parallel Muon | 1.1194 | abaybektursun | On PR #549: LeakyReLU(0.5)^2 + TTT + Parallel Muon on the PR #414 stack | 2026-03-23 | [info](records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/README.md) |
| 11L EMA + GPTQ-lite + warmdown3500 | 1.1228 | signalrush | On PR #374: GPTQ-lite clip search + EMA, plus warmdown3500 and QAT@0.15 | 2026-03-22 | [info](records/track_10min_16mb/2026-03-22_11L_EMA_GPTQ-lite_warmdown3500_QAT015_1.1233/README.md) |
| 11L Partial RoPE + LN Scale + EMA + XSA4 | 1.1248 | jfprincz | On PR #287: Partial RoPE (16/64) + layerwise LN scale | 2026-03-21 | [info](records/track_10min_16mb/2026-03-21_11L_XSA4_EMA_PartialRoPE_LateQAT_1.1248/README.md) |
Lines changed: 99 additions & 0 deletions
records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/README.md
# Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112

**val_bpb: 1.1147** (3-seed mean, std 0.0004) | **~15.91 MB** | 8×H100 SXM, 600s | No TTT

**This submission uses only AR (autoregressive) self-generated calibration data.** After training, the model autoregressively generates its own calibration tokens (64 seqs × 2048 tokens, temp=0.8). No val data and no train data are accessed during quantization.

**Improvement over current SOTA ([PR #549](https://github.com/openai/parameter-golf/pull/549), 1.1194 BPB):** −0.0078 nats (−0.0046 BPB)

## Results

| Seed | Steps | ms/step | Pre-quant BPB | **Sliding BPB** | Artifact (bytes) |
|------|-------|---------|---------------|-----------------|------------------|
| 314 | 6,927 | 86.6 | 1.1354 | **1.1151** | 15,863,278 |
| 42 | 6,922 | 86.7 | 1.1349 | **1.1144** | 15,984,850 |
| 999 | 6,917 | 86.8 | 1.1353 | **1.1148** | 15,876,310 |
| **Mean** | | | | **1.1147** | |

Current SOTA (PR #549, exact 3-seed mean): **1.11937967 BPB** (**1.89002068 nats**). This run's exact 3-seed mean is **1.11473509 BPB** (**1.88217853 nats**). Delta: **−0.00784215 nats** (**−0.00464458 BPB**).

Using the exact per-seed scores from the PR #549 logs (`1.11922988`, `1.12002032`, `1.11888882`) and this run (`1.11508120`, `1.11437394`, `1.11475014`), Welch's t-test gives **t = -11.83**, **df ≈ 3.31**.
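For reference, the t-statistic and the Welch-Satterthwaite degrees of freedom can be recomputed from the per-seed scores quoted above; a minimal NumPy/SciPy sketch (not part of the submission code):

```python
# Recompute the Welch's t-test quoted above from the per-seed BPB scores.
# scipy's ttest_ind(equal_var=False) is Welch's test; the Welch-Satterthwaite
# degrees of freedom are computed by hand since ttest_ind does not return them.
import numpy as np
from scipy import stats

pr549 = np.array([1.11922988, 1.12002032, 1.11888882])  # PR #549 per-seed BPB
ours = np.array([1.11508120, 1.11437394, 1.11475014])   # this run, per-seed BPB

t, p = stats.ttest_ind(ours, pr549, equal_var=False)

v1, n1 = pr549.var(ddof=1), len(pr549)
v2, n2 = ours.var(ddof=1), len(ours)
df = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

print(f"t = {t:.2f}, df = {df:.2f}")  # expected: t = -11.83, df = 3.31
```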
---

## Main Changes

The comparison baseline is [PR #549](https://github.com/openai/parameter-golf/pull/549), the current legal leaderboard entry at **1.1194 BPB**. The implementation lineage is closer to [PR #609](https://github.com/openai/parameter-golf/pull/609): this run keeps the XSA-all + Full GPTQ + selective-pruning stack, but uses AR self-generated GPTQ calibration (no external data), bumps BigramHash to **3072 × 112**, and uses `lzma preset=9`.

### 1. AR Self-Generated Full Hessian GPTQ

PR #549 used GPTQ-lite (diagonal Hessian approximation). We use Full Hessian GPTQ with Cholesky error compensation and column reordering, a strictly better quantizer.
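As background, the inner loop of such a quantizer looks roughly like the sketch below. This is a didactic version with a single symmetric per-tensor scale and the simplest diagonal-magnitude reordering; the repo's actual implementation (from the PRs cited above) will differ in details:

```python
import torch

def gptq_quantize(W, H, n_bits=6, damp=0.01):
    """W: (rows, cols) weights; H: (cols, cols) Hessian X^T X from calibration."""
    cols = W.shape[1]
    # Column reordering: quantize the most sensitive columns (largest diag(H)) first.
    order = torch.argsort(torch.diag(H), descending=True)
    W = W[:, order].clone()
    H = H[order][:, order].clone()
    # Dampen the Hessian so the Cholesky factorization is well-conditioned.
    H += damp * torch.diag(H).mean() * torch.eye(cols, device=H.device, dtype=H.dtype)
    # Upper Cholesky factor of H^{-1} drives the error compensation.
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)
    # Single symmetric per-tensor scale for int6 (range [-32, 31]).
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().max() / qmax
    Q = torch.zeros_like(W)
    for j in range(cols):
        q = torch.clamp(torch.round(W[:, j] / scale), -qmax - 1, qmax)
        Q[:, j] = q
        # Propagate this column's quantization error to the unquantized columns.
        err = (W[:, j] - q * scale) / U[j, j]
        W[:, j + 1:] -= err.unsqueeze(1) * U[j, j + 1:].unsqueeze(0)
    inv = torch.argsort(order)  # undo the reordering
    return (Q * scale)[:, inv]
```

With a diagonal H the inverse-Hessian Cholesky factor is diagonal, the cross-column update vanishes, and the loop degenerates to independent rounding; that is the sense in which the full-Hessian version is strictly more general than GPTQ-lite.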
The calibration problem: prior Full Hessian GPTQ implementations (PRs #535, #569, #593, #609) calibrated on training data, which was ruled illegal after the 600s window. We solve this by having the model generate its own calibration data. After training completes, the model autoregressively generates 64 sequences of 2048 tokens (temperature=0.8, fixed seed). Hessians H = X^T X are collected from these self-generated sequences. No val data and no train data are accessed during quantization.
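A minimal sketch of the two steps, under assumptions: `model` is the trained decoder-only LM returning logits of shape (batch, seq, vocab), `layer_names` are the linear layers to be quantized, and the naive token-by-token loop (no KV cache) stands in for whatever faster sampler the script actually uses:

```python
import torch

@torch.no_grad()
def self_generate(model, n_seqs=64, seq_len=2048, temp=0.8, seed=314, bos=0):
    """Sample calibration sequences from the trained model itself."""
    torch.manual_seed(seed)  # fixed seed, per the record
    seqs = torch.full((n_seqs, 1), bos, dtype=torch.long, device="cuda")
    for _ in range(seq_len - 1):
        logits = model(seqs)[:, -1, :] / temp  # last-position logits at temp 0.8
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        seqs = torch.cat([seqs, nxt], dim=1)
    return seqs

@torch.no_grad()
def collect_hessians(model, layer_names, calib_seqs):
    """Accumulate H = X^T X over each target layer's inputs on the calibration set."""
    hessians = {name: None for name in layer_names}
    def make_hook(name):
        def hook(_module, inputs, _output):
            x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()  # (tokens, d_in)
            h = x.T @ x
            hessians[name] = h if hessians[name] is None else hessians[name] + h
        return hook
    handles = [model.get_submodule(n).register_forward_hook(make_hook(n))
               for n in layer_names]
    for batch in calib_seqs.split(8):  # small batches to bound activation memory
        model(batch)
    for h in handles:
        h.remove()
    return hessians
```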
### 2. BigramHash 3072 × dim=112 (up from 1536)

Lineage: [PR #549](https://github.com/openai/parameter-golf/pull/549) (1536) → [PR #609](https://github.com/openai/parameter-golf/pull/609) (2048) → this run (**3072 × dim=112**). This fits under 16MB; going wider increased artifact pressure past the break-even point.
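As a rough illustration of the concept (our reading of the #162 idea): hash each (previous token, current token) pair into a small table and add the looked-up vector into the residual stream. The hash constant, the projection to the model width, and the combination rule are illustrative guesses, not the repo's code:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash (prev, cur) token pairs into a small embedding table (sketch)."""
    def __init__(self, n_buckets=3072, dim=112, d_model=512):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, dim)
        self.proj = nn.Linear(dim, d_model, bias=False)  # lift 112 -> 512

    def forward(self, tokens):                  # tokens: (batch, seq) int64
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0                          # no real bigram at position 0
        # Cheap multiplicative hash of the (prev, cur) pair into a bucket.
        idx = (prev * 1000003 + tokens) % self.n_buckets
        return self.proj(self.table(idx))       # (batch, seq, d_model)
```

Under these assumptions the table alone is 3072 × 112 ≈ 344K parameters, which is where the artifact pressure at wider settings comes from.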
### 3. XSA on all 11 layers (up from last 4)

PR #549 applied XSA to the last 4 layers. Extending it to all 11 layers forces cross-position information mixing from layer 0 onward, at zero parameter cost. Source: [PR #478](https://github.com/openai/parameter-golf/pull/478) by @gowtham0992.

### Dropped: TTT

PR #549 used Legal Score-First TTT for −0.0025 BPB. On this stack, TTT is neutral or negative (25 failed attempts across two stacks; see our [PR #756](https://github.com/openai/parameter-golf/pull/756)). The Full Hessian GPTQ improvement more than compensates for dropping TTT.
---

## Architecture

| Component | Setting | First introduced by |
|-----------|---------|---------------------|
| Layers | 11 (512d, 8 GQA heads, 4 KV heads) | Baseline |
| MLP | 3× (1536) with LeakyReLU(0.5)² | [#493](https://github.com/openai/parameter-golf/pull/493) @parinzee |
| Attention | XSA on all 11 layers | [#478](https://github.com/openai/parameter-golf/pull/478) @gowtham0992 |
| BigramHash | **3072 × dim=112** | **This work** (concept: [#162](https://github.com/openai/parameter-golf/pull/162) @raahilshah) |
| RoPE | Partial (16/64 dims) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
| LN Scale | 1/√(layer+1) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
| VE128 | Layers 9-10 | [#374](https://github.com/openai/parameter-golf/pull/374) @unnir |
| SmearGate | Position-mixing gate | [#65](https://github.com/openai/parameter-golf/pull/65) @aquariouseworkman |
| U-Net skips | Encoder-decoder connections | [#289](https://github.com/openai/parameter-golf/pull/289) |
| Weight avg | EMA(0.997) + Tight SWA (every 50) | [#401](https://github.com/openai/parameter-golf/pull/401) @newjordan |
| Quantization | **Full Hessian GPTQ int6 (AR self-gen calibration)** | **This work** (GPTQ: [#535](https://github.com/openai/parameter-golf/pull/535) @raahilshah) |
| Compression | LZMA preset=9 | [#160](https://github.com/openai/parameter-golf/pull/160) @ChaseWNorton |
| Warmdown | 4000 iterations | [#364](https://github.com/openai/parameter-golf/pull/364) @shikhar1729 |
| Optimizer | **Parallel Muon + Parameter Banking** | **[#399](https://github.com/openai/parameter-golf/pull/399) @abaybektursun** |
| Late QAT | STE at LR scale < 0.15 | [#286](https://github.com/openai/parameter-golf/pull/286) @chris-buckley |
| Selective pruning | ±1 values by reconstruction error | [#609](https://github.com/openai/parameter-golf/pull/609) @saml212 |
| Flash Attention 3 | Hopper warp-specialized kernels | [#122](https://github.com/openai/parameter-golf/pull/122) @mtybadger |
## Requirements

**Flash Attention 3 (Hopper) is required.** The script imports `flash_attn_interface` directly and was run with PyTorch 2.9.1+cu128.

```bash
pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install sentencepiece zstandard
python3 -c "from flash_attn_interface import flash_attn_func; import sentencepiece, zstandard; print('deps OK')"
```
## Run Command

```bash
BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 SEED=314 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
## Lineage

```
PR #549 (Legal SOTA, 1.1194) — our Parallel Muon base with LeakyReLU² + legal TTT
└── This work adds:
    ├── AR self-gen GPTQ calibration (no external data during quantization)
    ├── BigramHash 3072 × 112 (wider setting that still fits under 16MB)
    ├── XSA-all (from #478/@gowtham0992, applied via #609/@saml212)
    ├── Selective ±1 pruning (from #609/@saml212)
    ├── warmdown=4000, LZMA=9 (from #364/@shikhar1729, #160/@ChaseWNorton)
    └── Guided by PR #670 negative results (30+ failed experiments)
```
Lines changed: 3 additions & 0 deletions
# FlashAttention 3 must be installed separately; see README.md
sentencepiece
zstandard
Lines changed: 54 additions & 0 deletions
{
  "author": "abaybektursun",
  "github_id": "abaybektursun",
  "name": "AR Self-Gen GPTQ + XSA-all + BigramHash 3072x112",
  "blurb": "11L XSA-all + Full Hessian GPTQ with autoregressive self-generated calibration (no val/train data accessed during quantization) + selective-pruning stack. BigramHash(3072,112), warmdown=4000, lzma preset=9. 3-seed exact mean: 1.11473509 BPB / 1.88217853 nats, beating PR549's exact 3-seed mean 1.11937967 BPB / 1.89002068 nats by 0.00784215 nats (Welch t=-11.83, df=3.31).",
  "date": "2026-03-25",
  "track": "10min_16mb",
  "val_loss": 1.88217853,
  "val_bpb": 1.11473509,
  "val_loss_std": 0.00059750,
  "val_bpb_std": 0.00035387,
  "seeds": [314, 42, 999],
  "seed_results": {
    "314": {
      "val_loss": 1.88276292,
      "val_bpb": 1.11508120,
      "artifact_bytes": 15863278,
      "steps": 6927,
      "step_avg_ms": 86.6
    },
    "42": {
      "val_loss": 1.88156874,
      "val_bpb": 1.11437394,
      "artifact_bytes": 15984850,
      "steps": 6922,
      "step_avg_ms": 86.7
    },
    "999": {
      "val_loss": 1.88220393,
      "val_bpb": 1.11475014,
      "artifact_bytes": 15876310,
      "steps": 6917,
      "step_avg_ms": 86.8
    }
  },
  "comparison_baseline_pr": 549,
  "implementation_lineage_pr": 609,
  "negative_results_pr": 670,
  "delta_vs_pr549_nats": -0.00784215,
  "delta_vs_pr549_bpb": -0.00464458,
  "t_statistic": -11.8339,
  "welch_df": 3.3063,
  "artifact_bytes_mean": 15908146,
  "artifact_bytes_max": 15984850,
  "bytes_total": 15984850,
  "train_steps_mean": 6922.00,
  "step_avg_ms_mean": 86.69,
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.9.1+cu128",
  "cuda_version": "12.8",
  "flash_attn_version": "2.8.3 (FA3 Hopper kernels)",
  "calibration": "AR self-generated (64 seqs x 2048 tokens, temp=0.8, no external data)",
  "technique_summary": "AR self-gen GPTQ calibration + XSA-all + BigramHash 3072x112 + Parallel Muon + LZMA9"
}
