@@ -0,0 +1,78 @@
# Non-record: 11L FullGPTQ + XSA-all + BigramHash 3072×112

**Track**: 10min_16mb | **Author**: AVINASH0052 | **Date**: 2026-04-08

## Results

| Seed | val_bpb | val_loss | artifact_bytes |
|------|---------|----------|----------------|
| 1337 | **1.11564047** | 1.88370722 | 15,832,508 |

- Post-EMA (before GPTQ): val_bpb 1.1350
- After int6 GPTQ round-trip: val_bpb 1.13972881
- With sliding-window exact eval (stride=64): **val_bpb 1.11564047** (headline number; see the bits-per-byte note below)
- Steps: 6891 | Avg step time: 87.08ms | FA3: True
- Hardware: 8×H100 80GB SXM
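
The relationship between val_loss (mean per-token cross-entropy in nats) and val_bpb is the standard bits-per-byte normalization. The snippet below is a minimal sketch of that conversion and of the bytes-per-token ratio the headline pair implies, assuming this is how the eval harness defines bpb.

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a corpus to bits per byte."""
    return total_nats / (math.log(2) * total_bytes)

# Sanity check against the table above, assuming val_loss is the mean
# per-token loss in nats and the usual bpb definition applies:
# 1.88370722 / (ln 2 * bytes_per_token) = 1.11564047
bytes_per_token = 1.88370722 / (math.log(2) * 1.11564047)
print(f"implied bytes/token ≈ {bytes_per_token:.2f}")  # ≈ 2.44
```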

## Architecture

| Component | Detail |
|-----------|--------|
| Layers / dim | 11L, 512d |
| Attention heads | 8H query / 4KV (GQA) |
| MLP | 3× expansion (1536 hidden), LeakyReLU(0.5)² |
| XSA | All 11 layers — drops self-value projection |
| Hash Embedding | BigramHash 3072×112 |
| Pos Encoding | Partial RoPE (16 of 64 head dims) |
| Skip Connections | U-Net style: layers 0↔10, 1↔9, 2↔8 |
| Value Embed | VE128 re-injection at layers 9, 10 |
| LN Scaling | 1/√(L+1) per layer — deeper layers see smaller-norm inputs |
| SmearGate | Learned position mixing gate on embedding |
| Logit softcap | 30.0 |
| Tied embeddings | Token embedding = LM head (transposed) |
| Total params | ~27M |
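
The BigramHash row is a hashed bigram embedding: the (previous token, current token) pair is hashed into 3072 buckets, each holding a 112-dim vector. The exact hash and how the 112-dim lookup enters the 512-dim residual stream are not spelled out here, so the sketch below is illustrative only: a fixed multiplicative hash plus a linear projection added on top of the token embedding.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hashed bigram embedding: a sketch, not the submission's exact code.

    Assumptions: (prev_token, token) pairs are hashed into 3072 buckets with a
    fixed multiplicative hash, and the 112-dim lookup is projected up to the
    model dim and added to the token embedding.
    """
    def __init__(self, n_buckets: int = 3072, hash_dim: int = 112, model_dim: int = 512):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, hash_dim)
        self.proj = nn.Linear(hash_dim, model_dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64 token ids
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no previous token at position 0
        # cheap deterministic hash of the bigram into the bucket range
        h = (prev * 1000003 + tokens * 8191) % self.n_buckets
        return self.proj(self.table(h))

# usage: x = tok_embed(tokens) + bigram_hash(tokens)
```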

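Two of the simpler rows above, sketched for concreteness: the logit softcap is the usual tanh squashing with cap 30.0, and one plausible reading of the 1/√(L+1) LN scaling (an assumption, not confirmed by the table) is shrinking each layer's normalized input by that factor.

```python
import torch
import torch.nn.functional as F

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Tanh logit softcap: values are squashed smoothly into (-cap, cap),
    bounding gradients from rare, very confident predictions."""
    return cap * torch.tanh(logits / cap)

def scaled_norm_input(x: torch.Tensor, layer_idx: int) -> torch.Tensor:
    # One reading of the "LN Scaling 1/sqrt(L+1)" row: deeper layers receive a
    # smaller-norm normalized input (illustrative assumption, not the submission's code).
    return F.rms_norm(x, (x.shape[-1],)) / (layer_idx + 1) ** 0.5
```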
## Training

| Setting | Value |
|---------|-------|
| Optimizer | Parallel Muon (8-GPU) + AdamW for embeddings |
| Parameter Banking | 4 contiguous 3D banks (qo, kv, mlp_up, mlp_down) |
| Batch | 786,432 tokens/step, seq_len=2048 |
| EMA | α=0.997, tight SWA every 50 steps when lr_scale < 0.2 |
| Late QAT | STE fake-quant activates when lr_scale < 0.15 (step 6299) |
| Warmdown | 4000 iters (wallclock-adaptive) |
| Grad clip | 0.3 |
| Max wallclock | 600s (10 min) |
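
The Late QAT row means the forward pass starts seeing int6-rounded weights once lr_scale drops below 0.15, while gradients flow through unmodified (straight-through estimator), so the network settles onto the quantization grid before GPTQ runs. Below is a minimal per-tensor sketch; the submission's actual scale grouping is not shown here.

```python
import torch

def ste_fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    """Symmetric int6 fake-quantization with a straight-through gradient.

    The forward pass sees quantized weights; the backward pass treats the
    rounding as identity because it is detached.
    """
    qmax = 31  # one common symmetric int6 range: [-31, 31]
    scale = (w.detach().abs().max() / qmax).clamp(min=1e-12)
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()

# Applied only late in training, e.g.:
# weight_used = ste_fake_quant_int6(weight) if lr_scale < 0.15 else weight
```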

## Post-Training Quantization (GPTQ)

| Step | Detail |
|------|--------|
| Calibration | Autoregressive self-generated: 64 seqs × 2048 tokens, temp=0.8 |
| Hessian | Collected across all 68 quantizable layers |
| Method | Full Hessian GPTQ int6: Cholesky + column reordering + block error compensation |
| Clip search | 5 percentiles tried per weight matrix, best MSE wins |
| Pruning | Selective ±1 pruning available, but unused: the model fit within the size budget |
| Compression | LZMA preset=9 |
| Serialized model | 15,750,244 bytes (int6 + LZMA) |
| Code | 82,264 bytes |
| **Total artifact** | **15,832,508 bytes** |
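
The quantization steps above follow the GPTQ recipe: accumulate a Hessian proxy from calibration activations, then quantize weight columns one at a time while pushing each column's rounding error into the not-yet-quantized columns through the inverse Hessian. The sketch below is a simplified single pass without the blocking, column reordering, or per-matrix clip search listed in the table; the damping constant and per-row scales are illustrative.

```python
import torch

def gptq_int6(W: torch.Tensor, H: torch.Tensor, damp: float = 0.01):
    """Simplified GPTQ: quantize columns left-to-right, compensating each
    column's rounding error in the remaining columns via the inverse Hessian.

    W: (out_features, in_features) weights; H: (in, in) Hessian proxy ~ 2 * X @ X.T
    accumulated from calibration activations X.
    """
    W = W.clone()
    n = W.shape[1]
    # damp the diagonal for stability, then take the upper Cholesky factor of
    # the inverse Hessian (same factorization the GPTQ paper uses)
    H = H + damp * H.diag().mean() * torch.eye(n, dtype=W.dtype, device=W.device)
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    Hinv = torch.linalg.cholesky(Hinv, upper=True)

    qmax = 31                                             # symmetric int6: [-31, 31]
    scale = (W.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)  # per-row, no clip search
    Q = torch.zeros_like(W)
    for i in range(n):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale[:, 0]), -qmax, qmax)
        Q[:, i] = q
        err = (w - q * scale[:, 0]) / Hinv[i, i]
        # error compensation: remaining columns absorb this column's rounding error
        W[:, i:] -= err.unsqueeze(1) * Hinv[i, i:].unsqueeze(0)
    return Q.to(torch.int8), scale
```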

## How to Run

### Leaderboard run (8×H100 SXM)
```bash
pip install flash_attn_3 --no-deps --find-links \
https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
cd records/track_10min_16mb/2026-04-08_11L_FullGPTQ_XSA11_BigramHash3072
SEED=1337 bash run_leaderboard_8xh100.sh
```

### Smoke test (1 GPU, ~5 min)
```bash
bash run_smoke_1gpu.sh
```

## PR

[openai/parameter-golf#1473](https://github.com/openai/parameter-golf/pull/1473)
@@ -0,0 +1,35 @@
#!/usr/bin/env bash
set -euo pipefail

# Leaderboard run: 8×H100 SXM, 10 minutes
# This is the full submission configuration

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"

export DATA_PATH="$REPO_ROOT/data/datasets/fineweb10B_sp1024"
export TOKENIZER_PATH="$REPO_ROOT/data/tokenizers/fineweb_1024_bpe.model"

export NUM_LAYERS=11
export MAX_WALLCLOCK_SECONDS=600
export WARMDOWN_ITERS=4000
export WARMUP_STEPS=20
export TRAIN_BATCH_TOKENS=786432
export TRAIN_SEQ_LEN=2048
export EVAL_SEQ_LEN=2048
export VAL_LOSS_EVERY=4000
export TRAIN_LOG_EVERY=500
export ITERATIONS=20000
export EVAL_STRIDE=64
export SEED=${SEED:-1337}
export RUN_ID="leaderboard_${SEED}"
export TARGET_MB="15.9"

echo "=== LEADERBOARD RUN: 11L FullGPTQ + XSA + BigramHash (8x H100 SXM, 10min) ==="
echo "SEED=$SEED"
echo "SCRIPT_DIR=$SCRIPT_DIR"
echo "REPO_ROOT=$REPO_ROOT"
echo "DATA_PATH=$DATA_PATH"

cd "$SCRIPT_DIR"
torchrun --standalone --nproc_per_node=8 train_gpt.py
@@ -0,0 +1,34 @@
#!/usr/bin/env bash
set -euo pipefail

# Smoke test: 1 GPU, 2 minutes, reduced settings
# Verifies the training + GPTQ + eval pipeline works end-to-end

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"

export DATA_PATH="$REPO_ROOT/data/datasets/fineweb10B_sp1024"
export TOKENIZER_PATH="$REPO_ROOT/data/tokenizers/fineweb_1024_bpe.model"

export NUM_LAYERS=11
export MAX_WALLCLOCK_SECONDS=120
export WARMDOWN_ITERS=800
export WARMUP_STEPS=5
export TRAIN_BATCH_TOKENS=262144
export TRAIN_SEQ_LEN=1024
export EVAL_SEQ_LEN=1024
export VAL_LOSS_EVERY=1000
export TRAIN_LOG_EVERY=50
export ITERATIONS=20000
export EVAL_STRIDE=64
export SEED=1337
export RUN_ID="smoke_1gpu"
export TARGET_MB="15.9"

echo "=== SMOKE TEST: 11L FullGPTQ + XSA + BigramHash (1 GPU, ~2min) ==="
echo "SCRIPT_DIR=$SCRIPT_DIR"
echo "REPO_ROOT=$REPO_ROOT"
echo "DATA_PATH=$DATA_PATH"

cd "$SCRIPT_DIR"
python train_gpt.py
@@ -0,0 +1,30 @@
{
"author": "AVINASH0052",
"github_id": "AVINASH0052",
"name": "11L FullGPTQ + XSA-all + BigramHash 3072x112",
"blurb": "11L 512d GQA 8H/4KV, LeakyReLU(0.5)^2 MLP 3x, Parameter Banking + Parallel Muon, BigramHash 3072x112, XSA all 11 layers, SmearGate, Partial RoPE 16/64, LN Scale 1/sqrt(L+1), VE128 layers 9-10, U-Net skips, EMA(0.997) + Tight SWA, Late QAT, Full Hessian GPTQ int6 with AR self-gen calibration, Selective +/-1 pruning, LZMA-9, Sliding window stride=64",
"date": "2026-04-08",
"track": "10min_16mb",
"val_loss": 1.88370722,
"val_bpb": 1.11564047,
"artifact_bytes": 15832508,
"steps": 6891,
"step_avg_ms": 87.08,
"seeds": [1337],
"seed_results": {
"1337": {
"val_loss": 1.88370722,
"val_bpb": 1.11564047,
"artifact_bytes": 15832508,
"steps": 6891,
"post_ema_val_bpb": 1.1350,
"gptq_roundtrip_val_bpb": 1.13972881,
"sliding_window_val_bpb": 1.11564047
}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.6.0",
"cuda_version": "12.4",
"technique_summary": "Full Hessian GPTQ int6 + AR self-gen calibration + XSA-all + BigramHash 3072x112 + Parallel Muon + Parameter Banking + LZMA9",
"train_command": "SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py"
}