Open
37 commits
b4f0b51
Port QUANT_ONLY_CHECKPOINT mode to SP8192 base
Apr 20, 2026
1047be9
Port mixed-regime GPTQ calibration to SP8192 base (CALIB_SPLIT_BY_MOD…
Apr 20, 2026
52b0a80
Port per-layer QK-Gain init schedule to SP8192 base (QK_GAIN_INIT_SCH…
Apr 20, 2026
27b07e0
Add overnight run launch playbook (LAUNCH_COMMANDS.md)
Apr 20, 2026
4bf66ae
Sprint status: SP8192 rebase + 3 levers ported, overnight run pending
Apr 20, 2026
8651ffe
Submission-clean: SP8192 + QK-Gain init schedule (compressed)
Apr 20, 2026
4563084
Add SUBMISSION_STATUS.md with artifact sizes and submission checklist
Apr 20, 2026
1cced2e
Add 3-seed training results (val_bpb mean 1.07060) and summary
Apr 21, 2026
6d43f03
Add PR1797+4-lever submission and SP8192 QK-Gain insurance submission
Apr 25, 2026
f4d41f3
Smoke C validated: 4-lever build works under 16M cap
TanishGudise Apr 27, 2026
cbf3fc6
Optimize run command: LQER rank=6/top_k=5, TTT LoRA rank=128, GPTQ ca…
Apr 28, 2026
b302900
Add --skip-docs, --start-shard-train, --max-docs to prepare_caseops_d…
Apr 28, 2026
e954339
Save dexhunter seed 314 reproduction artifact + logs
TanishGudise Apr 28, 2026
47f8b1d
BOS SmearGate leak fix + DocStartSequenceLoader for GPTQ calibration
Apr 29, 2026
0064565
Sprint logs S1-S15 including PR1855 hybrid (S13), parity check (S14),…
TanishGudise Apr 29, 2026
6289aa8
Night 3 sprint summary including pod recovery instructions
TanishGudise Apr 29, 2026
ae49152
Final S15 result and tomorrow's plan - S14 1.06067 is best, BOS fix r…
TanishGudise Apr 29, 2026
16dcac8
Add TTT_LORA_EMA_DECAY env gate (per-batch EMA across chunks for stab…
Apr 29, 2026
cf45bd5
S16/S17/S18 sweep logs from pod sprint
TanishGudise Apr 29, 2026
e5d010c
S20 OOM (TTT EMA), S21=1.05961 (NUM_PHASES=2 tied with S16), S22=1.05…
TanishGudise Apr 29, 2026
0befceb
Add TTT_UPDATE_EVERY env gate (gradient accumulation across N chunks;…
Apr 29, 2026
e8c1f9d
Add COMPRESSOR=pergroup: role-bucketed lrzip/ZPAQ compression
Apr 30, 2026
3218374
Add TTT_GRAD_STEPS_CLEAN=1 for independent per-step gradient zeroing
Apr 30, 2026
25e6722
Fix lrzip -d: remove unsupported -k flag (lrzip 0.651)
Apr 30, 2026
8bb1eb7
Add pergroup roundtrip diagnostic scripts
Apr 30, 2026
3b68277
Port PR #1145 n-gram tilt (v21) onto grad-steps-clean base
Apr 30, 2026
6a3448f
Sprint sweep logs S36-S54: full lever exploration through n-gram tilt…
TanishGudise Apr 30, 2026
dcb6fc2
Validation results: 3-seed mean 1.05759 BEATS #1967 (1.05851) by -0.0…
TanishGudise Apr 30, 2026
d8969c6
Compliance fix: disable within-doc and word-start experts (C1 violation)
May 1, 2026
fad9a50
Add AsymLogit Rescale (PR #1923): independent pos/neg softcap at eval
May 1, 2026
5d6aea2
S57: token-only ngram tilt + AsymLogit Rescale = 1.05759 single seed
TanishGudise May 1, 2026
14d45d9
S61 seed 42 (without EVAL_INCLUDE_TAIL): 1.05645 / 470.9s eval / val_…
TanishGudise May 1, 2026
cab2257
Sprint logs: S15-S62 + validation runs
TanishGudise May 1, 2026
21981c1
S65 (compliant submission candidate): S58 base + NGRAM_HINT_PRECOMPUT…
TanishGudise May 1, 2026
6397acc
S66 fail (TOKEN_ORDER=8): 1.05989/632.9s NON-COMPLIANT, both BPB and …
TanishGudise May 1, 2026
d6f6146
Final submission: 3-seed mean 1.05670 BPB (S67 = NUM_PHASES=1)
TanishGudise May 1, 2026
3aaace3
Sprint debris: rejected runs (S64 TAIL fail, S65 seed 42 over cap, UU…
TanishGudise May 1, 2026
17 changes: 16 additions & 1 deletion .gitignore
@@ -8,4 +8,19 @@ data/manifest.json
data/docs_selected.jsonl
.mypy_cache/
.venv
logs/
logs/*.pt
*.ptz
# Large model artifacts
final_model.pt
*.pt

# Compiled binaries
*.so
records/track_10min_16mb/*/lib*.so

# Process pids and download logs
logs/download.pid
logs/hf_download_*.log

# Empty/garbage files
logs/0
176 changes: 176 additions & 0 deletions LAUNCH_COMMANDS.md
@@ -0,0 +1,176 @@
# Overnight Run Launch — Manual Playbook

Run from SSH into the RunPod box, in the `/workspace/parameter-golf` directory.
Time required: ~20 minutes (10 min smoke tests, 5 min launch, 5 min verify), plus ~15 minutes if you run the optional Step 6.

## Step 1: Pull the latest code

```bash
cd /workspace/parameter-golf
git fetch origin
git checkout sp8192-rebase
git pull origin sp8192-rebase
git log --oneline -10 # confirm you see: rebase + 3 lever commits
```

Expected commits (top 4):
```
Port per-layer QK-Gain init schedule to SP8192 base (QK_GAIN_INIT_SCHEDULE)
Port mixed-regime GPTQ calibration to SP8192 base (CALIB_SPLIT_BY_MODULE)
Port QUANT_ONLY_CHECKPOINT mode to SP8192 base
Gitignore logs dir, *.pt, *.ptz checkpoint files
```

## Step 2: Environment sanity checks

```bash
nvidia-smi # confirm H100, no other users
df -h /workspace # confirm >50GB free
python --version # should be Python 3.10+
pip show torch | grep Version # should be 2.11.0+cu130 or compatible
pip show flash-attn 2>/dev/null | grep Version # FA3 must be installed
pip show sentencepiece 2>/dev/null | grep Version # required for SP8192
pip show brotli 2>/dev/null | grep Version # required for compression
```

If flash-attn, brotli, or sentencepiece is missing: `pip install flash_attn_3-3.0.0 brotli sentencepiece` (check requirements.txt first for the exact package names and versions).
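
The tool and disk checks above can also be scripted so a missing prerequisite stops the session early. This is a minimal sketch, not part of the repository: the helper names `require_cmds`, `free_gib`, and `has_free_gib` are hypothetical, and the 50 GiB threshold simply mirrors the comment in the block above.

```bash
#!/usr/bin/env bash
# Preflight sketch (hypothetical helpers, not part of the repo).

# Report every command in "$@" that is not on PATH; return non-zero if any is missing.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Print the free space, in whole GiB, of the filesystem that holds path $1.
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Return 0 if the filesystem holding path $1 has at least $2 GiB free.
has_free_gib() {
  [ "$(free_gib "$1")" -ge "$2" ]
}

# Example (not run automatically):
#   require_cmds git python pip nvidia-smi tmux && has_free_gib /workspace 50
```

The helpers only confirm that the tools exist and that disk space is available. Package versions (torch, flash-attn) still need the manual `pip show` checks.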

## Step 3: Start tmux

```bash
tmux new -s pg_overnight
```

## Step 4: GPU smoke test — base SP8192 (100 iters, ~3 min)

Run the clean SP8192 base to confirm the stack is functional:

```bash
mkdir -p logs
MAX_WALLCLOCK_SECONDS=180 \
DATA_DIR=/workspace/parameter-golf/data \
python records/track_10min_16mb/2026-04-05_SP8192_GPTQ-Embeddings_SDClip_Loop45x2/train_gpt_human.py \
2>&1 | tee logs/sp8192_base_smoke_$(date +%Y%m%d_%H%M).log
```

Pass criteria:
- Loss decreasing for 50+ steps
- No CUDA OOM
- No import errors
- Exits cleanly (no NaN/Inf)

If this fails, DO NOT proceed. Debug first.
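
Part of the pass criteria can be checked mechanically by scanning the smoke log for failure markers. The sketch below assumes nothing about the training script's log format, which this playbook does not show. It only searches for generic strings: the whole words `nan` and `inf`, PyTorch's "CUDA out of memory" message, and Python traceback and import-error names. The function name `smoke_log_clean` is hypothetical.

```bash
#!/usr/bin/env bash
# Smoke-log scan sketch (hypothetical helper; the patterns are generic assumptions).

# Return 0 if log file $1 contains no failure markers.
# Otherwise print the offending lines (with line numbers) and return 1.
smoke_log_clean() {
  local log="$1"
  # -w matches nan/inf only as whole words, so "info" does not trigger it.
  if grep -Eiwn 'nan|inf' "$log"; then
    return 1
  fi
  if grep -En 'CUDA out of memory|Traceback \(most recent call last\)|ImportError|ModuleNotFoundError' "$log"; then
    return 1
  fi
  return 0
}

# Example (not run automatically):
#   smoke_log_clean logs/sp8192_base_smoke_YYYYMMDD_HHMM.log && echo "no failure markers"
```

A clean scan does not show that the loss is decreasing, so that criterion still needs a look at the log itself. A config dump that legitimately prints `inf` would also be flagged.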

## Step 5: GPU smoke test — ported file with QK-Gain schedule (50 iters, ~3 min)

```bash
QK_GAIN_INIT_SCHEDULE="2.0,2.5,3.0,3.5,4.0,4.5,4.5,4.0,3.5,3.0,2.5" \
MAX_WALLCLOCK_SECONDS=180 \
DATA_DIR=/workspace/parameter-golf/data \
python train_gpt_sp8192_opt.py \
2>&1 | tee logs/sp8192_opt_qkgain_smoke_$(date +%Y%m%d_%H%M).log
```

Pass criteria:
- Loss decreasing (no explosion from aggressive gain values)
- No NaN/Inf in first 50 steps

If loss explodes, use the conservative fallback schedule instead:
```bash
QK_GAIN_INIT_SCHEDULE="1.5,1.7,2.0,2.2,2.5,2.5,2.3,2.0,1.8,1.6,1.5"
```
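
Both schedules above contain 11 comma-separated values, so a mistyped schedule (a dropped entry or a stray comma) is easy to catch before launching. The sketch below counts the entries and rejects anything that is not a plain decimal number. The helper name `schedule_len` is hypothetical, and the expected count of 11 is inferred from the two schedules in this playbook, not from the training script.

```bash
#!/usr/bin/env bash
# Schedule sanity-check sketch (hypothetical helper, not part of the repo).

# Print the number of comma-separated entries in $1.
# Return 1 if any entry is not a plain non-negative decimal such as 2 or 2.5.
schedule_len() {
  printf '%s\n' "$1" | awk -F',' '
    {
      for (i = 1; i <= NF; i++)
        if ($i !~ /^[0-9]+(\.[0-9]+)?$/) bad = 1
      print NF
    }
    END { exit bad }'
}

# Example (not run automatically); 11 is an assumption taken from the schedules above:
#   n=$(schedule_len "$QK_GAIN_INIT_SCHEDULE") && [ "$n" -eq 11 ] || echo "bad schedule" >&2
```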

## Step 6: Verify QUANT_ONLY_CHECKPOINT mode (optional, ~15 min)

If you have a checkpoint from a prior run (`final_model.pt`):

```bash
QUANT_ONLY_CHECKPOINT=/workspace/parameter-golf/final_model.pt \
DATA_DIR=/workspace/parameter-golf/data \
python train_gpt_sp8192_opt.py \
2>&1 | tee logs/sp8192_opt_quant_only_test_$(date +%Y%m%d_%H%M).log
```

Expected: skips training, prints `[QUANT_ONLY] Loading checkpoint`, runs GPTQ, prints BPB.

## Step 7: Launch overnight run (nohup, backgrounded)

```bash
QK_GAIN_INIT_SCHEDULE="2.0,2.5,3.0,3.5,4.0,4.5,4.5,4.0,3.5,3.0,2.5" \
DATA_DIR=/workspace/parameter-golf/data \
SEED=42 \
nohup python train_gpt_sp8192_opt.py \
> logs/sp8192_opt_overnight_$(date +%Y%m%d_%H%M).log 2>&1 &

echo $! > logs/sp8192_opt_overnight.pid
echo "PID: $(cat logs/sp8192_opt_overnight.pid)"
```

## Step 8: Verify it's running (watch for 2 minutes)

```bash
sleep 120
tail -n 30 "$(ls -t logs/sp8192_opt_overnight_*.log | head -1)" # newest overnight log only
ps -p $(cat logs/sp8192_opt_overnight.pid)
nvidia-smi # GPU should be >80% utilized
```

## Step 9: Note pod details

```bash
mkdir -p logs
cat >> logs/OVERNIGHT_RUN_NOTES.md << EOF
Pod ID: ${RUNPOD_POD_ID:-unknown}
Hostname: $(hostname)
Started: $(date)
PID: $(cat logs/sp8192_opt_overnight.pid 2>/dev/null)
Log: $(ls logs/sp8192_opt_overnight_*.log 2>/dev/null | tail -1)
QK_GAIN_INIT_SCHEDULE: 2.0,2.5,3.0,3.5,4.0,4.5,4.5,4.0,3.5,3.0,2.5
EOF
cat logs/OVERNIGHT_RUN_NOTES.md
```

## Step 10: Detach tmux and sleep

Detach: press **Ctrl-B**, then **D**

The run will continue in the background. Budget: ~$24-30 for 8-10hr on 1×H100 at $3/hr.

---

## Fallback: QUANT_ONLY sweep on old checkpoint (if Step 5 failed)

Run a calibration sweep on the existing Mar 25 checkpoint (produces sweep data, not SP8192 data):

```bash
QUANT_ONLY_CHECKPOINT=/workspace/parameter-golf/final_model.pt \
DATA_DIR=/workspace/parameter-golf/data \
CALIB_SPLIT_BY_MODULE=1 \
CALIB_ATTN_BATCHES=128 \
CALIB_MLP_BATCHES=64 \
nohup python train_gpt_sp8192_opt.py \
> logs/quant_only_calib_sweep_$(date +%Y%m%d_%H%M).log 2>&1 &

echo $! > logs/quant_only_calib_sweep.pid
```

---

## 3-seed record run (after overnight produces a promising BPB)

If the overnight run shows BPB improvement vs SP8192 base (1.08563), run the 3-seed record attempt:

```bash
# Edit run_record_3seed.sh to point at train_gpt_sp8192_opt.py first
# Then:
SCHEDULE="2.0,2.5,3.0,3.5,4.0,4.5,4.5,4.0,3.5,3.0,2.5"
for SEED in 42 314 999; do
QK_GAIN_INIT_SCHEDULE="${SCHEDULE}" \
SEED=${SEED} \
DATA_DIR=/workspace/parameter-golf/data \
python train_gpt_sp8192_opt.py \
2>&1 | tee logs/sp8192_opt_seed${SEED}_$(date +%Y%m%d_%H%M).log
done
```

Statistical bar: must beat 1.08563 by ≥0.005 BPB at p<0.01 via 3-seed Welch t-test to qualify as a record.
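
The Welch statistic for the bar above can be sketched with `awk` as follows. It computes the two sample means, the t statistic, and the Welch–Satterthwaite degrees of freedom from two whitespace-separated lists of per-seed BPB values. The function name `welch_t` is hypothetical, and the per-seed baseline values are not listed in this playbook, so both inputs are placeholders to be filled in from the actual logs.

```bash
#!/usr/bin/env bash
# Welch t-test sketch (hypothetical helper; inputs come from your own logs).

# welch_t "a1 a2 a3" "b1 b2 b3"
# Prints: mean_a mean_b t df   (t > 0 when sample a has the higher mean)
welch_t() {
  awk -v a="$1" -v b="$2" '
    # Set globals N (count), M (mean), V (unbiased variance) for the list in s.
    function stats(s,    v, i, ss) {
      N = split(s, v, " ")
      M = 0
      for (i = 1; i <= N; i++) M += v[i]
      M /= N
      ss = 0
      for (i = 1; i <= N; i++) ss += (v[i] - M) * (v[i] - M)
      V = ss / (N - 1)
    }
    BEGIN {
      stats(a); na = N; ma = M; sa = V / N   # sa = s_a^2 / n_a
      stats(b); nb = N; mb = M; sb = V / N   # sb = s_b^2 / n_b
      t  = (ma - mb) / sqrt(sa + sb)
      df = (sa + sb) * (sa + sb) / (sa * sa / (na - 1) + sb * sb / (nb - 1))
      printf "%.5f %.5f %.3f %.2f\n", ma, mb, t, df
    }'
}

# Example (not run automatically); replace the placeholders with real per-seed BPB values:
#   welch_t "<base_seed1> <base_seed2> <base_seed3>" "<new_seed42> <new_seed314> <new_seed999>"
```

The sketch stops at t and df. The p-value can be read from a t table or computed with `scipy.stats.ttest_ind(..., equal_var=False)`, and the ≥0.005 BPB margin is a separate check on the two means. If both samples have zero variance the division fails, so identical per-seed values need special handling.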