# AGENTS.md

## Codex Agent Instructions

This file is the Codex-specific operating manual for this repo.

- `PLANS.md` is the active execution queue and status board.
- `plan.md` is the long-form research notebook and experiment history.
- `attempts/results.tsv` is the append-only run log.

Start every session by reading all three.

## Mission

Beat the current 10-minute / 16MB SOTA BPB and keep iterating until you do.

- Target to beat: **1.0781 BPB** (PR #672, TTT_EPOCHS=30 Cosine TTT)
- Current merged SOTA as of `2026-03-24`: **1.1194 BPB**
- Stock baseline: **1.2304 BPB**
- Stretch target: **0.6 BPB**

Do not stop at one clean experiment. Keep stacking wins, validating them, and updating the plan.

## Critical Repo Rules

- Work only in the fork `dhruvjatkar/parameter-golf`.
- Never open PRs, push, or otherwise target `openai/parameter-golf`.
- Never modify repo-root `train_gpt.py` or `train_gpt_mlx.py`.
- Always copy the best current training script into a new attempt folder before editing.
- Always run a concurrent baseline alongside every experiment, on the same GPU type and for the same duration, using the unmodified SOTA script (see the launch sketch after this list).
- Always write `hypothesis.md` before launching a run.
- Always append results to `attempts/results.tsv`, including failures.
- Never delete failed attempts. Mark them `DISCARDED` and move on.
- On Explorer, never cancel jobs you did not start. Only manage or cancel job IDs submitted by the current Codex session.

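As a minimal sketch of the concurrent-baseline rule, run on Explorer (the attempt folder name is a placeholder; `run_experiment.sh` and `run_baseline.sh` follow the attempt layout below):

```bash
# Submit the experiment and its unmodified-SOTA baseline back to back,
# so both run on the same GPU type for the same duration.
cd attempts/2026-03-24_ExampleIdea/   # placeholder attempt folder
sbatch run_experiment.sh              # launches the modified copy of the SOTA script
sbatch run_baseline.sh                # launches the unmodified SOTA script
```
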
## Single-Agent Protocol

- Only ONE agent may operate against the Explorer cluster at a time.
- Before starting work, check PLANS.md for an active agent session marker.
- If another agent is active, coordinate with the user before proceeding.
- At session start, write your agent ID and start time to PLANS.md (marker sketch after this list).
- At session end, clear the active agent marker and update handoff notes.

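One possible shape for the marker, assuming a free-form line in `PLANS.md` (the exact format is not prescribed here):

```bash
# Session start: append an active-agent marker (hypothetical format).
echo "ACTIVE AGENT: codex-session-$(date -u +%Y%m%d%H%M) | started $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> PLANS.md

# Session end: remove or overwrite that line by hand and replace it with handoff notes.
```
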
## Competition Rules

Check these before implementation, after implementation, and before submission.

1. Artifact must be `<= 16,000,000` bytes total (size-check sketch after this list).
2. Training must finish within 10 minutes on `8xH100 SXM`.
3. Evaluation must finish within 10 minutes on `8xH100 SXM`.
4. No external downloads or network calls during evaluation.
5. No training on validation data before evaluating it; TTT is legal only on tokens that have already been scored.
6. Do not smuggle extra compute through custom libraries.
7. Do not brute-force seeds or otherwise game variance.
8. Record claims must beat SOTA by at least `0.005` nats with statistical significance, typically 3 seeds.
9. Tokenizer changes require proof that `val_bpb` is still computed correctly.
10. Final competition code must live in a single `train_gpt.py` that runs from the records folder.

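A quick sanity check for rule 1, assuming the artifact is a directory of files (adjust the path; this is not an official checker):

```bash
# Sum the byte sizes of every file under the artifact directory; the total must be <= 16,000,000.
find path/to/artifact -type f -printf '%s\n' | awk '{s += $1} END {print s " bytes"}'
```
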
## Codex Workflow

1. Read `PLANS.md`, `plan.md`, and `attempts/results.tsv`.
2. Pick the highest-priority untried direction that is not blocked or illegal.
3. Create `attempts/YYYY-MM-DD_ShortName/`.
4. Copy the current best legal SOTA script into that folder.
5. Implement the change in the copied file only.
6. Make every new feature toggleable through an env var for clean A/B tests (sketch after this list).
7. Write `hypothesis.md` before any launch script is submitted.
8. Run an adversarial code review before submission.
9. Launch the experiment and the baseline concurrently.
10. Collect logs, record BPB and delta, and update `attempts/results.tsv`.
11. Update both `plan.md` and `PLANS.md` with results, blockers, and next steps.

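For step 6, the usual pattern is to gate the change behind an env var set only in the launch scripts, so experiment and baseline differ by a single line. A sketch with `USE_NEW_FEATURE` as a hypothetical variable name and an illustrative launch command:

```bash
# run_experiment.sh: enable the change under test
export USE_NEW_FEATURE=1
python train_gpt.py

# run_baseline.sh: identical launch with the feature off
# (the copied train_gpt.py reads the flag, e.g. via os.environ.get("USE_NEW_FEATURE", "0"))
export USE_NEW_FEATURE=0
python train_gpt.py
```
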
Submission uses the job backlog (see below). Write SLURM scripts, place them in
`job_backlog/pending/`, rsync to Explorer, then immediately start the next research direction.
Never wait for the queue to clear before beginning new work.

## Code Review Standard

Every experiment must be reviewed as if it is broken until proven otherwise.

Check for:

- Wrong shapes, wrong axes, or silent broadcasting mistakes
- Bad gradient flow, missing `.detach()`, or dead code paths
- Dtype and numerical stability issues
- Flash-attention / SDPA fallback mismatches
- Env vars that do not match the code path they are supposed to toggle
- Evaluation-time legality issues, especially TTT and GPTQ calibration
- Baselines accidentally pointing at modified attempt copies
- Mismatch between `hypothesis.md` and the actual implementation

If an experiment underperforms, review it again before discarding the idea.

## Cluster Defaults

- SSH target: `ssh explorer` as user `d.jatkar`
- Cluster repo: `/projects/Sontag_Lab_Storage/parameter-golf/`
- Environment: `/projects/Sontag_Lab_Storage/parameter-golf-env/`
- Dataset: `./data/datasets/fineweb10B_sp1024/`
- Never write caches to `~/`

Every SLURM script should include:

```bash
source /etc/profile.d/modules.sh
module load python/3.13.5
module load cuda/12.8.0
export TRITON_CACHE_DIR=/projects/Sontag_Lab_Storage/.triton_cache
export HF_HOME=/projects/Sontag_Lab_Storage/.hf_cache
export TORCH_HOME=/projects/Sontag_Lab_Storage/.torch_cache
export XDG_CACHE_HOME=/projects/Sontag_Lab_Storage/.xdg_cache
source /projects/Sontag_Lab_Storage/parameter-golf-env/bin/activate
```

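A minimal full-script skeleton built around that header; the job name, partition, GPU request, time limit, and final launch line are assumptions to adjust per the escalation path below:

```bash
#!/bin/bash
#SBATCH --job-name=pg_experiment      # placeholder name
#SBATCH --partition=gpu-short         # assumed partition for 1-GPU screens under 2 hours
#SBATCH --gres=gpu:1                  # adjust per the GPU escalation path
#SBATCH --time=01:00:00
#SBATCH --output=experiment.log

source /etc/profile.d/modules.sh
module load python/3.13.5
module load cuda/12.8.0
export TRITON_CACHE_DIR=/projects/Sontag_Lab_Storage/.triton_cache
export HF_HOME=/projects/Sontag_Lab_Storage/.hf_cache
export TORCH_HOME=/projects/Sontag_Lab_Storage/.torch_cache
export XDG_CACHE_HOME=/projects/Sontag_Lab_Storage/.xdg_cache
source /projects/Sontag_Lab_Storage/parameter-golf-env/bin/activate

cd /projects/Sontag_Lab_Storage/parameter-golf
python attempts/2026-03-24_ExampleIdea/train_gpt.py   # illustrative launch line
```
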
## GPU Escalation Path

1. `1xA100` for quick screens
2. `1xH200` to confirm wins
3. `8xH200` for leaderboard-equivalent validation

Only move to `8xH200` after a clear 1-GPU improvement.

## Attempt Protocol

Use this layout:

```text
attempts/
  results.tsv
  YYYY-MM-DD_ShortName/
    train_gpt.py
    hypothesis.md
    run_experiment.sh
    run_baseline.sh
    experiment.log
    baseline.log
    submission.json
```

Preferred source script at the moment:

`records/track_10min_16mb/PR672_CosineTTT30_1.0781/train_gpt.py`

Record each run in `attempts/results.tsv` as:

```text
date name bpb baseline_bpb delta gpu status description
```

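For example, one appended row with tab separators (every value here is illustrative, not a real result):

```bash
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  "2026-03-24" "ExampleIdea" "1.0750" "1.0781" "-0.0031" "1xA100" "KEEP" \
  "one-line note on what changed" >> attempts/results.tsv
```
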
Status values:

- `PENDING` for staged work that is not yet `sbatch`-submitted
- `SUBMITTED` for submitted jobs with live SLURM job IDs that have not yet finished
- `RUNNING` for jobs currently executing
- `FAILED` / `TIMEOUT` for infrastructure or runtime failures
- `KEEP` for real wins worth stacking
- `DISCARDED` for confirmed non-wins
- `VALIDATING` for promising multi-seed or 8xH200 follow-up
- `RECORD` for validated SOTA-beating runs

## Search Policy

Search online when:

- `PLANS.md` and `plan.md` are exhausted or stalling
- A referenced paper or PR needs implementation details
- You need to verify the live upstream leaderboard
- A technique is promising but underspecified locally

## Job Backlog System

**WARNING: The crontab auto-submitter does NOT work** (PAM blocks crontab on Explorer). Use manual `sbatch` for all job submissions.

`job_backlog/` is a self-service SLURM queue directory structure.

```text
job_backlog/
  submit_backlog.sh   # cron submitter script on Explorer (non-functional; see warning above)
  pending/            # agents drop .slurm scripts here, then rsync
  submitted/          # moved here after sbatch succeeds
  failed/             # moved here if sbatch fails
  submit.log          # full submission history
```

### One-time cron setup [NON-FUNCTIONAL: PAM blocks crontab]

```bash
ssh explorer "(crontab -l 2>/dev/null; echo '*/5 * * * * /projects/Sontag_Lab_Storage/parameter-golf/job_backlog/submit_backlog.sh') | crontab -"
```

Verify: `ssh explorer "crontab -l | grep submit_backlog"`

### SLURM script naming convention

Always prefix with a timestamp so the submitter processes scripts in order:

```text
YYYY-MM-DD_HH-MM-SS_<attempt_name>_experiment.slurm
YYYY-MM-DD_HH-MM-SS_<attempt_name>_baseline.slurm
```

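A small helper for staging scripts with that prefix (the local script names and attempt name are placeholders):

```bash
ts=$(date +%Y-%m-%d_%H-%M-%S)
cp experiment.slurm "job_backlog/pending/${ts}_ExampleIdea_experiment.slurm"
cp baseline.slurm   "job_backlog/pending/${ts}_ExampleIdea_baseline.slurm"
```
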
### Submission workflow

1. Write SLURM scripts into `job_backlog/pending/`
2. Rsync to Explorer:
   ```bash
   rsync -avz job_backlog/pending/ explorer:/projects/Sontag_Lab_Storage/parameter-golf/job_backlog/pending/
   ```
3. **Immediately pivot to a completely different research direction.** Do not wait.
   - Read `attempts/results.tsv` to see what has worked best so far
   - Search online for new approaches not yet in `plan.md`
   - Pick something architecturally distinct from what is now queued
   - Implement, review, and queue that too
4. Collect results later by rsyncing `job_backlog/submitted/` and attempt `.log` files (sketch below)

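One possible shape for step 4, pulling results back from the cluster (the attempt folder is a placeholder; paths follow the cluster defaults above):

```bash
# Submission history plus any logs written into the attempt folder.
rsync -avz explorer:/projects/Sontag_Lab_Storage/parameter-golf/job_backlog/submitted/ job_backlog/submitted/
rsync -avz explorer:/projects/Sontag_Lab_Storage/parameter-golf/attempts/2026-03-24_ExampleIdea/ attempts/2026-03-24_ExampleIdea/
```
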
### The pivot rule

Once scripts are in the backlog, treat them as running and move on. Multiple agents can all
rsync to `pending/` independently; the submitter serialises submissions safely.

### Session handoff

Before ending any session, update:
- `attempts/results.tsv`: all newly completed results
- `plan.md`: research findings, new hypotheses, priorities
- `PLANS.md`: which experiments are staged/running and what direction comes next

**Goal**: the next agent must be able to read PLANS.md and start new work within 2 minutes.

## Operational Lessons

- Manual `sbatch` is the only reliable submission method on Explorer.
- Use the `gpu-short` partition for 1-GPU screens under 2 hours.
- Always use absolute paths for `DATA_PATH` and `TOKENIZER_PATH` in SLURM scripts.
- `PYTHONPATH` must include `attempts/_compat/` for the `flash_attn_interface` shim.
- Only the final int6 sliding-window BPB is authoritative; pre-EMA snapshots are misleading.
- Non-interactive SSH: wrap commands in `/bin/bash --noprofile --norc -lc '...'` (example below).

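For instance, checking the queue without triggering the interactive login profile (the `squeue` filter is just an example command):

```bash
ssh explorer "/bin/bash --noprofile --norc -lc 'squeue -u d.jatkar'"
```
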
## Remotes

- `origin`: `https://github.com/dhruvjatkar/parameter-golf.git`
- `upstream`: `https://github.com/openai/parameter-golf.git`

Never target `upstream` for pushes or PRs.