# AGENTS.md

## Codex Agent Instructions

This file is the Codex-specific operating manual for this repo.

- `PLANS.md` is the active execution queue and status board.
- `plan.md` is the long-form research notebook and experiment history.
- `attempts/results.tsv` is the append-only run log.

Start every session by reading all three.

## Mission

Beat the current 10-minute / 16MB SOTA BPB and keep iterating until you do.

- Target to beat: **1.0781 BPB** (PR #672, TTT_EPOCHS=30 Cosine TTT)
- Current merged SOTA as of `2026-03-24`: **1.1194 BPB**
- Stock baseline: **1.2304 BPB**
- Stretch target: **0.6 BPB**

Do not stop at one clean experiment. Keep stacking wins, validating them, and updating the plan.

## Critical Repo Rules

- Work only in the fork `dhruvjatkar/parameter-golf`.
- Never open PRs, push, or otherwise target `openai/parameter-golf`.
- Never modify repo-root `train_gpt.py` or `train_gpt_mlx.py`.
- Always copy the best current training script into a new attempt folder before editing.
- Always run a concurrent baseline alongside every experiment, on the same GPU type and for the same duration, using the unmodified SOTA script (see the launch sketch after this list).
- Always write `hypothesis.md` before launching a run.
- Always append results to `attempts/results.tsv`, including failures.
- Never delete failed attempts. Mark them `DISCARDED` and move on.
- On Explorer, never cancel jobs you did not start. Only manage or cancel job IDs submitted by the current Codex session.

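As a minimal sketch of the concurrent-baseline rule, run on Explorer (the attempt folder name is a placeholder; `run_experiment.sh` and `run_baseline.sh` follow the attempt layout below):

```bash
# Submit the experiment and its unmodified-SOTA baseline back to back,
# so both run on the same GPU type for the same duration.
cd attempts/2026-03-24_ExampleIdea/   # placeholder attempt folder
sbatch run_experiment.sh              # launches the modified copy of the SOTA script
sbatch run_baseline.sh                # launches the unmodified SOTA script
```
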
## Single-Agent Protocol

- Only ONE agent may operate against the Explorer cluster at a time.
- Before starting work, check PLANS.md for an active agent session marker.
- If another agent is active, coordinate with the user before proceeding.
- At session start, write your agent ID and start time to PLANS.md (marker sketch after this list).
- At session end, clear the active agent marker and update handoff notes.

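One possible shape for the marker, assuming a free-form line in `PLANS.md` (the exact format is not prescribed here):

```bash
# Session start: append an active-agent marker (hypothetical format).
echo "ACTIVE AGENT: codex-session-$(date -u +%Y%m%d%H%M) | started $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> PLANS.md

# Session end: remove or overwrite that line by hand and replace it with handoff notes.
```
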
## Competition Rules

Check these before implementation, after implementation, and before submission.

1. Artifact must be `<= 16,000,000` bytes total (size-check sketch after this list).
2. Training must finish within 10 minutes on `8xH100 SXM`.
3. Evaluation must finish within 10 minutes on `8xH100 SXM`.
4. No external downloads or network calls during evaluation.
5. No training on validation data before evaluating it; TTT is legal only on tokens that have already been scored.
6. Do not smuggle extra compute through custom libraries.
7. Do not brute-force seeds or otherwise game variance.
8. Record claims must beat SOTA by at least `0.005` nats with statistical significance, typically 3 seeds.
9. Tokenizer changes require proof that `val_bpb` is still computed correctly.
10. Final competition code must live in a single `train_gpt.py` that runs from the records folder.

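A quick sanity check for rule 1, assuming the artifact is a directory of files (adjust the path; this is not an official checker):

```bash
# Sum the byte sizes of every file under the artifact directory; the total must be <= 16,000,000.
find path/to/artifact -type f -printf '%s\n' | awk '{s += $1} END {print s " bytes"}'
```
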
## Codex Workflow

1. Read `PLANS.md`, `plan.md`, and `attempts/results.tsv`.
2. Pick the highest-priority untried direction that is not blocked or illegal.
3. Create `attempts/YYYY-MM-DD_ShortName/`.
4. Copy the current best legal SOTA script into that folder.
5. Implement the change in the copied file only.
6. Make every new feature toggleable through an env var for clean A/B tests (sketch after this list).
7. Write `hypothesis.md` before any launch script is submitted.
8. Run an adversarial code review before submission.
9. Launch the experiment and the baseline concurrently.
10. Collect logs, record BPB and delta, and update `attempts/results.tsv`.
11. Update both `plan.md` and `PLANS.md` with results, blockers, and next steps.

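For step 6, the usual pattern is to gate the change behind an env var set only in the launch scripts, so experiment and baseline differ by a single line. A sketch with `USE_NEW_FEATURE` as a hypothetical variable name and an illustrative launch command:

```bash
# run_experiment.sh: enable the change under test
export USE_NEW_FEATURE=1
python train_gpt.py

# run_baseline.sh: identical launch with the feature off
# (the copied train_gpt.py reads the flag, e.g. via os.environ.get("USE_NEW_FEATURE", "0"))
export USE_NEW_FEATURE=0
python train_gpt.py
```
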
Submission uses the job backlog (see below). Write SLURM scripts, place them in
`job_backlog/pending/`, rsync to Explorer, then immediately start the next research direction.
Never wait for the queue to clear before beginning new work.

## Code Review Standard

Every experiment must be reviewed as if it is broken until proven otherwise.

Check for:

- Wrong shapes, wrong axes, or silent broadcasting mistakes
- Bad gradient flow, missing `.detach()`, or dead code paths
- Dtype and numerical stability issues
- Flash-attention / SDPA fallback mismatches
- Env vars that do not match the code path they are supposed to toggle
- Evaluation-time legality issues, especially TTT and GPTQ calibration
- Baselines accidentally pointing at modified attempt copies
- Mismatch between `hypothesis.md` and the actual implementation

If an experiment underperforms, review it again before discarding the idea.

## Cluster Defaults

- SSH target: `ssh explorer` as user `d.jatkar`
- Cluster repo: `/projects/Sontag_Lab_Storage/parameter-golf/`
- Environment: `/projects/Sontag_Lab_Storage/parameter-golf-env/`
- Dataset: `./data/datasets/fineweb10B_sp1024/`
- Never write caches to `~/`

Every SLURM script should include:

```bash
source /etc/profile.d/modules.sh
module load python/3.13.5
module load cuda/12.8.0
export TRITON_CACHE_DIR=/projects/Sontag_Lab_Storage/.triton_cache
export HF_HOME=/projects/Sontag_Lab_Storage/.hf_cache
export TORCH_HOME=/projects/Sontag_Lab_Storage/.torch_cache
export XDG_CACHE_HOME=/projects/Sontag_Lab_Storage/.xdg_cache
source /projects/Sontag_Lab_Storage/parameter-golf-env/bin/activate
```

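A minimal full-script skeleton built around that header; the job name, partition, GPU request, time limit, and final launch line are assumptions to adjust per the escalation path below:

```bash
#!/bin/bash
#SBATCH --job-name=pg_experiment      # placeholder name
#SBATCH --partition=gpu-short         # assumed partition for 1-GPU screens under 2 hours
#SBATCH --gres=gpu:1                  # adjust per the GPU escalation path
#SBATCH --time=01:00:00
#SBATCH --output=experiment.log

source /etc/profile.d/modules.sh
module load python/3.13.5
module load cuda/12.8.0
export TRITON_CACHE_DIR=/projects/Sontag_Lab_Storage/.triton_cache
export HF_HOME=/projects/Sontag_Lab_Storage/.hf_cache
export TORCH_HOME=/projects/Sontag_Lab_Storage/.torch_cache
export XDG_CACHE_HOME=/projects/Sontag_Lab_Storage/.xdg_cache
source /projects/Sontag_Lab_Storage/parameter-golf-env/bin/activate

cd /projects/Sontag_Lab_Storage/parameter-golf
python attempts/2026-03-24_ExampleIdea/train_gpt.py   # illustrative launch line
```
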
## GPU Escalation Path

1. `1xA100` for quick screens
2. `1xH200` to confirm wins
3. `8xH200` for leaderboard-equivalent validation

Only move to `8xH200` after a clear 1-GPU improvement.

## Attempt Protocol

Use this layout:

```text
attempts/
  results.tsv
  YYYY-MM-DD_ShortName/
    train_gpt.py
    hypothesis.md
    run_experiment.sh
    run_baseline.sh
    experiment.log
    baseline.log
    submission.json
```

Preferred source script at the moment:

`records/track_10min_16mb/PR672_CosineTTT30_1.0781/train_gpt.py`

Record each run in `attempts/results.tsv` as:

```text
date name bpb baseline_bpb delta gpu status description
```

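For example, one appended row with tab separators (every value here is illustrative, not a real result):

```bash
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  "2026-03-24" "ExampleIdea" "1.0750" "1.0781" "-0.0031" "1xA100" "KEEP" \
  "one-line note on what changed" >> attempts/results.tsv
```
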
Status values:

- `PENDING` for staged work that is not yet `sbatch`-submitted
- `SUBMITTED` for submitted jobs with live SLURM job IDs that have not yet finished
- `RUNNING` for jobs currently executing
- `FAILED` / `TIMEOUT` for infrastructure or runtime failures
- `KEEP` for real wins worth stacking
- `DISCARDED` for confirmed non-wins
- `VALIDATING` for promising multi-seed or 8xH200 follow-up
- `RECORD` for validated SOTA-beating runs

## Search Policy

Search online when:

- `PLANS.md` and `plan.md` are exhausted or stalling
- A referenced paper or PR needs implementation details
- You need to verify the live upstream leaderboard
- A technique is promising but underspecified locally

## Job Backlog System

**WARNING: The crontab auto-submitter does NOT work** (PAM blocks crontab on Explorer). Use manual `sbatch` for all job submissions.

`job_backlog/` is a self-service SLURM queue directory structure.

```text
job_backlog/
  submit_backlog.sh   # cron submitter script on Explorer (non-functional; see warning above)
  pending/            # agents drop .slurm scripts here, then rsync
  submitted/          # moved here after sbatch succeeds
  failed/             # moved here if sbatch fails
  submit.log          # full submission history
```

### One-time cron setup [NON-FUNCTIONAL: PAM blocks crontab]

```bash
ssh explorer "(crontab -l 2>/dev/null; echo '*/5 * * * * /projects/Sontag_Lab_Storage/parameter-golf/job_backlog/submit_backlog.sh') | crontab -"
```

Verify: `ssh explorer "crontab -l | grep submit_backlog"`

### SLURM script naming convention

Always prefix with a timestamp so the submitter processes scripts in order:

```text
YYYY-MM-DD_HH-MM-SS_<attempt_name>_experiment.slurm
YYYY-MM-DD_HH-MM-SS_<attempt_name>_baseline.slurm
```

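A small helper for staging scripts with that prefix (the local script names and attempt name are placeholders):

```bash
ts=$(date +%Y-%m-%d_%H-%M-%S)
cp experiment.slurm "job_backlog/pending/${ts}_ExampleIdea_experiment.slurm"
cp baseline.slurm   "job_backlog/pending/${ts}_ExampleIdea_baseline.slurm"
```
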
### Submission workflow

1. Write SLURM scripts into `job_backlog/pending/`
2. Rsync to Explorer:
   ```bash
   rsync -avz job_backlog/pending/ explorer:/projects/Sontag_Lab_Storage/parameter-golf/job_backlog/pending/
   ```
3. **Immediately pivot to a completely different research direction.** Do not wait.
   - Read `attempts/results.tsv` to see what has worked best so far
   - Search online for new approaches not yet in `plan.md`
   - Pick something architecturally distinct from what is now queued
   - Implement, review, and queue that too
4. Collect results later by rsyncing `job_backlog/submitted/` and attempt `.log` files (sketch below)

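One possible shape for step 4, pulling results back from the cluster (the attempt folder is a placeholder; paths follow the cluster defaults above):

```bash
# Submission history plus any logs written into the attempt folder.
rsync -avz explorer:/projects/Sontag_Lab_Storage/parameter-golf/job_backlog/submitted/ job_backlog/submitted/
rsync -avz explorer:/projects/Sontag_Lab_Storage/parameter-golf/attempts/2026-03-24_ExampleIdea/ attempts/2026-03-24_ExampleIdea/
```
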
### The pivot rule

Once scripts are in the backlog, treat them as running and move on. Multiple agents can all
rsync to `pending/` independently; the submitter serialises submissions safely.

### Session handoff

Before ending any session, update:
- `attempts/results.tsv`: all newly completed results
- `plan.md`: research findings, new hypotheses, priorities
- `PLANS.md`: which experiments are staged/running and what direction comes next

**Goal**: the next agent must be able to read PLANS.md and start new work within 2 minutes.

## Operational Lessons

- Manual `sbatch` is the only reliable submission method on Explorer.
- Use the `gpu-short` partition for 1-GPU screens under 2 hours.
- Always use absolute paths for `DATA_PATH` and `TOKENIZER_PATH` in SLURM scripts.
- `PYTHONPATH` must include `attempts/_compat/` for the `flash_attn_interface` shim.
- Only the final int6 sliding-window BPB is authoritative; pre-EMA snapshots are misleading.
- Non-interactive SSH: wrap commands in `/bin/bash --noprofile --norc -lc '...'` (example below).

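For instance, checking the queue without triggering the interactive login profile (the `squeue` filter is just an example command):

```bash
ssh explorer "/bin/bash --noprofile --norc -lc 'squeue -u d.jatkar'"
```
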
## Remotes

- `origin`: `https://github.com/dhruvjatkar/parameter-golf.git`
- `upstream`: `https://github.com/openai/parameter-golf.git`

Never target `upstream` for pushes or PRs.