Commit b7bf52e

Dhruv Jatkar and claude committed:

Update AGENTS.md: new baseline target, single-agent protocol

- Target to beat: 1.0781 BPB (PR openai#672, TTT_EPOCHS=30 Cosine TTT)
- Add single-agent protocol section
- Mark crontab auto-submitter as non-functional
- Add operational lessons from March 2026
- Update preferred source script to PR672 baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent: 0c0ea98

1 file changed: `AGENTS.md` (243 additions, 0 deletions)
# AGENTS.md

## Codex Agent Instructions

This file is the Codex-specific operating manual for this repo.

- `PLANS.md` is the active execution queue and status board.
- `plan.md` is the long-form research notebook and experiment history.
- `attempts/results.tsv` is the append-only run log.

Start every session by reading all three.

## Mission

Beat the current 10-minute / 16MB SOTA BPB and keep iterating until you do.

- Target to beat: **1.0781 BPB** (PR #672, TTT_EPOCHS=30 Cosine TTT)
- Current merged SOTA as of 2026-03-24: **1.1194 BPB**
- Stock baseline: **1.2304 BPB**
- Stretch target: **0.6 BPB**

Do not stop at one clean experiment. Keep stacking wins, validating them, and updating the plan.

## Critical Repo Rules

- Work only in the fork `dhruvjatkar/parameter-golf`.
- Never open PRs, push, or otherwise target `openai/parameter-golf`.
- Never modify repo-root `train_gpt.py` or `train_gpt_mlx.py`.
- Always copy the best current training script into a new attempt folder before editing.
- Always run a concurrent baseline alongside every experiment, on the same GPU type for the same duration, using the unmodified SOTA script.
- Always write `hypothesis.md` before launching a run.
- Always append results to `attempts/results.tsv`, including failures.
- Never delete failed attempts. Mark them `DISCARDED` and move on.
- On Explorer, never cancel jobs you did not start. Only manage or cancel job IDs submitted by the current Codex session.

## Single-Agent Protocol

- Only ONE agent may operate against the Explorer cluster at a time.
- Before starting work, check `PLANS.md` for an active agent session marker.
- If another agent is active, coordinate with the user before proceeding.
- At session start, write your agent ID and start time to `PLANS.md`.
- At session end, clear the active agent marker and update handoff notes.

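The marker check above can be sketched in shell. This is a minimal sketch only: AGENTS.md does not prescribe a marker format, so the one-line `ACTIVE_AGENT:` convention below is an assumption, not the repo's actual scheme.

```shell
#!/usr/bin/env bash
# Hypothetical session-marker handshake against PLANS.md.
# The "ACTIVE_AGENT:" line format is an assumption for illustration.
PLANS=PLANS.md
AGENT_ID="codex-$(date +%s)"

# Session start: proceed only if no other agent's marker is present.
if grep -qs '^ACTIVE_AGENT:' "$PLANS"; then
  echo "Another agent is active; coordinate with the user first." >&2
else
  printf 'ACTIVE_AGENT: %s (started %s)\n' \
    "$AGENT_ID" "$(date -u +%FT%TZ)" >> "$PLANS"
fi

# Session end: clear the marker so the next agent can start cleanly.
sed -i '/^ACTIVE_AGENT:/d' "$PLANS"
```

A real session would run the start and end halves at session boundaries rather than back to back as shown here.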
## Competition Rules

Check these before implementation, after implementation, and before submission.

1. Artifact must be `<= 16,000,000` bytes total.
2. Training must finish within 10 minutes on `8xH100 SXM`.
3. Evaluation must finish within 10 minutes on `8xH100 SXM`.
4. No external downloads or network calls during evaluation.
5. No training on validation data before evaluating it. Legal TTT may only use tokens that have already been scored.
6. Do not smuggle extra compute through custom libraries.
7. Do not brute-force seeds or otherwise game variance.
8. Record claims must beat SOTA by at least `0.005` nats with statistical significance, typically 3 seeds.
9. Tokenizer changes require proof that `val_bpb` is still computed correctly.
10. Final competition code must live in a single `train_gpt.py` that runs from the records folder.

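Rule 1 is mechanical enough to automate. A sketch of a pre-submission size check follows; the `/tmp/demo_artifact` path and file names are placeholders, not the real submission layout.

```shell
#!/usr/bin/env bash
# Sanity check for rule 1: total artifact bytes must be <= 16,000,000.
check_artifact_size() {
  local dir=$1 limit=16000000 total
  # Sum the byte counts of every regular file under the directory.
  total=$(find "$dir" -type f -print0 | xargs -0 cat | wc -c)
  echo "artifact bytes: $total (limit $limit)"
  [ "$total" -le "$limit" ]
}

# Example usage against a scratch directory (placeholder paths):
mkdir -p /tmp/demo_artifact
head -c 1000 /dev/zero > /tmp/demo_artifact/model.bin
check_artifact_size /tmp/demo_artifact && echo "OK: under the 16 MB cap"
```

Running this immediately before packaging catches accidental inclusion of logs or checkpoints in the artifact.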
## Codex Workflow

1. Read `PLANS.md`, `plan.md`, and `attempts/results.tsv`.
2. Pick the highest-priority untried direction that is not blocked or illegal.
3. Create `attempts/YYYY-MM-DD_ShortName/`.
4. Copy the current best legal SOTA script into that folder.
5. Implement the change in the copied file only.
6. Make every new feature toggleable through an env var for clean A/B tests.
7. Write `hypothesis.md` before any launch script is submitted.
8. Run an adversarial code review before submission.
9. Launch the experiment and its baseline concurrently.
10. Collect logs, record BPB and delta, and update `attempts/results.tsv`.
11. Update both `plan.md` and `PLANS.md` with results, blockers, and next steps.

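Step 6's toggle pattern can be sketched from the launch side. `USE_COSINE_TTT` is a hypothetical flag name chosen for illustration; the copied `train_gpt.py` defines the real one, and the key property is that an unset variable must reproduce baseline behavior.

```shell
#!/usr/bin/env bash
# Launch-side sketch of an env-var A/B toggle: the experiment and baseline
# launch scripts differ only in one exported variable.
run_variant() {
  # Stand-in for `srun python train_gpt.py`; echoes the resolved flag instead.
  echo "USE_COSINE_TTT=${USE_COSINE_TTT:-0}"
}

( export USE_COSINE_TTT=1; run_variant )   # experiment arm
run_variant                                # baseline arm: flag unset, so 0
```

Keeping the default equal to baseline behavior is what makes the concurrent baseline run in the Critical Repo Rules a valid control.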
Submission uses the job backlog (see below). Write SLURM scripts, place them in `job_backlog/pending/`, rsync to Explorer, then immediately start the next research direction. Never wait for the queue to clear before beginning new work.

## Code Review Standard

Every experiment must be reviewed as if it is broken until proven otherwise.

Check for:

- Wrong shapes, wrong axes, or silent broadcasting mistakes
- Bad gradient flow, missing `.detach()`, or dead code paths
- Dtype and numerical stability issues
- Flash-attention / SDPA fallback mismatches
- Env vars that do not match the code path they are supposed to toggle
- Evaluation-time legality issues, especially TTT and GPTQ calibration
- Baselines accidentally pointing at modified attempt copies
- Mismatch between `hypothesis.md` and the actual implementation

If an experiment underperforms, review it again before discarding the idea.

## Cluster Defaults

- SSH target: `ssh explorer` as user `d.jatkar`
- Cluster repo: `/projects/Sontag_Lab_Storage/parameter-golf/`
- Environment: `/projects/Sontag_Lab_Storage/parameter-golf-env/`
- Dataset: `./data/datasets/fineweb10B_sp1024/`
- Never write caches to `~/`

Every SLURM script should include:

```bash
source /etc/profile.d/modules.sh
module load python/3.13.5
module load cuda/12.8.0
export TRITON_CACHE_DIR=/projects/Sontag_Lab_Storage/.triton_cache
export HF_HOME=/projects/Sontag_Lab_Storage/.hf_cache
export TORCH_HOME=/projects/Sontag_Lab_Storage/.torch_cache
export XDG_CACHE_HOME=/projects/Sontag_Lab_Storage/.xdg_cache
source /projects/Sontag_Lab_Storage/parameter-golf-env/bin/activate
```

## GPU Escalation Path

1. `1xA100` for quick screens
2. `1xH200` to confirm wins
3. `8xH200` for leaderboard-equivalent validation

Only move to `8xH200` after a clear 1-GPU improvement.

## Attempt Protocol

Use this layout:

```text
attempts/
  results.tsv
  YYYY-MM-DD_ShortName/
    train_gpt.py
    hypothesis.md
    run_experiment.sh
    run_baseline.sh
    experiment.log
    baseline.log
    submission.json
```

Preferred source script at the moment:

`records/track_10min_16mb/PR672_CosineTTT30_1.0781/train_gpt.py`

Record each run in `attempts/results.tsv` as:

```text
date	name	bpb	baseline_bpb	delta	gpu	status	description
```

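Appending a row can be done with a single `printf`; the values below are illustrative, not real results. Tab separators keep the file a valid TSV.

```shell
#!/usr/bin/env bash
# Append one illustrative row to the run log; the numbers are placeholders.
mkdir -p attempts
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  '2026-03-25' 'CosineTTT_sweep' '1.0779' '1.0781' '-0.0002' '1xH200' \
  'VALIDATING' 'TTT epoch sweep around 30' >> attempts/results.tsv
```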
Status values:

- `PENDING` for staged work that is not yet `sbatch`-submitted
- `SUBMITTED` for submitted jobs with live Slurm job IDs that have not yet finished
- `RUNNING` for jobs currently executing
- `FAILED` / `TIMEOUT` for infrastructure or runtime failures
- `KEEP` for real wins worth stacking
- `DISCARDED` for confirmed non-wins
- `VALIDATING` for promising multi-seed or 8xH200 follow-up
- `RECORD` for validated SOTA-beating runs

## Search Policy

Search online when:

- `PLANS.md` and `plan.md` are exhausted or stalling
- A referenced paper or PR needs implementation details
- You need to verify the live upstream leaderboard
- A technique is promising but underspecified locally

## Job Backlog System

**WARNING: The crontab auto-submitter does NOT work** (PAM blocks crontab on Explorer). Use manual `sbatch` for all job submissions.

`job_backlog/` is a self-service SLURM queue directory structure.

```text
job_backlog/
  submit_backlog.sh   # cron script on Explorer (non-functional; see warning)
  pending/            # agents drop .slurm scripts here, then rsync
  submitted/          # moved here after sbatch succeeds
  failed/             # moved here if sbatch fails
  submit.log          # full submission history
```

### One-time cron setup [NON-FUNCTIONAL: PAM blocks crontab]

```bash
ssh explorer "(crontab -l 2>/dev/null; echo '*/5 * * * * /projects/Sontag_Lab_Storage/parameter-golf/job_backlog/submit_backlog.sh') | crontab -"
```

Verify: `ssh explorer "crontab -l | grep submit_backlog"`

### SLURM script naming convention

Always prefix with a timestamp so the submitter processes scripts in order:

```text
YYYY-MM-DD_HH-MM-SS_<attempt_name>_experiment.slurm
YYYY-MM-DD_HH-MM-SS_<attempt_name>_baseline.slurm
```

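Generating names in this convention is a one-liner with `date`. The attempt name below is a placeholder.

```shell
#!/usr/bin/env bash
# Build backlog filenames that sort, and therefore submit, in creation order.
STAMP=$(date +%Y-%m-%d_%H-%M-%S)
ATTEMPT='2026-03-25_CosineTTT_sweep'   # placeholder attempt name
EXP_SCRIPT="${STAMP}_${ATTEMPT}_experiment.slurm"
BASE_SCRIPT="${STAMP}_${ATTEMPT}_baseline.slurm"
echo "$EXP_SCRIPT"
echo "$BASE_SCRIPT"
```

Because the timestamp is lexicographically sortable, a simple `for f in pending/*.slurm` loop in the submitter already processes scripts oldest-first.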
### Submission workflow

1. Write SLURM scripts into `job_backlog/pending/`
2. Rsync to Explorer:

   ```bash
   rsync -avz job_backlog/pending/ explorer:/projects/Sontag_Lab_Storage/parameter-golf/job_backlog/pending/
   ```

3. **Immediately pivot to a completely different research direction.** Do not wait.
   - Read `attempts/results.tsv`: what has worked best so far?
   - Search online for new approaches not yet in `plan.md`
   - Pick something architecturally distinct from what is now queued
   - Implement, review, and queue that too
4. Collect results later by rsyncing `job_backlog/submitted/` and the attempt `.log` files

### The pivot rule

Once scripts are in the backlog, treat them as running and move on. Multiple agents can all rsync to `pending/` independently; the submitter serialises submissions safely.

### Session handoff

Before ending any session, update:

- `attempts/results.tsv` with all newly completed results
- `plan.md` with research findings, new hypotheses, and priorities
- `PLANS.md` with which experiments are staged or running and what direction comes next

**Goal**: the next agent must be able to read `PLANS.md` and start new work within 2 minutes.

## Operational Lessons

- Manual `sbatch` is the only reliable submission method on Explorer.
- Use the `gpu-short` partition for 1-GPU screens under 2 hours.
- Always use absolute paths for `DATA_PATH` and `TOKENIZER_PATH` in SLURM scripts.
- `PYTHONPATH` must include `attempts/_compat/` for the `flash_attn_interface` shim.
- Only the final int6 sliding-window BPB is authoritative; pre-EMA snapshots are misleading.
- Non-interactive SSH: wrap commands in `/bin/bash --noprofile --norc -lc '...'`.

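The clean-shell wrapper from the last lesson can be exercised locally before relying on it over SSH. The `sbatch` command in the comment is illustrative only.

```shell
#!/usr/bin/env bash
# Demonstrate the wrapper locally; on the cluster the same command string is
# passed through ssh, e.g.:
#   ssh explorer "/bin/bash --noprofile --norc -lc 'sbatch my_job.slurm'"
# (the sbatch invocation is illustrative).
# --noprofile/--norc skip startup files that can break non-interactive
# sessions, while -lc still runs the command string in a login-style shell.
OUT=$(/bin/bash --noprofile --norc -lc 'echo ok')
echo "$OUT"
```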
## Remotes

- `origin`: `https://github.com/dhruvjatkar/parameter-golf.git`
- `upstream`: `https://github.com/openai/parameter-golf.git`

Never target `upstream` for pushes or PRs.
