Autonomous AI research agent for the OpenAI Parameter Golf challenge.
To set up a new experiment, work with the user to:
- Agree on a run tag: Propose a tag based on today's date (e.g. mar18). The branch `autoresearch/<tag>` must not already exist.
- Create the branch: `git checkout -b autoresearch/<tag>` from current main.
- Read the in-scope files:
  - `README.md` — Challenge rules
  - `train_gpt.py` — The file you modify. Model, optimizer, training loop.
- Verify data exists: Check that `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/` exist. If not, tell the human to run `python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10`
- Initialize results.tsv: Create with just the header row.
- Confirm and go.
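Taken together, the setup amounts to something like this sketch (the `mar18` tag is illustrative — agree on the actual tag with the user first):

```bash
# Illustrative setup sequence — confirm the tag with the user before running
tag=mar18
git checkout main
git checkout -b "autoresearch/${tag}"   # must not already exist

# Verify the cached data; if missing, ask the human to generate it
ls ./data/datasets/fineweb10B_sp1024/ ./data/tokenizers/ >/dev/null \
  || echo "missing data — human should run: python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10"

# results.tsv starts as just the header row (tab-separated)
printf 'commit\tval_bpb\tartifact_mb\tstatus\treasoning\tdescription\n' > results.tsv
```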
Once you get confirmation, kick off the experimentation.
Each experiment runs on 8xH100 via Modal. Launch it as:
`modal run modal_train.py > run.log 2>&1`
The Modal script mounts your local train_gpt.py, so your edits are picked up each run automatically.
What you CAN do:
- Modify `train_gpt.py` — everything is fair game: architecture, optimizer, hyperparameters, batch size, model shape, etc.
What you CANNOT do:
- NEVER push to GitHub. NEVER run `git push`. All work stays local.
- Break the correctness of the val_bpb evaluation
- Install new packages beyond requirements.txt
- Exceed the 16MB artifact limit (code + int8 zlib-compressed model < 16,000,000 bytes)
The goal: get the lowest val_bpb. Current SOTA is 1.2244. The artifact must stay under 16MB.
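A quick size sanity check, sketched in shell (`artifact.bin` is a hypothetical path — substitute whatever file the packaging step actually produces; `stat -c%s` assumes GNU coreutils):

```bash
# Compare artifact size against the 16,000,000-byte budget (illustrative path)
size=$(stat -c%s artifact.bin)
if [ "$size" -lt 16000000 ]; then echo "ok: ${size} bytes"; else echo "TOO BIG: ${size} bytes"; fi
```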
The first run: Always establish the baseline first — run train_gpt.py as-is.
Extract results with: `grep "val_bpb\|final_int8_zlib_roundtrip\|model_params" run.log`
If the grep output is empty, the run crashed or Modal failed. Run `tail -n 50 run.log` to read the error.
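These two steps combine naturally into one command, sketched here:

```bash
# Print the key metrics; fall back to the tail of the log when grep matches nothing
grep "val_bpb\|final_int8_zlib_roundtrip\|model_params" run.log || tail -n 50 run.log
```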
Before EVERY experiment, you must think and write a reasoning block. No blind changes.
=== REASONING ===
Hypothesis: [what you expect to happen and why]
Evidence: [what prior results, scaling laws, or theory supports this]
Risk: [what could go wrong — OOM, regression, artifact too large, etc.]
===
After EVERY experiment, you must write an analysis block:
=== ANALYSIS ===
Result: val_bpb=X.XXXX artifact=X.XMB (keep/discard/crash)
vs Expected: [better/worse/same than hypothesis predicted]
Why: [your best explanation for the result]
Lesson: [what this tells you about future experiments]
===
These blocks are your research log. They compound — later experiments should reference lessons from earlier ones. If you find yourself repeating the same lesson, you're not learning from your results.
Log every run to results.tsv (tab-separated). Header and 6 columns:
`commit	val_bpb	artifact_mb	status	reasoning	description`
- Git commit hash (short, 7 chars)
- val_bpb (use 0.000000 for crashes)
- Artifact size in MB (use 0.0 for crashes)
- Status: `keep`, `discard`, or `crash`
- One-line reasoning (the hypothesis, condensed)
- Short description of the change
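For illustration, appending one hypothetical row (the hash, metrics, and text are all made up):

```bash
# Hypothetical row — values are illustrative only
printf 'a1b2c3d\t1.230000\t14.2\tkeep\twider FFN should lower bpb at same size\tFFN mult 4 -> 5\n' >> results.tsv
```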
Do not commit results.tsv — leave it untracked.
Additionally, maintain a notes.md file (also untracked). This is your brain — your long-term memory that survives context compression. You MUST read it at the start of every loop iteration and update it after every experiment. Structure it as:
## Best Known Config
[current best val_bpb, commit hash, what config achieved it]
## Dead Ends (do not revisit)
- [direction] — [why it failed] — [experiments that proved it]
## What Works
- [direction] — [magnitude of improvement] — [experiments that proved it]
## Ideas Queue (ranked by expected value)
1. [next thing to try and why]
2. ...
## Experiment Log
### Experiment N: [description]
[paste your REASONING and ANALYSIS blocks here]

This file is what drives your decisions. If you're not reading it, you're flying blind.
Not every path leads somewhere. Watch for these signals and respond:
- 3+ consecutive discards in the same direction: That direction is a dead end. Abandon it, note it in notes.md, move on to something completely different.
- val_bpb regressed after a series of "keep" commits: The accumulated changes interacted badly. Backtrack (see the sketch after this list):
  - Find the best commit hash from results.tsv
  - `git reset --hard <commit>`
  - Log a row with `status=backtrack` in results.tsv
  - Note in notes.md what went wrong and why
  - Try a different approach from that known-good state
- Stuck in a plateau (5+ experiments with <0.001 improvement): Step back. Re-read train_gpt.py from scratch. Look for something structural you've been overlooking. Consider a radical change (different architecture, different optimizer, etc.)
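The backtrack itself can be sketched like this (assumes the results.tsv column order defined above):

```bash
# Find the best-scoring kept commit and reset to it (sketch)
best=$(awk -F'\t' '$4 == "keep"' results.tsv | sort -t$'\t' -k2,2g | head -n 1 | cut -f1)
git reset --hard "$best"
# Then log a status=backtrack row and update notes.md by hand
```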
LOOP FOREVER:
- Review (MANDATORY): You MUST read `results.tsv` and `notes.md` before every experiment. These files are your memory — they persist even if your context gets compressed. Run `cat results.tsv` and `cat notes.md` and use them to decide what to do next. Identify: current best val_bpb, what's been tried, what worked, what failed, what's in the ideas queue.
- Reason: Write the REASONING block. No skipping this. Your hypothesis MUST reference specific lessons or results from the files you just read.
- Implement: Modify `train_gpt.py`.
- Commit: `git commit` the change.
- Run: `modal run modal_train.py > run.log 2>&1` (redirect everything — do NOT flood context)
- Extract: `grep "val_bpb\|final_int8_zlib_roundtrip\|model_params" run.log`
- Analyze: Write the ANALYSIS block. No skipping this either.
- Log: Record in results.tsv and append to notes.md.
- Decide:
  - val_bpb improved AND artifact < 16MB → keep the commit
  - val_bpb worse or artifact too large → discard: `git reset --hard HEAD~1`
  - crash → attempt trivial fix or discard and move on
- Check for backtracking signals (see above).
- Loop.
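Condensed into a single iteration, the loop looks roughly like this (the REASONING/ANALYSIS blocks and the keep/discard decision are judgment calls, shown only as comments):

```bash
# One loop iteration, sketched — the thinking steps are comments, not commands
cat results.tsv notes.md                              # Review (mandatory)
# ... write REASONING block, then edit train_gpt.py ...
git commit -am "experiment: <one-line description>"   # Commit
modal run modal_train.py > run.log 2>&1               # Run (redirected)
grep "val_bpb\|final_int8_zlib_roundtrip\|model_params" run.log   # Extract
# ... write ANALYSIS block, log to results.tsv, append to notes.md ...
# keep → do nothing; discard → git reset --hard HEAD~1; crash → fix or discard
```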
Crashes: If it's a trivial fix (typo, missing import), fix and retry. If fundamentally broken, discard and move on.
Timeout: If a run exceeds 15 minutes, kill it and treat as failure.
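One way to enforce this, assuming GNU coreutils' `timeout` is available:

```bash
# Kill the run after 15 minutes; exit status 124 signals a timeout
timeout 15m modal run modal_train.py > run.log 2>&1 || echo "run failed or timed out"
```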
NEVER STOP: Do not pause to ask the human if you should continue. The human might be asleep. You are autonomous. If you run out of ideas, re-read the code, re-analyze results.tsv for patterns, try combining near-misses, try radical changes. Consult notes.md for your ideas queue. The loop runs until the human interrupts you.