docs: increase MLX smoke validation batch size #36
brendanboyle87 wants to merge 1 commit into openai:main
Conversation
??? this is increasing val batch size??
Sorry if I was off base here. This was based on the fact that this script is for local MLX dev: there was no intermediate output, so I was trying to figure out how long validation would take. Codex gave an estimate in hours vs. minutes: "On this machine, a full validation with the old VAL_BATCH_SIZE=8192 is roughly a 5 to 6+ hour job. With VAL_BATCH_SIZE=524288, it is about 5 minutes. The reason is in train_gpt_mlx.py:766: validation uses VAL_BATCH_SIZE // GRAD_ACCUM_STEPS. With GRAD_ACCUM_STEPS=8 and TRAIN_SEQ_LEN=1024, 8192 means only 1024 eval tokens per batch, which is exactly 1 sequence. 524288 means 65536 eval tokens, or 64 sequences per batch."
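The arithmetic behind that comment can be sketched in a few lines. This is a standalone illustration, not the actual code at train_gpt_mlx.py:766; the constant names mirror the script, but the helper function is hypothetical.

```python
# Constants as described in the comment above (assumed, mirroring train_gpt_mlx.py).
GRAD_ACCUM_STEPS = 8
TRAIN_SEQ_LEN = 1024

def eval_sequences_per_batch(val_batch_size: int) -> int:
    """Sequences evaluated per validation batch for a given VAL_BATCH_SIZE."""
    # Validation reportedly uses VAL_BATCH_SIZE // GRAD_ACCUM_STEPS tokens per batch.
    eval_tokens = val_batch_size // GRAD_ACCUM_STEPS
    return eval_tokens // TRAIN_SEQ_LEN

print(eval_sequences_per_batch(8192))    # old documented value -> 1 sequence per batch
print(eval_sequences_per_batch(524288))  # new documented value -> 64 sequences per batch
```

With only 1 sequence per batch, validation launches one tiny kernel at a time, which is where the hours-vs-minutes gap on local hardware comes from.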
Summary
VAL_BATCH_SIZE=524288

Why

The default validation batch size in the README trial run takes a very long time locally on an M4 Max Mac Studio with 128 GB, so this raises the documented MLX smoke-test value to a more practical local setting.