
docs: increase MLX smoke validation batch size #36

Closed
brendanboyle87 wants to merge 1 commit into openai:main from brendanboyle87:mlx-val-batch-size

Conversation

@brendanboyle87

Summary

  • update the README MLX smoke command to use VAL_BATCH_SIZE=524288
  • keep the rest of the local trial-run example unchanged

Why

The default validation batch size in the README trial run takes a very long time on a local Mac (an M4 Max Mac Studio with 128GB), so this raises the documented MLX smoke-test value to a more practical local setting.

South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add a PR-audit research log entry covering the clean takeaways from pull requests openai#36 through openai#70
- promote long-context training plus matching long-context eval as a first-class clean branch based on PR openai#61 and PR openai#63
- refine mixed-precision export notes to emphasize using int6/int8 byte savings to fund wider MLP capacity, based on PR openai#65
- update the current snapshot and research thesis so future agents do not over-focus on exporter-only ideas after the broader PR sweep
@0hq 0hq added the enhancement New feature or request label Mar 19, 2026
@cocohearts
Collaborator

??? this is increasing val batch size??

@cocohearts cocohearts closed this Mar 20, 2026
@brendanboyle87 brendanboyle87 deleted the mlx-val-batch-size branch March 20, 2026 19:09
@brendanboyle87
Author

brendanboyle87 commented Mar 20, 2026

??? this is increasing val batch size??

Sorry if I was off base here.

This was based on the fact that this script is for local MLX dev. There was no intermediate output, so I was trying to figure out how long validation would take. Codex gave an estimate in hours vs. minutes:

“On this machine, a full validation with the old VAL_BATCH_SIZE=8192 is roughly a 5 to 6+ hour job. With VAL_BATCH_SIZE=524288, it is about 5 minutes.

The reason is in train_gpt_mlx.py:766: validation uses VAL_BATCH_SIZE // GRAD_ACCUM_STEPS. With GRAD_ACCUM_STEPS=8 and TRAIN_SEQ_LEN=1024, 8192 means only 1024 eval tokens per batch, which is exactly 1 sequence. 524288 means 65536 eval tokens, or 64 sequences per
batch. On the local validation split here, that works out to 60,568 eval batches vs 947 eval batches.”
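The arithmetic quoted above can be checked with a short sketch. The constant names are taken from the comment (`VAL_BATCH_SIZE`, `GRAD_ACCUM_STEPS`, `TRAIN_SEQ_LEN`); the total eval-token count is not stated in the PR and is inferred here from the quoted "60,568 eval batches" at 1 sequence per batch, so treat it as an assumption:

```python
import math

TRAIN_SEQ_LEN = 1024
GRAD_ACCUM_STEPS = 8
# Assumed: implied by "60,568 eval batches" of 1 sequence (1024 tokens) each.
EVAL_TOKENS = 60_568 * TRAIN_SEQ_LEN

def eval_batches(val_batch_size: int) -> tuple[int, int]:
    """Return (number of eval batches, sequences per batch) for a given
    VAL_BATCH_SIZE, mirroring the VAL_BATCH_SIZE // GRAD_ACCUM_STEPS split
    described for train_gpt_mlx.py:766."""
    tokens_per_batch = val_batch_size // GRAD_ACCUM_STEPS
    seqs_per_batch = tokens_per_batch // TRAIN_SEQ_LEN
    return math.ceil(EVAL_TOKENS / tokens_per_batch), seqs_per_batch

print(eval_batches(8_192))    # (60568, 1)  -- old README value
print(eval_batches(524_288))  # (947, 64)   -- value proposed in this PR
```

Under these assumptions the numbers reproduce the quoted 60,568-vs-947 batch counts, which is the ~60x reduction behind the hours-vs-minutes estimate.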

gHashTag added a commit to gHashTag/parameter-golf that referenced this pull request Apr 30, 2026
…penai#143) (openai#36)

* feat(dr): railway template + hourly fleet snapshot + DR runbook (refs openai#143)

Anchor: phi^2 + phi^-2 = 3.

Closes the 'one-click DR' gap. After a Railway-account ban, payment
lapse, or project deletion, the operator can rebuild the IGLA fleet
in under 15 minutes via any of three paths:

  A. Railway template marketplace button  (railway-template.json)
  B. GitHub Actions workflow_dispatch     (deploy-from-template.yml)
  C. tri-railway CLI from fleet snapshot  (disaster-recovery/fleet-snapshot.json)

Files added:

  railway-template.json
      Marketplace-ready template with 6 services pinned to GHCR images:
        - trios-mcp-public        (control plane)
        - igla-final-seed-{42,43,44}  (champion lane A, 60K steps)
        - trios-dwagent           (auto-claim further trials)
        - neon-backup-r2          (hourly pg_dump to Cloudflare R2)
      All env vars are placeholder-substituted; no hard-coded secrets.

  scripts/snapshot-fleet.sh
      Probes Railway GraphQL for every (alias, project) tuple and
      emits a single fleet-snapshot.json. Verified locally against
      the live fleet: 29 services across 4 projects across 2 accounts.

  .github/workflows/fleet-snapshot.yml
      Hourly cron at :15 UTC (offset from the audit watchdog at :05).
      Runs the snapshot, commits if the file changed, otherwise no-op.
      Survives any single account ban because the snapshot is in git.

  .github/workflows/deploy-from-template.yml
      Operator-triggered (workflow_dispatch) DR provisioning. Reads
      railway-template.json, calls Railway GraphQL to create one
      project + N services in the chosen account_alias, then writes
      disaster-recovery/last-restore.json with the new UUIDs.

  disaster-recovery/fleet-snapshot.json
      Initial snapshot at this commit (29 services, 4 projects).

  docs/DISASTER_RECOVERY.md
      Full runbook: what survives a ban (table), three trigger paths,
      required secrets, recovery sequence in detail, cost expectations,
      why the template only ships 6 of the 29 services.

  README.md
      DR section reorganised — three trigger paths instead of one.
      Old links to restore-fleet.json / docs/DR.md remain valid via
      the runbook's history section.

Operator action items (not blocking the merge):

  1. https://github.com/settings/personal-access-tokens — rotate the
     PAT shared in chat (out of caution); update TRIOS_REPO_PAT.
  2. https://railway.com/template/new — publish railway-template.json
     to the marketplace; this gives the README its real button URL.
  3. https://dash.cloudflare.com/r2 — create the bucket
     'igla-ledger-backups'; set R2_* secrets in Actions Secrets.

Refs trios#143 / openai#16.

* fix(dr): rewrite snapshot in Rust as 'tri-railway snapshot fleet' (L1)

CI flagged the original scripts/snapshot-fleet.sh under L1 — Rust only,
zero shell scripts. Replace it with a first-class subcommand:

    tri-railway snapshot fleet \
        --out disaster-recovery/fleet-snapshot.json \
        --account 'alias=acc1,token_env=...,project_env=...,label=...' \
        --email   'acc1=user@host'

The subcommand reads tokens and project UUIDs from process env (so
GitHub Actions can wire secrets via job-level env: blocks), probes
Railway GraphQL via the existing transport client, and writes the
exact same JSON the shell script produced. Tokens are NEVER recorded
in the snapshot — only the env-var name is kept under .token_secret.

fleet-snapshot.yml updated to call the binary instead of bash. Locally
verified against the live fleet: 29 services across 4 projects across
2 accounts, byte-identical output to the previous shell version.

Refs trios#143 / openai#16.

---------

Co-authored-by: Perplexity Computer <computer@perplexity.ai>

Labels

enhancement New feature or request


3 participants