
Fix MLX multi-batch validation memory growth #32

Merged

0hq merged 2 commits into openai:main from yhn112:fix-mlx-eval-memory-growth on Mar 18, 2026

Conversation

@yhn112 (Contributor) commented Mar 18, 2026

Summary

  • materialize each MLX validation batch loss before accumulating it, instead of building one lazy graph across the entire validation split
  • keep lightweight validation progress logging so long validation passes do not look hung

Why this happens

This only shows up when validation spans multiple batches. In eval_val(), total_loss was accumulated as an mx.array, so MLX kept extending a single lazy graph until the final mx.eval(...). Because nothing forced evaluation per batch, every batch's intermediates stayed live in that graph, so memory grew steadily during validation and the script appeared to hang.

This is especially easy to hit in the MLX smoke flow because validation always runs over the full fineweb_val_* split, while the README example uses a small VAL_BATCH_SIZE. In practice the first visible symptom is often after the final training step, which makes it look like a post-training hang.

Single-batch validation does not exhibit the problem.
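
In simplified pseudocode, the problematic shape looked roughly like this (loss_fn is an illustrative helper; this is not the repo's exact code):

    total_loss = mx.array(0.0)
    for x, y in val_batches:
        # each iteration only extends the lazy graph; nothing is computed yet
        total_loss = total_loss + loss_fn(model, x, y)
    mx.eval(total_loss)  # one deferred evaluation over the whole split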

Logging

The progress logging is intentional here: once validation is split across many batches, a long final validation pass can otherwise look indistinguishable from a freeze.
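
A minimal sketch of the fixed loop, with the progress logging folded in (again assuming an illustrative loss_fn; names are not the repo's exact code):

    import mlx.core as mx

    def eval_val(model, val_batches, log_every=50):
        total_loss, n = 0.0, 0
        for i, (x, y) in enumerate(val_batches):
            loss = loss_fn(model, x, y)  # lazy mx.array
            mx.eval(loss)                # materialize this batch's loss now
            total_loss += loss.item()    # accumulate as a plain Python float
            n += 1
            if i % log_every == 0:
                print(f"val batch {i}: running mean loss {total_loss / n:.4f}")
        return total_loss / n

Because total_loss is a plain Python float, no graph ever spans batches, and peak memory is bounded by a single batch's intermediates.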

Validation

  • python -m py_compile train_gpt_mlx.py
  • reproduced locally with a short MLX run, confirming that memory growth began only once final validation started, then verified that the fix stopped the growth

@0hq (Contributor) left a comment

Thanks! Missed this.

@0hq merged commit 8253577 into openai:main on Mar 18, 2026
kxddry pushed a commit to kxddry/parameter-golf that referenced this pull request Mar 19, 2026
Fix MLX multi-batch validation memory growth
@yhn112 deleted the fix-mlx-eval-memory-growth branch Mar 19, 2026 10:39
gHashTag added a commit to gHashTag/parameter-golf that referenced this pull request Apr 30, 2026
… fix) (openai#32)

Anchor: phi^2 + phi^-2 = 3.

The first scheduled hourly run failed with:

    jq: parse error: Invalid numeric literal at line 1, column 2

Root cause: 'tail -n 1' on the merged 2>&1 stream caught the
tracing-subscriber INFO line ('audit triplet sealed experience=...')
that prints AFTER the --json blob, instead of the JSON itself.

Fix:

  1. Split stdout (json + text summary) and stderr (tracing) into
     separate files: '> /tmp/accN.txt 2> /tmp/accN.log'.
     The Rust binary already writes tracing to stderr in main.rs
     (with_writer(io::stderr)), so the separation is honest, not
     a workaround.

  2. Replace 'tail -n 1' with 'grep -E "^\s*\{" | tail -n 1'.
     The --json blob is a single line whose first non-whitespace char
     is '{', and tri-railway prints exactly one such line. This is
     robust against future text additions to the human summary.

  3. Synthesize a DRIFT-shaped fallback JSON when grep finds nothing
     (network error, etc.) so the digest step never crashes; the
     workflow goes red on combined exit=1 instead.

  4. Echo both stdout and stderr to the workflow log for triage.

Verified locally against the live IGLA project (Acc1 token):

    grep -E '^\s*\{' /tmp/test_stdout.txt | tail -n 1 > /tmp/test.json
    jq -r '.verdict, .services, .exit_code'
    => NOT YET, 18, 2

Refs openai#16.

Co-authored-by: Perplexity Computer <computer@perplexity.ai>
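
A minimal shell sketch of the extraction that commit describes (file paths follow the commit message; the tri-railway invocation and the surrounding workflow YAML are assumptions):

    # stdout (json + human summary) and stderr (tracing) into separate files (step 1)
    tri-railway --json > /tmp/acc1.txt 2> /tmp/acc1.log || true

    # keep only the single-line JSON blob: first non-space char is '{' (step 2)
    grep -E '^\s*\{' /tmp/acc1.txt | tail -n 1 > /tmp/acc1.json

    # DRIFT-shaped fallback so the digest step's jq never crashes (step 3)
    if ! [ -s /tmp/acc1.json ]; then
        echo '{"verdict":"DRIFT","services":0,"exit_code":1}' > /tmp/acc1.json
    fi

    # echo both streams into the workflow log for triage (step 4)
    cat /tmp/acc1.txt /tmp/acc1.log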
gHashTag added a commit to gHashTag/parameter-golf that referenced this pull request Apr 30, 2026
…penai#33)

The previous fix (PR openai#32) extracted the JSON correctly but then piped
the raw verdict string ('NOT YET') into $GITHUB_OUTPUT without a key,
which the runner rejects:

    Unable to process file command 'output' successfully.
    Invalid format 'NOT YET'

Fix: write 'verdict=<value>' instead. Also replace the space inside
the verdict ('GATE-2 PASS', 'NOT YET') with an underscore so the value
is a single token, since GITHUB_OUTPUT doesn't accept multi-word
unencoded values without the multiline EOF marker.

This output is informational only — the digest step reads from the
JSON file directly via jq, so the encoding change has no downstream
effect.

Refs openai#16.

Co-authored-by: Perplexity Computer <computer@perplexity.ai>
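
The corrected $GITHUB_OUTPUT write, sketched (file and variable names are illustrative):

    # single-token key=value form, as the runner expects
    verdict=$(jq -r '.verdict' /tmp/acc1.json | tr ' ' '_')  # 'NOT YET' -> 'NOT_YET'
    echo "verdict=${verdict}" >> "$GITHUB_OUTPUT"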