WIP: Add adaptive eval-time context non-record MLX submission #62

stpcoder wants to merge 1 commit into openai:main
Conversation
Force-pushed from b7e874d to e13f5a4
**Community Review — WIP: Add adaptive eval-time context non-record MLX submission**

Compliance: NEEDS AUTHOR ACTION.

What I found: the CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with `ModuleNotFoundError: No module named 'mlx'`. This matches a few common patterns I have seen for this class of error in the 2026-04-11 sweep.

Recommendation: once the parse/import issue is fixed, I will re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka (The Agora).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL, `ModuleNotFoundError: No module named 'mlx'`.
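One common way to avoid this class of IMPORT_FAIL in a CPU-only smoke environment is an optional-import guard, so the module still parses on machines without MLX and fails with a clear message only when an MLX code path is actually exercised. This is an illustrative sketch, not the submission's code; the `HAVE_MLX`/`require_mlx` names are assumptions.

```python
# Illustrative optional-import guard for an Apple-Silicon-only dependency.
# On a CPU smoke box without mlx installed, the module still imports cleanly.
try:
    import mlx.core as mx  # only available on Apple Silicon installs
    HAVE_MLX = True
except ModuleNotFoundError:
    mx = None
    HAVE_MLX = False


def require_mlx():
    """Call at the top of any entry point that genuinely needs MLX
    (hypothetical helper, not from the submission)."""
    if not HAVE_MLX:
        raise RuntimeError("mlx is required for this code path; "
                           "install it on Apple Silicon")
```

With this pattern, `python -m py_compile` and a plain `import` both succeed in the smoke environment, and the audit can proceed past the import step.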
After PR openai#61 (byte-disjoint corpus split + `assert_train_val_disjoint` guard) shipped, fix-verify-s43 ran end-to-end on the post-fix pipeline and produced BPB 1.5492 at step=12000, well below the Gate-2 threshold of 1.85 (margin +0.30).

## What this commit changes

- README.md: leads with the honest Gate-2 pass; revised 5-way taxonomy
- LEAK_INVESTIGATION.md: retraction header explaining the 216-row overcount
- trios-igla-1/README.md + config.yaml: updated to point at fix-verify-s43
- ledger_2026-04-30.sql.gz: refreshed snapshot with new last_error markers

## 5-way reclassification (Neon last_error column)

| class | count |
|---|--:|
| post-openai#61 honest Gate-2 pass | 1 |
| post-openai#61 early-stopped < step 9000 | 4 |
| pre-openai#61 W-6 numerical collapse | 46 |
| **pre-openai#61 leak (real)** | 42 |
| **warmup artifact (NOT a leak)** | 179 |

The 179 "warmup artifact" rows are early-stopped runs whose printed val_bpb stayed at 0.0000 for steps 1-8000 due to a trainer-side eval-loop bug (filed as trios-trainer-igla#62). On the post-openai#61 image, fix-verify-s43 escaped warmup at step=9000 and converged to 1.5492 by step=12000, proving the artifact is trainer-side, not data-side.

## Pipeline as flown for fix-verify-s43

- trios-trainer-igla: commit 9517980d (post-openai#61 byte-disjoint corpus)
- trios-railway: commit 69c3467 (no --ctx flag) + openai#56 --ctx accept on trainer + openai#58 smoke_train + stdout.flush() + openai#59 panic hook + startup diagnostic

## Refs

- trios-trainer-igla#56, openai#58, openai#59, openai#60, openai#61, openai#62 (all merged or filed)
- trios-railway@69c3467
- trios-railway#100, openai#101, openai#105 (Scarabaeus Engine track)

R5-honest. We retract the 216-row mass leak flag and submit fix-verify-s43 as our first honest Gate-2 pass candidate. Anchor: phi^2 + phi^-2 = 3.
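The `assert_train_val_disjoint` guard referenced above is not shown in this thread. A minimal sketch of what a byte-disjointness check could look like follows; the chunked-window strategy, the `chunk` parameter, and the exact failure message are all assumptions, not the shipped implementation.

```python
def assert_train_val_disjoint(train_bytes: bytes, val_bytes: bytes,
                              chunk: int = 32) -> None:
    """Illustrative leak guard: fail if any aligned length-`chunk` window of
    the validation stream also occurs anywhere in the training stream."""
    # Index every length-`chunk` window of the training stream (stride 1).
    train_windows = {train_bytes[i:i + chunk]
                     for i in range(len(train_bytes) - chunk + 1)}
    # Check validation windows on a `chunk`-aligned stride.
    for i in range(0, len(val_bytes) - chunk + 1, chunk):
        if val_bytes[i:i + chunk] in train_windows:
            raise AssertionError(f"train/val byte overlap at val offset {i}")
```

The stride-1 training index makes the check catch unaligned copies of a validation window; the memory cost is linear in the training stream, which is acceptable for a one-shot pipeline guard.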
## Summary

This PR adds a non-record MLX submission under `records/track_non_record_16mb/` for adaptive eval-time context. The setup is a local Apple Silicon run, not a leaderboard claim. The main change in this snapshot is the final evaluation path: instead of using one fixed policy over the whole validation stream, it does a coarse pass first and then rescores the harder windows from that pass with a finer stride.
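The two-pass idea described above can be sketched as follows. This is a simplified illustration, not the submission's actual code: `score_window` (a callable returning bits-per-byte for one window), `data_len`, and the plain mean over windows are all assumptions.

```python
def adaptive_eval(score_window, data_len,
                  coarse_stride=256, fine_stride=64, hard_fraction=0.25):
    """Two-pass eval sketch: coarse-stride scoring everywhere, then
    fine-stride rescoring on the hardest fraction of windows."""
    starts = list(range(0, data_len, coarse_stride))
    # Pass 1: score every window at the coarse stride.
    coarse = [score_window(s, coarse_stride) for s in starts]
    # Select the hardest `hard_fraction` of windows by coarse score.
    k = max(1, int(len(starts) * hard_fraction))
    hard = sorted(range(len(starts)), key=lambda i: coarse[i],
                  reverse=True)[:k]
    # Pass 2: rescore only those windows at the fine stride.
    scores = list(coarse)
    for i in hard:
        scores[i] = score_window(starts[i], fine_stride)
    return sum(scores) / len(scores)
```

The design trade-off matches the one described in the summary: the fine stride is only paid on `hard_fraction` of the stream, so the extra eval cost is bounded while the hardest windows get the more careful scoring.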
## Included files

- README.md
- submission.json
- train.log
- compare_standard.log
- train_gpt.py

## Run details

- 9x512, KV4, tied embeddings
- 32768 validation tokens
- 200 iterations
- coarse_stride=256, fine_stride=64, hard_fraction=0.25

## Result in train.log

- val_bpb=2.4070
- val_bpb=2.40284524
- 11297911 bytes

## Same-setup reference

compare_standard.log uses the same setup with standard final evaluation: val_bpb=2.41303630, versus val_bpb=2.40284524 for the adaptive pass. So in this local fixed-step proxy, the adaptive pass improves the final roundtrip score by about 0.01019 bpb, but it also makes the final eval pass slower. That tradeoff is the reason this is being submitted as a WIP non-record result rather than as a performance claim.

## Validation
```
python -m py_compile records/track_non_record_16mb/2026-03-19_AdaptiveEvalContext_MLX_M4Pro_sp1024_200it/train_gpt.py
```
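As a sanity check on the delta quoted above, the two logged scores reproduce the roughly 0.01019 bpb figure:

```python
# Arithmetic check on the reported improvement, using the two values
# from compare_standard.log and train.log quoted in this PR.
standard_bpb = 2.41303630  # standard final evaluation
adaptive_bpb = 2.40284524  # adaptive final evaluation
delta = standard_bpb - adaptive_bpb
print(f"improvement: {delta:.5f} bpb")  # prints "improvement: 0.01019 bpb"
```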