WIP: Add adaptive eval-time context non-record MLX submission #62

stpcoder wants to merge 1 commit into openai:main
Conversation
Force-pushed from b7e874d to e13f5a4
**Community Review — WIP: Add adaptive eval-time context non-record MLX submission**

Compliance: NEEDS AUTHOR ACTION.

What I found: the CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with `ModuleNotFoundError: No module named 'mlx'`. This matches a few common patterns I have seen for this class of error in the 2026-04-11 sweep.

Recommendation: once the parse/import issue is fixed, I will re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.

Reviewed by @MatoTeziTanka (The Agora).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL, `ModuleNotFoundError: No module named 'mlx'`.
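One common way to avoid this class of IMPORT_FAIL in a CPU-only smoke environment is an optional-import guard, so the module still parses on machines without MLX and fails with a clear message only when an MLX code path is actually exercised. This is an illustrative sketch, not the submission's code; the `HAVE_MLX`/`require_mlx` names are assumptions.

```python
# Illustrative optional-import guard for an Apple-Silicon-only dependency.
# On a CPU smoke box without mlx installed, the module still imports cleanly.
try:
    import mlx.core as mx  # only available on Apple Silicon installs
    HAVE_MLX = True
except ModuleNotFoundError:
    mx = None
    HAVE_MLX = False


def require_mlx():
    """Call at the top of any entry point that genuinely needs MLX
    (hypothetical helper, not from the submission)."""
    if not HAVE_MLX:
        raise RuntimeError("mlx is required for this code path; "
                           "install it on Apple Silicon")
```

With this pattern, `python -m py_compile` and a plain `import` both succeed in the smoke environment, and the audit can proceed past the import step.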
After PR openai#61 (byte-disjoint corpus split + `assert_train_val_disjoint` guard) shipped, fix-verify-s43 ran end-to-end on the post-fix pipeline and produced BPB 1.5492 at step=12000, well below the Gate-2 threshold of 1.85 (margin +0.30).

## What this commit changes

- README.md: leads with the honest Gate-2 pass; revised 5-way taxonomy
- LEAK_INVESTIGATION.md: retraction header explaining the 216-row overcount
- trios-igla-1/README.md + config.yaml: updated to point at fix-verify-s43
- ledger_2026-04-30.sql.gz: refreshed snapshot with new last_error markers

## 5-way reclassification (Neon last_error column)

| class | count |
|---|--:|
| post-openai#61 honest Gate-2 pass | 1 |
| post-openai#61 early-stopped < step 9000 | 4 |
| pre-openai#61 W-6 numerical collapse | 46 |
| **pre-openai#61 leak (real)** | 42 |
| **warmup artifact (NOT a leak)** | 179 |

The 179 "warmup artifact" rows are early-stopped runs whose printed val_bpb stayed at 0.0000 for steps 1-8000 due to a trainer-side eval-loop bug (filed as trios-trainer-igla#62). On the post-openai#61 image, fix-verify-s43 escaped warmup at step=9000 and converged to 1.5492 by step=12000, proving the artifact is trainer-side, not data-side.

## Pipeline as flown for fix-verify-s43

- trios-trainer-igla: commit 9517980d (post-openai#61 byte-disjoint corpus)
- trios-railway: commit 69c3467 (no --ctx flag) + openai#56 --ctx accept on trainer + openai#58 smoke_train + stdout.flush() + openai#59 panic hook + startup diagnostic

## Refs

- trios-trainer-igla#56, openai#58, openai#59, openai#60, openai#61, openai#62 (all merged or filed)
- trios-railway@69c3467
- trios-railway#100, openai#101, openai#105 (Scarabaeus Engine track)

R5-honest. We retract the 216-row mass leak flag and submit fix-verify-s43 as our first honest Gate-2 pass candidate. Anchor: phi^2 + phi^-2 = 3.
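The `assert_train_val_disjoint` guard referenced above is not shown in this thread. A minimal sketch of what a byte-disjointness check could look like follows; the chunked-window strategy, the `chunk` parameter, and the exact failure message are all assumptions, not the shipped implementation.

```python
def assert_train_val_disjoint(train_bytes: bytes, val_bytes: bytes,
                              chunk: int = 32) -> None:
    """Illustrative leak guard: fail if any aligned length-`chunk` window of
    the validation stream also occurs anywhere in the training stream."""
    # Index every length-`chunk` window of the training stream (stride 1).
    train_windows = {train_bytes[i:i + chunk]
                     for i in range(len(train_bytes) - chunk + 1)}
    # Check validation windows on a `chunk`-aligned stride.
    for i in range(0, len(val_bytes) - chunk + 1, chunk):
        if val_bytes[i:i + chunk] in train_windows:
            raise AssertionError(f"train/val byte overlap at val offset {i}")
```

The stride-1 training index makes the check catch unaligned copies of a validation window; the memory cost is linear in the training stream, which is acceptable for a one-shot pipeline guard.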
## Summary

This PR adds a non-record MLX submission under `records/track_non_record_16mb/` for adaptive eval-time context. The setup is a local Apple Silicon run, not a leaderboard claim. The main change in this snapshot is the final evaluation path: instead of using one fixed policy over the whole validation stream, it does a coarse pass first and then rescores the harder windows from that pass with a finer stride.
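The two-pass idea described above can be sketched as follows. This is a simplified illustration, not the submission's actual code: `score_window` (a callable returning bits-per-byte for one window), `data_len`, and the plain mean over windows are all assumptions.

```python
def adaptive_eval(score_window, data_len,
                  coarse_stride=256, fine_stride=64, hard_fraction=0.25):
    """Two-pass eval sketch: coarse-stride scoring everywhere, then
    fine-stride rescoring on the hardest fraction of windows."""
    starts = list(range(0, data_len, coarse_stride))
    # Pass 1: score every window at the coarse stride.
    coarse = [score_window(s, coarse_stride) for s in starts]
    # Select the hardest `hard_fraction` of windows by coarse score.
    k = max(1, int(len(starts) * hard_fraction))
    hard = sorted(range(len(starts)), key=lambda i: coarse[i],
                  reverse=True)[:k]
    # Pass 2: rescore only those windows at the fine stride.
    scores = list(coarse)
    for i in hard:
        scores[i] = score_window(starts[i], fine_stride)
    return sum(scores) / len(scores)
```

The design trade-off matches the one described in the summary: the fine stride is only paid on `hard_fraction` of the stream, so the extra eval cost is bounded while the hardest windows get the more careful scoring.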
## Included files

- README.md
- submission.json
- train.log
- compare_standard.log
- train_gpt.py

## Run details

- 9x512, KV4, tied embeddings
- 32768 validation tokens
- 200 iterations
- coarse_stride=256, fine_stride=64, hard_fraction=0.25

## Result in train.log

- val_bpb=2.4070
- val_bpb=2.40284524
- 11297911 bytes

## Same-setup reference

compare_standard.log uses the same setup with standard final evaluation: val_bpb=2.41303630, versus val_bpb=2.40284524 for the adaptive pass. So in this local fixed-step proxy, the adaptive pass improves the final roundtrip score by about 0.01019 bpb, but it also makes the final eval pass slower. That tradeoff is the reason this is being submitted as a WIP non-record result rather than as a performance claim.

## Validation
```
python -m py_compile records/track_non_record_16mb/2026-03-19_AdaptiveEvalContext_MLX_M4Pro_sp1024_200it/train_gpt.py
```
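As a sanity check on the delta quoted above, the two logged scores reproduce the roughly 0.01019 bpb figure:

```python
# Arithmetic check on the reported improvement, using the two values
# from compare_standard.log and train.log quoted in this PR.
standard_bpb = 2.41303630  # standard final evaluation
adaptive_bpb = 2.40284524  # adaptive final evaluation
delta = standard_bpb - adaptive_bpb
print(f"improvement: {delta:.5f} bpb")  # prints "improvement: 0.01019 bpb"
```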