This folder captures a non-record local MLX submission for adaptive eval-time context.

The idea in this snapshot is simple: make one coarse pass over the validation stream, mark the hardest windows from that pass, then rescore only those windows at a finer stride. The training setup stays close to the baseline MLX path; the change is in how the final roundtrip evaluation spends its extra context.

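A minimal sketch of that two-pass scheme, in plain Python rather than the MLX code actually used; `score_window` is a hypothetical stand-in for a strided model forward pass, and the real `train_gpt.py` may select and combine windows differently:

```python
from typing import Callable


def adaptive_eval(
    n_tokens: int,
    coarse_stride: int,
    fine_stride: int,
    hard_fraction: float,
    score_window: Callable[[int, int], float],
) -> float:
    """Two-pass eval: cheap coarse pass everywhere, fine rescoring on hard windows.

    `score_window(start, stride)` returns the mean loss for the window starting
    at `start` when evaluated with stride `stride`.
    """
    # Pass 1: score every non-overlapping coarse window once.
    starts = range(0, n_tokens - coarse_stride + 1, coarse_stride)
    coarse = {s: score_window(s, coarse_stride) for s in starts}

    # Mark the hardest fraction of windows by their coarse loss.
    n_hard = max(1, round(hard_fraction * len(coarse)))
    hard = set(sorted(coarse, key=coarse.get, reverse=True)[:n_hard])

    # Pass 2: spend the extra compute only on the hard windows, rescoring them
    # at the finer stride; every other window keeps its coarse score.
    total = sum(
        score_window(s, fine_stride) if s in hard else loss
        for s, loss in coarse.items()
    )
    return total / len(coarse)


# Toy usage: later windows are "harder", and the finer stride shaves a little loss.
loss = adaptive_eval(
    n_tokens=32768,
    coarse_stride=256,
    fine_stride=64,
    hard_fraction=0.25,
    score_window=lambda start, stride: 4.0 + start / 32768 - (0.01 if stride == 64 else 0.0),
)
print(f"{loss:.4f}")
```
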
This is not a leaderboard claim. It is a local Apple Silicon result meant to document the idea, the code snapshot, and a same-setup comparison against standard final evaluation.

Configuration:
- Hardware: Apple M4 Pro, 48 GB unified memory
- Track: non-record, local Apple Silicon MLX
- Tokenizer/data: `fineweb10B_sp1024`, first train shard, first `32768` validation tokens
- Model: SP-1024, `9x512`, `KV4`, tied embeddings
- Training length: `200` iterations, `8192` train tokens/step
- Final eval mode: adaptive
- Adaptive eval settings: `coarse_stride=256`, `fine_stride=64`, `hard_fraction=0.25` (the window budget these imply is sketched just after this list)
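
A rough window budget implied by those settings, assuming non-overlapping coarse windows over the `32768` validation tokens; the run itself reports `124` coarse windows rather than the ideal `128`, presumably from boundary or context handling in `train_gpt.py`:

```python
val_tokens = 32768
coarse_stride, fine_stride = 256, 64
hard_fraction = 0.25

coarse_windows = val_tokens // coarse_stride          # 128 ideal; train.log reports 124
hard_windows = round(hard_fraction * coarse_windows)  # 32 ideal; round(0.25 * 124) = 31 matches the log
fine_per_hard = coarse_stride // fine_stride          # 4 fine sub-windows per hard window
fine_windows = hard_windows * fine_per_hard           # with the reported 31: 31 * 4 = 124, matching `fine_windows:124`

print(coarse_windows, hard_windows, fine_windows)
```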

Command used for the included adaptive run:
```bash
cd records/track_non_record_16mb/2026-03-19_AdaptiveEvalContext_MLX_M4Pro_sp1024_200it
RUN_ID=cmp200_adapt_c256_f64_h025 \
SEED=1337 \
ITERATIONS=200 \
TRAIN_BATCH_TOKENS=8192 \
GRAD_ACCUM_STEPS=8 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=32768 \
VAL_MAX_TOKENS=32768 \
FINAL_ROUNDTRIP_EVAL=1 \
FINAL_EVAL_MODE=adaptive \
FINAL_EVAL_COARSE_STRIDE=256 \
FINAL_EVAL_FINE_STRIDE=64 \
FINAL_EVAL_HARD_FRACTION=0.25 \
FINAL_EVAL_BATCH_SEQS=16 \
DATA_PATH=../../../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../../../data/tokenizers/fineweb_1024_bpe.model \
../../../.venv/bin/python train_gpt.py > train.log 2>&1
```
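
The `compare_standard.log` reference was a same-setup run with only the final-eval mode changed; presumably the command above with something like `FINAL_EVAL_MODE=standard`, though that exact value is an assumption (only `adaptive` appears in this snapshot), so check `train_gpt.py` for the accepted mode names.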

Included result (`train.log`):
- Pre-quant eval at stop: `val_loss:4.1575`, `val_bpb:2.4070`
- Post-roundtrip eval: `val_loss:4.15029331`, `val_bpb:2.40284524`
- Eval time for final adaptive roundtrip pass: `2386ms`
- Selected windows: `hard_windows:31/124`, `fine_windows:124`
- Serialized model int8+zlib: `11239210 bytes`
- Code size: `58701 bytes`
- Total submission size int8+zlib: `11297911 bytes`

Same-setup reference (`compare_standard.log`):
- Standard final eval: `val_loss:4.16789573`, `val_bpb:2.41303630`
- Eval time: `321ms`

So in this local fixed-step proxy, the adaptive pass improves the final roundtrip score by about `0.01019 bpb` over the same setup with standard final evaluation, but it raises final eval time from `321ms` to `2386ms`, roughly `7.4x`. That tradeoff is the main reason this is being submitted as a non-record WIP rather than as a score claim.
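
A quick arithmetic check on those numbers (illustrative script; every constant is copied from the two logs, and the bytes-per-token line assumes the usual `bpb = loss / (ln 2 * bytes_per_token)` convention, which may not be what `train_gpt.py` uses):

```python
import math

# Copied from train.log / compare_standard.log above.
adaptive_loss, adaptive_bpb = 4.15029331, 2.40284524
standard_bpb = 2.41303630
adaptive_ms, standard_ms = 2386, 321
model_bytes, code_bytes, total_bytes = 11_239_210, 58_701, 11_297_911

assert abs((standard_bpb - adaptive_bpb) - 0.01019) < 1e-5  # the quoted bpb gain
assert model_bytes + code_bytes == total_bytes              # submission sizes add up

print(f"eval-time ratio: {adaptive_ms / standard_ms:.1f}x")  # ~7.4x
print(f"implied bytes/token: {adaptive_loss / math.log(2) / adaptive_bpb:.2f}")  # ~2.49, if the convention holds
```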

Included files:
- `train_gpt.py` - exact MLX code snapshot used for the run
- `train.log` - adaptive local run log
- `compare_standard.log` - same-setup standard-eval comparison log
- `submission.json` - metadata for the run