Add mirrored-recurrence MLX non-record submission #84
Open

cschubiner wants to merge 1 commit into openai:main from cschubiner:codex/parameter-golf-mlx-local-submission
45 changes: 45 additions & 0 deletions

...ds/track_non_record_16mb/2026-03-19_MirrorRecurrence_MLX_M5Max_sp1024/README.md
This non-record run explores mirrored depth recurrence on Apple Silicon MLX.

Idea:
- Keep the baseline parameter budget almost unchanged by reusing `9` unique transformer blocks across `18` logical layers.
- Run the encoder schedule forward (`0..8`) and the decoder schedule in reverse (`8..0`) so the second half reuses the same weights with mirrored skip structure (sketched below).
- Serialize only tensor state, excluding the Python schedule lists that are part of the recurrent control flow.

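A minimal sketch of the mirrored schedule, assuming a hypothetical list of block callables rather than the actual `train_gpt.py` module (which is not part of this diff):

```python
# Hypothetical sketch of mirrored depth recurrence (not the actual train_gpt.py
# code): 9 unique blocks are reused across 18 logical layers by running the
# schedule forward and then reversed.
UNIQUE_LAYERS = 9
NUM_LAYERS = 18  # logical depth; no extra weights beyond the 9 unique blocks


def mirrored_schedule(unique_layers):
    forward = list(range(unique_layers))   # encoder half: 0..8
    return forward + forward[::-1]         # decoder half reuses the same weights: 8..0


def apply_blocks(blocks, x, schedule):
    # Each logical layer indexes into the shared block list, so the second half
    # of the network reuses the same tensors with a mirrored skip structure.
    for idx in schedule:
        x = blocks[idx](x)
    return x


assert len(mirrored_schedule(UNIQUE_LAYERS)) == NUM_LAYERS
assert mirrored_schedule(3) == [0, 1, 2, 2, 1, 0]
```
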
Why this is interesting:
- The challenge explicitly invites parameter tying and recurrent depth.
- This variant adds logical depth and compute without adding a second set of block weights.
- The resulting compressed artifact stays comfortably under the 16 MB cap.

Configuration:
- Hardware: Apple `M5 Max`, MLX `0.31.1`
- Data: published `fineweb10B_sp1024` export, full validation split, `1/195` training shards
- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=18 UNIQUE_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2` (a rough parameter-count sketch follows below)
- Tied embeddings: `TIE_EMBEDDINGS=1`
- Batching: `TRAIN_BATCH_TOKENS=8192 TRAIN_SEQ_LEN=1024 VAL_BATCH_SIZE=131072`
- Training length: `ITERATIONS=300`

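As a rough sanity check on the layout, here is a back-of-envelope count of unique parameters; the MLP shape, bias handling, and norm parameters are assumptions, since `train_gpt.py` itself is not shown in this diff:

```python
# Back-of-envelope parameter estimate for the layout above, under assumptions
# not taken from train_gpt.py: no biases, a plain (non-gated) MLP, and norm
# parameters ignored.
VOCAB_SIZE, MODEL_DIM, MLP_MULT = 1024, 512, 2
NUM_HEADS, NUM_KV_HEADS, UNIQUE_LAYERS = 8, 4, 9

head_dim = MODEL_DIM // NUM_HEADS                     # 64
attn = 2 * MODEL_DIM * NUM_HEADS * head_dim \
     + 2 * MODEL_DIM * NUM_KV_HEADS * head_dim        # Q,O plus K,V projections
mlp = 2 * MODEL_DIM * MLP_MULT * MODEL_DIM            # up and down projections
per_block = attn + mlp                                # ~1.84M parameters
embeddings = VOCAB_SIZE * MODEL_DIM                   # tied input/output, ~0.5M

total = UNIQUE_LAYERS * per_block + embeddings
print(f"~{total / 1e6:.1f}M unique parameters")       # roughly 17M
```

The 18 logical layers add compute but no parameters beyond these 9 unique blocks; the on-disk artifact size then depends on the int8 quantization and zlib compression reported below.
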
Command:
```bash
RUN_ID=mirrorrec_18l_9u_300it_fix1 \
ITERATIONS=300 \
MAX_WALLCLOCK_SECONDS=0 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=131072 \
TRAIN_LOG_EVERY=50 \
NUM_LAYERS=18 \
UNIQUE_LAYERS=9 \
python train_gpt.py
```

Key metrics:
- Pre-quant eval: `val_loss:3.7694`, `val_bpb:2.2325`
- Post-quant roundtrip eval: `val_loss:3.77618886`, `val_bpb:2.23647175` (the loss-to-bpb conversion is sketched below)
- Train time: `295399ms` (`step_avg:984.66ms`)
- Serialized model int8+zlib: `7990030 bytes`
- Code size: `50818 bytes`
- Total submission size int8+zlib: `8040848 bytes`

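The reported `val_bpb` values are consistent with the standard nats-to-bits-per-byte conversion; the fixed bytes-per-token ratio below is backed out of the reported numbers rather than taken from the data export, so treat it as illustrative:

```python
import math

# Hedged sketch of the usual loss-to-bits-per-byte conversion, assuming
# bpb = (loss_in_nats / ln 2) / bytes_per_token for this tokenizer and split.
def bits_per_byte(val_loss_nats, bytes_per_token):
    return val_loss_nats / math.log(2) / bytes_per_token

# Back the implied bytes-per-token ratio out of the reported pre-quant numbers...
bytes_per_token = 3.7694 / math.log(2) / 2.2325       # ~2.44 bytes per sp1024 token
# ...and check that it reproduces the reported post-quant bpb from the post-quant loss.
print(bits_per_byte(3.77618886, bytes_per_token))     # ~2.2365 vs reported 2.23647175
```
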
Notes:
- This is not a record-track claim. It is a local non-record experiment intended to test whether mirrored block reuse is a productive direction under the parameter cap.
- The final script includes the serialization fix needed for recurrent schedules: only tensor state is exported and quantized (a generic sketch of this filtering follows below).

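The tensor-only export mentioned above could look roughly like the following; this is a generic sketch using NumPy stand-ins for the model state, not the actual `train_gpt.py` serializer:

```python
import zlib
import numpy as np

# Generic sketch of the tensor-only export (NumPy stand-ins, not the actual
# train_gpt.py serializer): keep tensor leaves, drop Python schedule lists,
# then int8-quantize and zlib-compress.
state = {
    "blocks.0.attn.wq": np.random.randn(512, 512).astype(np.float32),
    "schedule": [0, 1, 2, 2, 1, 0],  # recurrent control flow: must not be serialized
}

tensors = {k: v for k, v in state.items() if isinstance(v, np.ndarray)}

payload = bytearray()
for name, w in sorted(tensors.items()):
    scale = max(float(np.abs(w).max()) / 127.0, 1e-8)   # a real exporter also stores scales/shapes
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    payload += q.tobytes()

blob = zlib.compress(bytes(payload), level=9)
print(f"{len(blob)} compressed bytes for {len(tensors)} tensor(s)")
```
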
11 changes: 11 additions & 0 deletions

records/track_non_record_16mb/2026-03-19_MirrorRecurrence_MLX_M5Max_sp1024/submission.json
{
  "author": "Clay Schubiner",
  "github_id": "cschubiner",
  "name": "Mirror Recurrence (18 logical / 9 unique)",
  "blurb": "Non-record Apple Silicon MLX run that mirrors 9 unique transformer blocks across 18 logical layers, keeping the int8+zlib artifact under 16 MB while testing recurrent depth under the challenge parameter cap.",
  "date": "2026-03-19T06:39:53Z",
  "val_loss": 3.77618886,
  "val_bpb": 2.23647175,
  "bytes_total": 8040848,
  "bytes_code": 50818
}
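A small sanity check one could run against this metadata; the field names come from the JSON above, while the exact cap definition (binary vs decimal megabytes) is an assumption not stated in this PR:

```python
import json

# Hedged check of the submission metadata; the 16 MB cap is assumed here to
# mean 16 * 1024 * 1024 bytes, which this PR does not state explicitly.
with open("records/track_non_record_16mb/"
          "2026-03-19_MirrorRecurrence_MLX_M5Max_sp1024/submission.json") as f:
    submission = json.load(f)

CAP_BYTES = 16 * 1024 * 1024
assert submission["bytes_total"] <= CAP_BYTES, "artifact exceeds the 16 MB cap"
assert submission["bytes_code"] < submission["bytes_total"]
print(f'{submission["bytes_total"]:,} of {CAP_BYTES:,} bytes used')
```
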
This README says the run used the published `fineweb10B_sp1024` export with `1/195` train shards, but it never identifies which shard was kept or how `DATA_PATH` was prepared. The checked-in `train.log` shows the actual run only saw one shard (`train_shards:1/195`), so rerunning the documented command against a normal `fineweb10B_sp1024` export will train on all 195 shards and produce a materially different experiment. As written, the submission is not reproducible.
fineweb10B_sp1024export with1/195train shards, but it never identifies which shard was kept or howDATA_PATHwas prepared. The checked-intrain.logshows the actual run only saw one shard (train_shards:1/195), so rerunning the documented command against a normalfineweb10B_sp1024export will train on all 195 shards and produce a materially different experiment. As written, the submission is not reproducible.Useful? React with 👍 / 👎.