MirrorLoop HRC + LexLoRE non-record submission #2004

Open

corbensorenson wants to merge 9 commits into openai:main from
Non-record / art submission
This PR adds a non-record/art submission for a novel 16MB architecture family: MirrorLoop HRC + LexLoRE.
This is not claiming an official leaderboard record. It is an art/non-record submission with honest evidence from local runs, a 1xH100 scout, and a very limited 8xH100 window.
The broader experiment repository is public here:
https://github.com/corbensorenson/parameter-golf-experiments
Compute context
I applied for compute support but did not receive any email response before the deadline window. The 8xH100 results here came from an approximately one-hour self-funded RunPod 8xH100 rental. That was all the 8x time/funds I had available, so the 8x rows should be read as narrow evidence from a constrained window, not as a fully tuned multi-seed official record attempt.
Best preserved results
Best under-cap 1xH100 scout at PR-open time:

| Run | Val BPB | Steps | Step time | Artifact bytes |
| --- | --- | --- | --- | --- |
| `h100_batch32k_d704e832_w2200_q8_coreattn1_lqer10t20_vocabmoe_qk55` | 1.35692129 | 5018 | 119.57 ms | 15,658,145 |

Best completed under-cap 8xH100 row from the one-hour self-funded follow-up:

| Run | Exported BPB | Raw BPB | Step time | Artifact bytes | Steps |
| --- | --- | --- | --- | --- | --- |
| `final8x_legal_196k_r2_d704e768_w2200_wd02_lqer6t12_vocabmoe_qk55` | 1.35496419 | 1.3191665 | 890.13 ms | 15,989,749 | 10,251 |

The first 8x e832 row used the cluster well but exceeded the decimal artifact cap:

| Run | Exported BPB | Raw BPB | Step time | Artifact bytes |
| --- | --- | --- | --- | --- |
| `final8x_196k_r2_d704e832_w2200_wd02_lqer8t16_vocabmoe_qk55` | 1.35704747 | 1.3174662 | 890.54 ms | 16,413,081 |
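To make the under-cap and over-cap labels above concrete, here is a minimal sketch of the check, assuming the cap is the decimal 16MB (16,000,000 bytes, not 16 * 2**20) size of the exported artifact file; the helper name is hypothetical, not from the repository:

```python
import os

# Assumed decimal 16MB artifact cap.
CAP_BYTES = 16_000_000

def is_under_cap(artifact_path: str) -> bool:
    """Return True if the exported artifact file fits under the decimal cap."""
    return os.path.getsize(artifact_path) <= CAP_BYTES
```

Against the rows above, 15,989,749 bytes passes this check and 16,413,081 bytes fails it.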
What the 8xH100 test did and did not show

The 8x run is useful mostly as a negative result. It showed that the code can use the 8xH100 pod efficiently, but it did not unlock a large final exported-loss improvement. Best legal 1x was 1.35692129; best legal 8x was 1.35496419, a gain of only about 0.002 BPB. That small gain suggests the current MirrorLoop/LexLoRE spine is not simply wall-clock limited. The binding issues look more like architecture capacity, the export/compression gap, and the 16MB artifact constraint.

The e832 result is still included because it is useful architecture/systems evidence, but it is not a legal under-cap artifact. After seeing that cap miss, I stopped the same-shape higher-LQER rows and moved to e768 legalizer rows.
Included 8xH100 logs
- `logs/8xh100_runpod_final8x_20260430_185628/` - live snapshot while the first 8x matrix was running.
- `logs/8xh100_runpod_final8x_20260430_185628_completed1/` - completed first e832 row plus stopped partial second row.
- `logs/8xh100_runpod_legalfallback_20260430_191032_completed1/` - first completed under-cap e768 legalizer row.
- `logs/8xh100_runpod_legalfallback_20260430_191032_completed2/` - first two completed e768 legalizer rows, including the current best under-cap 8x result.

What is novel here
In plain terms, this is not a standard stack of unique transformer layers. It uses an explicitly routed mirrored recurrent circuit. The project called this HRC; the README defines that as an hourglass recurrent circuit, with the exact layer routing listed under Main ingredients below.
It also uses LexLoRE, implemented under the older `VOCAB_MOE_*` flag names: small token-conditioned low-rank residual experts at the `input` and `loop_first` sites. This is not a full sparse MoE; it is a lightweight lexical adapter bank.
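As a rough illustration only, here is a minimal sketch of a token-conditioned low-rank residual adapter bank in this spirit; the class name, shapes, and the modulo bucketing scheme are assumptions for illustration, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class LexLoREBank(nn.Module):
    """Hypothetical sketch: per-token-bucket low-rank residual adapters.

    Each token id maps to a small expert bucket; each bucket owns a rank-r
    down/up projection whose output is added to the residual stream.
    Routing is a cheap deterministic function of the token id, not a
    learned sparse-MoE gate.
    """

    def __init__(self, d_model: int, n_experts: int = 16, rank: int = 4):
        super().__init__()
        self.n_experts = n_experts
        # (n_experts, d_model, rank) down-projections, (n_experts, rank, d_model) up-projections.
        self.down = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.02)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, d_model))

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); token_ids: (batch, seq)
        bucket = token_ids % self.n_experts           # cheap token -> expert map
        down = self.down[bucket]                      # (batch, seq, d_model, rank)
        up = self.up[bucket]                          # (batch, seq, rank, d_model)
        h = torch.einsum("bsd,bsdr->bsr", x, down)    # rank-r bottleneck
        return x + torch.einsum("bsr,bsrd->bsd", h, up)
```

Initializing the up-projections to zero makes each adapter start as an identity residual, a common choice for LoRA-style modules.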
Main ingredients

- Layer routing `012 | 34567 | 34567 | 210`: three unique outer layers routed down, a five-layer core executed twice, then the same outer layers reused in mirrored order (the hourglass shape; see the sketch after this list).
- LexLoRE residual experts at the `input` and `loop_first` sites.
- QK scale 5.5 (the `qk55` suffix in the run names).
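As a non-authoritative sketch of what that routing string could mean in code (the class name and routing parser are assumptions, not the repository's implementation):

```python
import torch.nn as nn

class HourglassRecurrentCircuit(nn.Module):
    """Hypothetical sketch: execute shared blocks per a routing string.

    "012 | 34567 | 34567 | 210" -> blocks 0-2 down, the 3-7 core twice,
    then blocks 2, 1, 0 again in mirrored order, all sharing weights.
    """

    def __init__(self, blocks: nn.ModuleList,
                 routing: str = "012 | 34567 | 34567 | 210"):
        super().__init__()
        self.blocks = blocks  # 8 unique transformer blocks, indices 0..7
        # Flatten the routing string into one sequence of block indices.
        self.schedule = [int(c) for seg in routing.split("|") for c in seg.strip()]

    def forward(self, x):
        for idx in self.schedule:
            x = self.blocks[idx](x)  # weight sharing: a block may run several times
        return x
```

The point of such a schedule is that effective depth (16 block applications here) exceeds the 8 unique blocks' parameter count, which is what matters under a 16MB artifact cap.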
Validation performed

- `python -m py_compile` on `train_gpt.py` and helper modules
- `python -m json.tool submission.json`
- `bash -n run_1xh100_best.sh`
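The same smoke checks can be scripted; a minimal sketch, assuming the files sit in the repository root:

```python
import json
import py_compile
import subprocess

# Syntax-compile the training script (raises PyCompileError on failure).
py_compile.compile("train_gpt.py", doraise=True)

# Confirm the submission metadata is well-formed JSON.
with open("submission.json") as f:
    json.load(f)

# bash -n parses the launch script without executing it.
subprocess.run(["bash", "-n", "run_1xh100_best.sh"], check=True)
```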
Claims and caveats

Claimed:

- 1.35692129 BPB, best legal under-cap 1xH100 result
- 1.35496419 BPB, best legal under-cap 8xH100 result

Not claimed:

- an official leaderboard record, or a fully tuned multi-seed record attempt
The README is intentionally explicit about these limitations so the submission is easy to review and does not hide the weaker parts of the evidence.