Faster RoPE+QK-norm backwards and switching MLP learning rate multiplier #492
Closed
chrisjmccormick wants to merge 5 commits into karpathy:master from
Conversation
Author
Karpathy tested these ideas and didn't see benefit from them, so I'm closing this.
This PR includes two changes:
- Using `chunk` instead of `slice` in the RoPE+QK-norm backward, which improved the backward kernel.
- Switching the Muon 2x learning rate multiplier from the MLP input (`c_fc`) to the output (`c_proj`).

I ran `d24` on an 8xH100 and saw improvements to both time and CORE. A fun detail is that this was run on a $7.51/hr. spot instance, so this trained GPT-2 for ~$22.50. :)
A confounding factor is that I only downloaded 100 shards, so this was run on 3 epochs of the first (2.5B?) tokens of the dataset. The `lr` change was made to modded back in December and improved loss there, so it will be interesting to re-run this to see whether it was the dataset or the `lr` that delivered the score improvement.
This was also run using the dataloader in my previous PR (including the change to base_train.py where we disable the gc and manually collect every 2000 steps; see the sketch below). I see that an updated dataloader was committed last night, so we'll need to re-time this.
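For reference, the gc change amounts to something like the following (a minimal sketch; `train_step` and `num_steps` are illustrative stand-ins, not the actual base_train.py code):

```python
import gc

def train_step():
    pass  # stand-in for forward/backward/optimizer update

gc.disable()  # avoid unpredictable GC pauses landing mid-step

num_steps = 10_000  # illustrative
for step in range(num_steps):
    train_step()
    if step % 2000 == 0:
        gc.collect()  # pay the collection cost at a predictable point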
Here was my run command. It starts by dumping the contents of the key files into the top of the output log (in the style of modded-nanogpt).
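As a rough sketch of the file-dumping idea (the `dump_sources` helper and the file list here are hypothetical, loosely in the style of modded-nanogpt's startup logging):

```python
from pathlib import Path

# Hypothetical list of key files to snapshot at the top of the log.
KEY_FILES = ["base_train.py", "nanochat/gpt.py"]

def dump_sources(log_path):
    """Append the contents of each key file to the log, with separators."""
    with open(log_path, "a") as log:
        for fname in KEY_FILES:
            log.write(f"\n{'=' * 20} {fname} {'=' * 20}\n")
            log.write(Path(fname).read_text())

dump_sources("train.log")
```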
I've included the log (and a number of other reference files) in a `logs` dir for reference--not intending for these to be added to the repo.
SFT Results
How do these look? I haven't compared them yet.
Chunk vs. Slice
Comparing nanochat's attention implementation to an older version of modded-nanogpt, I noticed that modded uses the `chunk` operator to split each head in half, rather than slicing out the two halves. This PR makes the same change; a sketch of the before and after follows.
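The exact nanochat code isn't reproduced here, but the change is roughly the following (illustrative tensor shapes):

```python
import torch

x = torch.randn(8, 16, 64)  # (batch, seq, head_dim); illustrative shapes
D = x.size(-1)

# Before: slicing produces two offset views, which the compiler
# handled with a slower backward.
x1, x2 = x[..., : D // 2], x[..., D // 2 :]

# After: chunk expresses the same split as a single op, which led
# torch.compile to a more efficient backward implementation.
x1, x2 = x.chunk(2, dim=-1)
```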
This led to the compiler choosing a more efficient implementation of the backward pass, which you can see in the traces below:
LR Multiplier
The original Muon implementation by Keller Jordan includes a peculiar heuristic which applies a learning rate multiplier to matrices based on their memory layout. Memory layout has no effect on the math, so presumably this heuristic was written with some assumptions about how certain weight matrices will be stored in memory.
Where it came from
Here are my two theories on the heuristic:
First, there is a convention for classic MLPs where you apply a learning rate multiplier based on the ratio of "fan-in" vs. "fan-out".
However, this convention would dictate that the LR would be twice as high for an FFN's output weights (as in this PR).
Second, based on comments in his original Muon repo and his CIFAR-10 speedrun, Keller also worked in CV and applied Muon to CNNs; perhaps this heuristic works better there?
In modded-nanogpt
I applied this change in a PR to modded-nanogpt, and it had an interesting effect on the loss curve which suggests that it may correct for some overfitting caused by the original 2x-lr-on-mlp-input configuration:
(Amusingly, I actually wasn't aware that I had changed the MLP lr configuration until Larry Dial pointed it out. I incorrectly attributed the improvement to a different part of the PR at the time.)
How to Address it
To implement the change for the speedrun, I had Claude flip the heuristic; a sketch of the before and after is below.
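From memory, the heuristic in the Muon update looks something like the first `adjusted_lr` line below, and the flip swaps the aspect ratio (treat this as a sketch rather than the exact diff):

```python
import torch.nn as nn

lr = 0.02  # illustrative Muon learning rate
p = nn.Linear(1024, 4096, bias=False).weight  # c_fc-like, shape (4096, 1024)

# Original: boost LR for matrices with more rows than columns.
# Here max(1, 4096 / 1024) ** 0.5 == 2, so c_fc gets 2x and
# c_proj (shape (1024, 4096)) gets 1x.
adjusted_lr = lr * max(1, p.size(0) / p.size(1)) ** 0.5

# Flipped: boost wide matrices instead, moving the 2x from the
# MLP input (c_fc) to the output (c_proj).
adjusted_lr = lr * max(1, p.size(1) / p.size(0)) ** 0.5
```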
However, I'd propose that we remove the heuristic altogether--the LR shouldn't be based on memory layout--and instead specify the multiplier explicitly.
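One direction (purely a sketch, with hypothetical names) would be to attach an explicit multiplier to each parameter group, so the choice is visible in the config rather than derived from shape:

```python
import torch.nn as nn

# Toy stand-ins for the transformer's MLP projections.
c_fc = nn.Linear(1024, 4096, bias=False)
c_proj = nn.Linear(4096, 1024, bias=False)

# The "lr_mul" key is hypothetical; the optimizer would read it
# when computing each group's step size.
param_groups = [
    {"params": [c_proj.weight], "lr_mul": 2.0},
    {"params": [c_fc.weight], "lr_mul": 1.0},
]
```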
Logs
I added a `logs` directory with files from the run just for reference--not proposing that we add these to the repo.
I don't think the report generated correctly, probably due to how I ran things, but I included the markdown files that looked correct.
I've shared the weights on huggingface here:
https://huggingface.co/ChrisMcCormick/nanochat-d24-2026-02-02/
Step Count Experiments
The improved CORE score suggests that we should be able to reduce the dataset size / step count, but experimenting with that would be expensive! Any ideas on how to go about that / how that part of the speedrun will work?