Faster RoPE+QK-norm backwards and switching MLP learning rate multiplier #492

Closed
chrisjmccormick wants to merge 5 commits into karpathy:master from chrisjmccormick:rope-chunk-mlp-lr

Conversation

chrisjmccormick commented on Feb 2, 2026

This PR includes two changes:

  1. Changing RoPE to use chunk instead of slice, which improved the backward kernel.
  2. Moving the 2x learning rate multiplier from the MLP input projection (c_fc) to the output projection (c_proj).

I ran d24 on an 8xH100 and saw improvements to both time and CORE:

Step 16704 | CORE metric: 0.2633
Total training time: 179.44m
Minimum validation bpb: 0.753336

A fun detail is that this was run on a $7.51/hr. spot instance, so this trained GPT-2 for ~$22.50. :)

A confounding factor is that I only downloaded 100 shards, so this was run on 3 epochs of the first (2.5B?) tokens of the dataset. The lr change was made to modded-nanogpt back in December and improved loss there, but it will be interesting to re-run this to see whether it was the dataset or the lr change that delivered the score improvement.

This was also run using the dataloader in my previous PR (including the change to base_train.py where we disable the gc and manually collect every 2000 steps). I see that an updated dataloader was committed last night, so we'll need to re-time this.
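
(For reference, the gc change amounts to the pattern below; this is a sketch of the idea rather than the exact diff from that PR.)

import gc

def train_step(step):
    pass  # placeholder for the forward / backward / optimizer work

gc.disable()  # avoid unpredictable pauses from Python's cyclic collector mid-step

num_iterations = 10_000  # illustrative
for step in range(num_iterations):
    train_step(step)
    if step % 2000 == 0:
        gc.collect()  # reclaim garbage at a predictable, chosen point instead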

Here was my run command. It starts by dumping the contents of the key files into the top of the output log (in the style of modded-nanogpt).

( cat ./nanochat/gpt.py; cat ./nanochat/optim.py; cat ./nanochat/dataloader.py; cat ./scripts/base_train.py; echo -e "\n\n===== TRAINING OUTPUT =====\n\n"; OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=24 \
    --run=d24-feb01 \
    --model-tag=d24_feb01 \
    --device-batch-size=16 \
    --sample-every=-1 \
    --save-every=-1 \
    --core-metric-max-per-task=-1 \
    --core-metric-every=3000 \
    --target-param-data-ratio=12 ) \
  2>&1 | tee ./logs/speedrun_d24_feb01-rope_chunk_mlp_lr_1x2x.log

I've included the log (and a number of other reference files) in a log dir for reference--not intending for these to be added to the repo.

SFT Results

How do these look? I haven't compared them yet.

Benchmark Accuracy
ARC-Easy 49.03%
ARC-Challenge 38.48%
MMLU 34.80%
GSM8K 4.70%
HumanEval 14.63%
SpellingBee 98.83%

Chunk vs. Slice

Comparing nanochat's attention implementation to an older version of modded-nanogpt, I noticed that modded uses the chunk operator to split each head's feature dimension in half. So instead of:

d = x.shape[3] // 2
x1, x2 = x[..., :d], x[..., d:] # split up last dim into two halves

This PR does:

x1, x2 = x.chunk(2, dim=-1)  # split head_dim into two halves

This led to the compiler choosing a more efficient implementation of the backward pass, which you can see in the traces below:

[image: backward-pass kernel traces for the slice vs. chunk versions]
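
For context, here is a minimal sketch of what the chunk-based RoPE application looks like end-to-end. It is illustrative only (the shapes and rotation convention are assumptions, not nanochat's exact code); the slice version is semantically identical, so the win is purely in what torch.compile generates for the backward pass:

import torch

def apply_rotary_emb(x, cos, sin):
    # x: (B, T, n_head, head_dim); cos, sin: broadcastable to (1, T, 1, head_dim // 2)
    x1, x2 = x.chunk(2, dim=-1)        # two contiguous halves of head_dim
    y1 = x1 * cos - x2 * sin           # rotate each (x1, x2) feature pair
    y2 = x1 * sin + x2 * cos
    return torch.cat([y1, y2], dim=-1)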

LR Multiplier

The original Muon implementation by Keller Jordan includes a peculiar heuristic which applies a learning rate multiplier to matrices based on their shape in memory (tall vs. wide). The storage orientation has no effect on the math, so presumably this heuristic was written with some assumptions about how certain weight matrices would be stored.

Where it came from

Here are my two theories on the heuristic:

First, there is a convention for classic MLPs where you apply a learning rate multiplier based on the ratio of "fan-in" vs. "fan-out".

However, this convention would dictate that the LR would be twice as high for an FFN's output weights (as in this PR).

Second, based on comments in his original Muon repo and his CIFAR-10 speedrun, Keller also worked in computer vision and applied Muon to CNNs; perhaps this heuristic works better there.

In modded-nanogpt

I applied this change in a PR to modded-nanogpt, and it had an interesting effect on the loss curve which suggests that it may correct for some overfitting caused by the original 2x-lr-on-mlp-input configuration:

[image: loss curves before and after moving the 2x multiplier to the MLP output]

(Amusingly, I actually wasn't aware that I had changed the MLP lr configuration until Larry Dial pointed it out. I incorrectly attributed the improvement to a different part of the PR at the time.)

How to Address it

To implement the change for the speedrun, I had Claude flip the heuristic, from:

self._muon_lr_t.fill_(group["lr"] * max(1.0, shape[-2] / shape[-1])**0.5)

To:

# Shape-based LR scaling (flipped from original):
# - Tall matrices (input projections like c_fc): 1x
# - Wide matrices (output projections like c_proj): sqrt(cols/rows) → 2x for 1:4
ratio = shape[-2] / shape[-1]
lr_mult = 1.0 if ratio >= 1 else ratio**-0.5
self._muon_lr_t.fill_(group["lr"] * lr_mult)
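
As a concrete check, here is what the two rules assign for a 4x MLP (a sketch assuming d_model = 768 and nn.Linear's (out_features, in_features) weight layout):

# Worked example of the original vs. flipped heuristic (shapes are illustrative).
shapes = {"c_fc": (3072, 768), "c_proj": (768, 3072)}  # (rows, cols) = (out, in)

for name, (rows, cols) in shapes.items():
    original = max(1.0, rows / cols) ** 0.5         # 2.0x for c_fc, 1.0x for c_proj
    ratio = rows / cols
    flipped = 1.0 if ratio >= 1 else ratio ** -0.5  # 1.0x for c_fc, 2.0x for c_proj
    print(f"{name}: original={original:.1f}x, flipped={flipped:.1f}x")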

However, I'd propose that we remove the heuristic altogether--the LR shouldn't be based on memory layout--and find a way to specify this more manually / directly.
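
One way to do that, sketched below, is to select the MLP output projections explicitly and give them their own optimizer param group. The "mlp.c_proj" name filter and the helper itself are assumptions for illustration, not nanochat's actual API:

def build_muon_param_groups(model, base_lr=0.02):
    # Hypothetical helper: state the 2x multiplier explicitly rather than
    # inferring it from matrix shape. Name filters are illustrative.
    matrix_params, mlp_out_params = [], []
    for name, p in model.named_parameters():
        if p.ndim < 2 or "wte" in name or "lm_head" in name:
            continue  # embeddings / head typically go to a different optimizer
        (mlp_out_params if "mlp.c_proj" in name else matrix_params).append(p)
    return [
        dict(params=matrix_params, lr=base_lr),         # 1x for attention and c_fc
        dict(params=mlp_out_params, lr=2.0 * base_lr),  # explicit 2x on MLP outputs
    ]

These groups could then be passed to the Muon optimizer, and the shape-based multiplier above would go away.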

Logs

I added a logs directory with files from the run just for reference--not proposing that we add these to the repo.

I don't think the report generated correctly, probably due to how I ran things, but I included the markdown files that looked correct.

I've shared the weights on huggingface here:
https://huggingface.co/ChrisMcCormick/nanochat-d24-2026-02-02/

Step Count Experiments

The improved CORE score suggests that we should be able to reduce the dataset size / step count, but experimenting with that would be expensive! Any ideas on how to go about that / how that part of the speedrun will work?

chrisjmccormick (Author) commented:

Karpathy tested these ideas and didn't see a benefit from them, so I'm closing this.
I ran my experiments prior to the FP8 training change, and I've confirmed on my end as well that the RoPE change no longer provides a benefit.
