Faster RoPE+QK-norm backwards and switching MLP learning rate multiplier #492

Closed
chrisjmccormick wants to merge 5 commits into karpathy:master from chrisjmccormick:rope-chunk-mlp-lr

Conversation

chrisjmccormick commented on Feb 2, 2026

This PR includes two changes:

  1. Changing RoPE to use chunk instead of slice, which improved the backward kernel.
  2. Moving the 2x learning rate multiplier from the MLP input projection (c_fc) to the output projection (c_proj).

I ran d24 on an 8xH100 and saw improvements to both time and CORE:

Step 16704 | CORE metric: 0.2633
Total training time: 179.44m
Minimum validation bpb: 0.753336

A fun detail is that this was run on a $7.51/hr. spot instance, so this trained GPT-2 for ~$22.50. :)

A confounding factor is that I only downloaded 100 shards, so this was run on 3 epochs of the first (2.5B?) tokens of the dataset. The lr change was made to modded-nanogpt back in December and improved loss there, but it will be interesting to re-run this to see whether it was the dataset or the lr change that delivered the score improvement.

This was also run using the dataloader in my previous PR (including the change to base_train.py where we disable the gc and manually collect every 2000 steps). I see that an updated dataloader was committed last night, so we'll need to re-time this.
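
(For reference, the gc change amounts to the pattern below; this is a sketch of the idea rather than the exact diff from that PR.)

import gc

def train_step(step):
    pass  # placeholder for the forward / backward / optimizer work

gc.disable()  # avoid unpredictable pauses from Python's cyclic collector mid-step

num_iterations = 10_000  # illustrative
for step in range(num_iterations):
    train_step(step)
    if step % 2000 == 0:
        gc.collect()  # reclaim garbage at a predictable, chosen point instead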

Here was my run command. It starts by dumping the contents of the key files into the top of the output log (in the style of modded-nanogpt).

( cat ./nanochat/gpt.py; cat ./nanochat/optim.py; cat ./nanochat/dataloader.py; cat ./scripts/base_train.py; echo -e "\n\n===== TRAINING OUTPUT =====\n\n"; OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=24 \
    --run=d24-feb01 \
    --model-tag=d24_feb01 \
    --device-batch-size=16 \
    --sample-every=-1 \
    --save-every=-1 \
    --core-metric-max-per-task=-1 \
    --core-metric-every=3000 \
    --target-param-data-ratio=12 ) \
  2>&1 | tee ./logs/speedrun_d24_feb01-rope_chunk_mlp_lr_1x2x.log

I've included the log (and a number of other reference files) in a log dir for reference--not intending for these to be added to the repo.

SFT Results

How do these look? I haven't compared them yet.

Benchmark Accuracy
ARC-Easy 49.03%
ARC-Challenge 38.48%
MMLU 34.80%
GSM8K 4.70%
HumanEval 14.63%
SpellingBee 98.83%

Chunk vs. Slice

Comparing nanochat's attention implementation to an older version of modded-nanogpt, I noticed that modded uses the chunk operator to split each head's feature dimension in half. So instead of:

d = x.shape[3] // 2
x1, x2 = x[..., :d], x[..., d:] # split up last dim into two halves

This PR does:

x1, x2 = x.chunk(2, dim=-1)  # split head_dim into two halves

This led to the compiler choosing a more efficient implementation of the backward pass, which you can see in the traces below:

[image: backward-pass kernel traces for the slice vs. chunk versions]
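
For context, here is a minimal sketch of what the chunk-based RoPE application looks like end-to-end. It is illustrative only (the shapes and rotation convention are assumptions, not nanochat's exact code); the slice version is semantically identical, so the win is purely in what torch.compile generates for the backward pass:

import torch

def apply_rotary_emb(x, cos, sin):
    # x: (B, T, n_head, head_dim); cos, sin: broadcastable to (1, T, 1, head_dim // 2)
    x1, x2 = x.chunk(2, dim=-1)        # two contiguous halves of head_dim
    y1 = x1 * cos - x2 * sin           # rotate each (x1, x2) feature pair
    y2 = x1 * sin + x2 * cos
    return torch.cat([y1, y2], dim=-1)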

LR Multiplier

The original Muon implementation by Keller Jordan includes a peculiar heuristic which applies a learning rate multiplier to matrices based on their shape in memory (tall vs. wide). The storage orientation has no effect on the math, so presumably this heuristic was written with some assumptions about how certain weight matrices would be stored.

Where it came from

Here are my two theories on the heuristic:

First, there is a convention for classic MLPs where you apply a learning rate multiplier based on the ratio of "fan-in" vs. "fan-out".

However, this convention would dictate that the LR would be twice as high for an FFN's output weights (as in this PR).

Second, based on comments in his original Muon repo and his CIFAR-10 speedrun, Keller also worked in computer vision and applied Muon to CNNs; perhaps this heuristic works better there.

In modded-nanogpt

I applied this change in a PR to modded-nanogpt, and it had an interesting effect on the loss curve which suggests that it may correct for some overfitting caused by the original 2x-lr-on-mlp-input configuration:

[image: loss curves before and after moving the 2x multiplier to the MLP output]

(Amusingly, I actually wasn't aware that I had changed the MLP lr configuration until Larry Dial pointed it out. I incorrectly attributed the improvement to a different part of the PR at the time.)

How to Address it

To implement the change for the speedrun, I had Claude flip the heuristic, from:

self._muon_lr_t.fill_(group["lr"] * max(1.0, shape[-2] / shape[-1])**0.5)

To:

# Shape-based LR scaling (flipped from original):
# - Tall matrices (input projections like c_fc): 1x
# - Wide matrices (output projections like c_proj): sqrt(cols/rows) → 2x for 1:4
ratio = shape[-2] / shape[-1]
lr_mult = 1.0 if ratio >= 1 else ratio**-0.5
self._muon_lr_t.fill_(group["lr"] * lr_mult)
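
As a concrete check, here is what the two rules assign for a 4x MLP (a sketch assuming d_model = 768 and nn.Linear's (out_features, in_features) weight layout):

# Worked example of the original vs. flipped heuristic (shapes are illustrative).
shapes = {"c_fc": (3072, 768), "c_proj": (768, 3072)}  # (rows, cols) = (out, in)

for name, (rows, cols) in shapes.items():
    original = max(1.0, rows / cols) ** 0.5         # 2.0x for c_fc, 1.0x for c_proj
    ratio = rows / cols
    flipped = 1.0 if ratio >= 1 else ratio ** -0.5  # 1.0x for c_fc, 2.0x for c_proj
    print(f"{name}: original={original:.1f}x, flipped={flipped:.1f}x")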

However, I'd propose that we remove the heuristic altogether--the LR shouldn't be based on memory layout--and find a way to specify this more manually / directly.
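
One way to do that, sketched below, is to select the MLP output projections explicitly and give them their own optimizer param group. The "mlp.c_proj" name filter and the helper itself are assumptions for illustration, not nanochat's actual API:

def build_muon_param_groups(model, base_lr=0.02):
    # Hypothetical helper: state the 2x multiplier explicitly rather than
    # inferring it from matrix shape. Name filters are illustrative.
    matrix_params, mlp_out_params = [], []
    for name, p in model.named_parameters():
        if p.ndim < 2 or "wte" in name or "lm_head" in name:
            continue  # embeddings / head typically go to a different optimizer
        (mlp_out_params if "mlp.c_proj" in name else matrix_params).append(p)
    return [
        dict(params=matrix_params, lr=base_lr),         # 1x for attention and c_fc
        dict(params=mlp_out_params, lr=2.0 * base_lr),  # explicit 2x on MLP outputs
    ]

These groups could then be passed to the Muon optimizer, and the shape-based multiplier above would go away.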

Logs

I added a logs directory with files from the run just for reference--not proposing that we add these to the repo.

I don't think the report generated correctly, probably due to how I ran things, but I included the markdown files that looked correct.

I've shared the weights on huggingface here:
https://huggingface.co/ChrisMcCormick/nanochat-d24-2026-02-02/

Step Count Experiments

The improved CORE score suggests that we should be able to reduce the dataset size / step count, but experimenting with that would be expensive! Any ideas on how to go about that / how that part of the speedrun will work?

chrisjmccormick (Author) commented:

Karpathy tested these ideas and didn't see a benefit from them, so I'm closing this.
I ran my experiments prior to the FP8 training change, and I've confirmed on my end as well that the RoPE change no longer provides a benefit.
