
Conversation

@woct0rdho
Contributor

@woct0rdho woct0rdho commented Apr 30, 2025

What does this PR do?

We avoid executing get_slice unless map_location == "meta", to improve performance when loading a model with a large number of tensors.

Even though we skip the dtype check in Python, the dtype is still checked at https://github.com/huggingface/safetensors/blob/7d5af853631628137a79341ddc5611d18a17f3fe/bindings/python/src/lib.rs#L1186
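For illustration, here is a minimal sketch of the idea (simplified, not the exact transformers code; the function name and the meta-device handling are assumptions made for this example):

import torch
from safetensors import safe_open

def load_state_dict_sketch(checkpoint_file, map_location="cpu"):
    state_dict = {}
    with safe_open(checkpoint_file, framework="pt", device="cpu") as f:
        for key in f.keys():
            if map_location == "meta":
                # Only the "meta" path needs get_slice(), which exposes
                # shape metadata without reading the tensor data.
                tensor_slice = f.get_slice(key)
                state_dict[key] = torch.empty(
                    size=tensor_slice.get_shape(), device="meta"
                )  # dtype handling elided in this sketch
            else:
                # Otherwise load the tensor directly; the dtype is still
                # validated inside the Rust bindings.
                state_dict[key] = f.get_tensor(key)
    return state_dict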

Fixes #37887

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1

@github-actions github-actions bot marked this pull request as draft April 30, 2025 21:11
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

Member

@Rocketknight1 Rocketknight1 left a comment


LGTM after review - pinging @ArthurZucker for core maintainer review and because he recently updated this code

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks! About how much of an increase in load speed did you get with this? 🤗

@ArthurZucker ArthurZucker merged commit ee25d57 into huggingface:main May 1, 2025
20 checks passed
@woct0rdho
Contributor Author

In my case, the time to load Unsloth's Qwen3-30B-A3B dropped from 15 min to 2 min. But loading a dense model of the same size takes only seconds, and only one core is used during loading, so it may still be worth looking into further speedups.

@woct0rdho woct0rdho deleted the faster-load-state-dict branch May 1, 2025 15:03
@Rocketknight1
Member

Rocketknight1 commented May 1, 2025

@woct0rdho it could definitely be worth optimizing this further. In the past, most models were either dense or MoEs with a small number of experts (e.g. Mixtral-8x7B), and so the slowdown here wasn't very important. However, I expect that models like Qwen3 or Deepseek-V3 will be more common in future, with a huge number of experts but only a small number of activated experts. These models combine high capacity with fast inference, which is very important for "reasoning" training pipelines that involve RL. On a fast SSD, a 30B model should not take 2 minutes to load, so we should consider parallel weight loading or other speedups!

cc @Narsil, is there a recommended way to speed up multiple calls to f.get_tensor()? Is there a method like get_tensors() that can fetch multiple tensors more efficiently than repeated get_tensor() calls, or can we use multithreading if those calls release the GIL?
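For reference, here is a rough sketch of what the threaded variant could look like; whether it helps at all depends on the open question above (the safetensors calls releasing the GIL during the heavy reads), and load_shards_parallel is just an illustrative name:

from concurrent.futures import ThreadPoolExecutor
from safetensors.torch import load_file

def load_shards_parallel(filenames, device="cpu", max_workers=4):
    state_dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker loads one shard; this only pays off if load_file
        # spends most of its time outside the GIL (I/O, device copies).
        for shard in pool.map(lambda fn: load_file(fn, device=device), filenames):
            state_dict.update(shard)
    return state_dict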

@Rocketknight1
Member

Hi @woct0rdho, we're investigating the slow loading issue. Can you give us more details on the disk you're loading from? For example, is it a local NVMe drive or a remote mount?

@Narsil
Contributor

Narsil commented May 2, 2025

@woct0rdho Are you running on a mounted network disk by any chance? Mounted network disks do not play well with memory mapping, and every read incurs quite a large overhead (depending on how it's mounted, you can end up with the sort of latencies you're seeing here).

On an AWS NVMe drive the model loads in ~12 s for me (4xL4).
It takes ~500 ms to load everything on CPU, and the rest of the time to move it to CUDA.
FWIW, it takes ~1-2 min to download the entire model onto that machine from S3 (with HF transfer).

I created a small repro script outside of transformers:

from huggingface_hub import hf_hub_download

import torch

from safetensors.torch import load_file
import datetime

torch.zeros((2, 2)).cuda()  # Initialize cuda runtime to skip that in measurements

filenames = []
for i in range(16):
    rfilename = f"model-{i + 1:05d}-of-00016.safetensors"
    filename = hf_hub_download("Qwen/Qwen3-30B-A3B", rfilename)
    filenames.append(filename)

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device="cpu")
    all_data.update(data)

print(f"CPU Load Took {datetime.datetime.now() - start}")

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device=f"cuda:{i % 4}")
    all_data.update(data)

print(f"Took {datetime.datetime.now() - start}")

Results (4xL4)

CPU Load Took 0:00:00.466379
Took 0:00:11.799003

The machinery in transformers follows pretty closely what safetensors does internally (with some light additional bookkeeping, but it shouldn't be that bad).

Sidenote: HDDs can also suffer from slow reads, and that can be fixed by changing f.keys() to f.ordered_keys(), because HDDs prefer sequential reads and are much more sensitive to read order than SSDs/NVMe drives.
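To illustrate that sidenote, here is a sketch of the sequential-read pattern; it assumes ordered_keys() is available on the safe_open handle and yields keys in on-disk (offset) order, as described above:

from safetensors import safe_open

def load_in_file_order(filename, device="cpu"):
    state_dict = {}
    with safe_open(filename, framework="pt", device=device) as f:
        # Iterating in file-offset order keeps reads sequential, which
        # matters far more on HDDs than on SSDs/NVMe drives.
        for key in f.ordered_keys():
            state_dict[key] = f.get_tensor(key)
    return state_dict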

@woct0rdho
Contributor Author

woct0rdho commented May 2, 2025

Good point. I was running it on AutoDL (a cloud GPU provider). There is a folder /root/autodl-tmp/ for storage. It does not look like an individual disk or an NFS-mounted partition, but it may be virtualized in some way, and I don't know the underlying physical setup.

Update: I modified your repro script a bit to load only 4 files on 1 GPU:

#!/usr/bin/env python3

import os

os.environ["HF_HOME"] = "/root/autodl-tmp/.cache"

import datetime

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

torch.zeros((2, 2)).cuda()  # Initialize cuda runtime to skip that in measurements

filenames = []
for i in range(4):
    rfilename = f"model-{i + 1:05d}-of-00016.safetensors"
    filename = hf_hub_download("Qwen/Qwen3-30B-A3B", rfilename)
    # rfilename = f"model-{i + 1:05d}-of-00004.safetensors"
    # filename = hf_hub_download("unsloth/Qwen3-30B-A3B-bnb-4bit", rfilename)
    filenames.append(filename)

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device="cpu")
    all_data.update(data)

print(f"CPU Load Took {datetime.datetime.now() - start}")

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device="cuda:0")
    all_data.update(data)

print(f"GPU Load Took {datetime.datetime.now() - start}")

The results are:

# Qwen/Qwen3-30B-A3B
CPU Load Took 0:00:00.142864
GPU Load Took 0:00:04.896957
# unsloth/Qwen3-30B-A3B-bnb-4bit
CPU Load Took 0:00:02.586595
GPU Load Took 0:00:12.870413

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
Improve performance of load_state_dict

Development

Successfully merging this pull request may close these issues.

Performance of load_state_dict with large number of tensors (Qwen3 MoE)
