
Conversation

@woct0rdho
Contributor

@woct0rdho woct0rdho commented Apr 30, 2025

What does this PR do?

We avoid executing get_slice unless map_location == "meta", to improve performance when loading a model with a large number of tensors.

Even though we skip the dtype check in Python, the dtype is still checked at https://github.com/huggingface/safetensors/blob/7d5af853631628137a79341ddc5611d18a17f3fe/bindings/python/src/lib.rs#L1186
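For illustration, here is a minimal sketch of the idea (simplified, not the exact transformers code; the function name and the meta-device handling are assumptions made for this example):

import torch
from safetensors import safe_open

def load_state_dict_sketch(checkpoint_file, map_location="cpu"):
    state_dict = {}
    with safe_open(checkpoint_file, framework="pt", device="cpu") as f:
        for key in f.keys():
            if map_location == "meta":
                # Only the "meta" path needs get_slice(), which exposes
                # shape metadata without reading the tensor data.
                tensor_slice = f.get_slice(key)
                state_dict[key] = torch.empty(
                    size=tensor_slice.get_shape(), device="meta"
                )  # dtype handling elided in this sketch
            else:
                # Otherwise load the tensor directly; the dtype is still
                # validated inside the Rust bindings.
                state_dict[key] = f.get_tensor(key)
    return state_dict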

Fixes #37887

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1

@github-actions github-actions bot marked this pull request as draft April 30, 2025 21:11
@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

Member

@Rocketknight1 Rocketknight1 left a comment


LGTM after review - pinging @ArthurZucker for core maintainer review and because he recently updated this code

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks! About how much of an increase in load speed did you get with this? 🤗

@ArthurZucker ArthurZucker merged commit ee25d57 into huggingface:main May 1, 2025
20 checks passed
@woct0rdho
Contributor Author

In my case, the time to load Unsloth's Qwen3-30B-A3B dropped from 15 min to 2 min. But loading a dense model of the same size takes only seconds, and only one core is used during loading, so it may still be worth looking into further speedups.

@woct0rdho woct0rdho deleted the faster-load-state-dict branch May 1, 2025 15:03
@Rocketknight1
Member

Rocketknight1 commented May 1, 2025

@woct0rdho it could definitely be worth optimizing this further. In the past, most models were either dense or MoEs with a small number of experts (e.g. Mixtral-8x7B), and so the slowdown here wasn't very important. However, I expect that models like Qwen3 or Deepseek-V3 will be more common in future, with a huge number of experts but only a small number of activated experts. These models combine high capacity with fast inference, which is very important for "reasoning" training pipelines that involve RL. On a fast SSD, a 30B model should not take 2 minutes to load, so we should consider parallel weight loading or other speedups!

cc @Narsil, is there a recommended way to speed up multiple calls to f.get_tensor()? Is there a method like get_tensors() that can fetch multiple tensors more efficiently than repeated get_tensor() calls, or can we use multithreading if those calls release the GIL?
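For reference, here is a rough sketch of what the threaded variant could look like; whether it helps at all depends on the open question above (the safetensors calls releasing the GIL during the heavy reads), and load_shards_parallel is just an illustrative name:

from concurrent.futures import ThreadPoolExecutor
from safetensors.torch import load_file

def load_shards_parallel(filenames, device="cpu", max_workers=4):
    state_dict = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker loads one shard; this only pays off if load_file
        # spends most of its time outside the GIL (I/O, device copies).
        for shard in pool.map(lambda fn: load_file(fn, device=device), filenames):
            state_dict.update(shard)
    return state_dict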

@Rocketknight1
Member

Hi @woct0rdho, we're investigating the slow loading issue. Can you give us more details on the disk you're loading from? For example, is it a local NVMe drive or a remote mount?

@Narsil
Contributor

Narsil commented May 2, 2025

@woct0rdho Are you running on a mounted network disk by any chance? Mounted network disks do not play well with memory mapping, and every read incurs quite a large overhead (depending on how it's mounted, you can end up with the sort of latencies you're seeing here).

On an AWS NVMe drive the model loads in ~12 s for me (4xL4).
It takes ~500 ms to load everything on CPU, and the rest of the time to move it to CUDA.
FWIW, it takes ~1-2 min to download the entire model onto that machine from S3 (with HF transfer).

I created a small repro script outside of transformers:

from huggingface_hub import hf_hub_download

import torch

from safetensors.torch import load_file
import datetime

torch.zeros((2, 2)).cuda()  # Initialize cuda runtime to skip that in measurements

filenames = []
for i in range(16):
    rfilename = f"model-{i + 1:05d}-of-00016.safetensors"
    filename = hf_hub_download("Qwen/Qwen3-30B-A3B", rfilename)
    filenames.append(filename)

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device="cpu")
    all_data.update(data)

print(f"CPU Load Took {datetime.datetime.now() - start}")

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device=f"cuda:{i % 4}")
    all_data.update(data)

print(f"Took {datetime.datetime.now() - start}")

Results (4xL4)

CPU Load Took 0:00:00.466379
Took 0:00:11.799003

The machinery in transformers follows pretty closely what safetensors does internally (with some light additional bookkeeping, but it shouldn't be that bad).

Sidenote: HDDs can also suffer from slow reads, and that can be fixed by changing f.keys() to f.ordered_keys(), because HDDs prefer sequential reads and are much more sensitive to read order than SSDs/NVMe drives.
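To illustrate that sidenote, here is a sketch of the sequential-read pattern; it assumes ordered_keys() is available on the safe_open handle and yields keys in on-disk (offset) order, as described above:

from safetensors import safe_open

def load_in_file_order(filename, device="cpu"):
    state_dict = {}
    with safe_open(filename, framework="pt", device=device) as f:
        # Iterating in file-offset order keeps reads sequential, which
        # matters far more on HDDs than on SSDs/NVMe drives.
        for key in f.ordered_keys():
            state_dict[key] = f.get_tensor(key)
    return state_dict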

@woct0rdho
Contributor Author

woct0rdho commented May 2, 2025

Good point. I was running it on AutoDL (a cloud GPU provider). There is a folder /root/autodl-tmp/ for storage. It does not look like an individual disk or an NFS-mounted partition, but it may be virtualized in some way, and I don't know the underlying physical setup.

Update: I modified your repro script a bit to load only 4 files on 1 GPU:

#!/usr/bin/env python3

import os

os.environ["HF_HOME"] = "/root/autodl-tmp/.cache"

import datetime

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

torch.zeros((2, 2)).cuda()  # Initialize cuda runtime to skip that in measurements

filenames = []
for i in range(4):
    rfilename = f"model-{i + 1:05d}-of-00016.safetensors"
    filename = hf_hub_download("Qwen/Qwen3-30B-A3B", rfilename)
    # rfilename = f"model-{i + 1:05d}-of-00004.safetensors"
    # filename = hf_hub_download("unsloth/Qwen3-30B-A3B-bnb-4bit", rfilename)
    filenames.append(filename)

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device="cpu")
    all_data.update(data)

print(f"CPU Load Took {datetime.datetime.now() - start}")

start = datetime.datetime.now()
all_data = {}
for i, filename in enumerate(filenames):
    data = load_file(filename, device="cuda:0")
    all_data.update(data)

print(f"GPU Load Took {datetime.datetime.now() - start}")

The results are:

# Qwen/Qwen3-30B-A3B
CPU Load Took 0:00:00.142864
GPU Load Took 0:00:04.896957
# unsloth/Qwen3-30B-A3B-bnb-4bit
CPU Load Took 0:00:02.586595
GPU Load Took 0:00:12.870413

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
Improve performance of load_state_dict

Development

Successfully merging this pull request may close these issues.

Performance of load_state_dict with large number of tensors (Qwen3 MoE)
