gguf-py : support lazy tensor splitting #12809

Merged: 2 commits merged into master on Apr 8, 2025

Conversation

@compilade (Collaborator) commented on Apr 7, 2025

Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.

As explained in #12791 (comment), this will likely help reduce RAM usage when converting Llama4, since the approach in #12791 uses torch.split on the FFN projections.
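In rough terms, the approach is to run the splitting op on shape-only "meta" tensors up front (so the number and shape of the parts are known without touching any data), and to return one lazy wrapper per part that defers the real `torch.split` until a part is actually needed. The sketch below is a minimal, self-contained illustration of that idea, assuming PyTorch; `LazyTensor` and `lazy_split` are hypothetical stand-ins, not the actual `LazyBase` machinery in gguf-py:

```python
import torch

class LazyTensor:
    """Sketch: pair a shape-only 'meta' tensor with a thunk that
    produces the real tensor only when it is actually needed."""
    def __init__(self, meta: torch.Tensor, thunk):
        self.meta = meta      # on the "meta" device: shape/dtype only, no data
        self._thunk = thunk   # zero-arg callable returning the real tensor

    def realize(self) -> torch.Tensor:
        return self._thunk()

def lazy_split(t: LazyTensor, split_size: int, dim: int = 0) -> tuple[LazyTensor, ...]:
    # Run the op on the meta tensor to learn how many parts there are
    # and what shape each one has -- no tensor data is touched here.
    metas = torch.split(t.meta, split_size, dim=dim)
    memo: dict[str, tuple] = {}

    def part(i: int):
        def thunk():
            # Evaluate the real split at most once, and only when some
            # part is actually realized; all parts share the result.
            if "parts" not in memo:
                memo["parts"] = torch.split(t.realize(), split_size, dim=dim)
            return memo["parts"][i]
        return thunk

    return tuple(LazyTensor(m, part(i)) for i, m in enumerate(metas))

# Usage: nothing is evaluated until .realize() is called on a part.
big = LazyTensor(torch.empty(8, 4, device="meta"),
                 lambda: torch.arange(32.0).reshape(8, 4))
a, b = lazy_split(big, 4)
print(b.meta.shape)  # torch.Size([4, 4]) -- known without evaluating anything
print(b.realize())   # only now is the full tensor produced and split
```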

TODO:

  • Test conversion with Llama4 and make sure this helps with RAM usage and that the output stays the same
    • @bartowski1182 or @ddh0, if you have the hashes from a previous conversion of the slow-to-convert model(s), that would be helpful (although the hash may depend on the name of the source directory, since the model metadata potentially includes part of it)

@compilade added the "python (python script changes)" label on Apr 7, 2025
@compilade requested a review from @ngxson on Apr 7, 2025, 23:33
@bartowski1182 (Contributor)

My sha256sum is 56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2

I may be able to pull your changes and see if it's different, but from looking at previously uploaded conversions it doesn't look like any folder metadata gets in there, and I don't add any of my own, so it should match up.
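For reference, byte-for-byte identity of two conversions can be checked by hashing the output in streamed chunks, so a multi-gigabyte GGUF never has to fit in RAM. This is a generic sketch equivalent to running sha256sum (the filename is a placeholder):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a (potentially very large) file without loading it all into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):  # read 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("model-converted.gguf"))  # hypothetical filename
```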

@compilade mentioned this pull request on Apr 8, 2025
@bartowski1182 (Contributor)

sha256 of conversion with this change: 56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2

so it matches, woo!

The conversion wasn't WAY faster; it still took well over an hour, I think about 1:30, but that's still faster than before, which was over 1:45 🤷

danielhanchen added a commit to unslothai/llama.cpp that referenced this pull request Apr 8, 2025
@ngxson merged commit a226bc7 into master on Apr 8, 2025
5 checks passed
tastelikefeet added a commit to tastelikefeet/llama.cpp that referenced this pull request Apr 10, 2025
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
* gguf-py : support lazy tensor splitting

Splitting usually involves returning tuples of tensors,
which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint