gguf-py : support lazy tensor splitting #12809
Conversation
Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.
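For context, here is a minimal sketch of the kind of handling this refers to. It is illustrative only: `LazyTensor` and `lazy_wrap` are hypothetical names, not the actual gguf-py classes, and the real lazy module in gguf-py differs. The idea is to determine the output structure cheaply (here via PyTorch meta tensors, which carry shapes but no data), so that a tuple-returning op like `torch.split` yields a tuple of lazy handles sharing one deferred computation, instead of being evaluated eagerly just to be unpacked.

```python
# Illustrative sketch only; LazyTensor/lazy_wrap are hypothetical names,
# not the actual gguf-py API.
from typing import Any, Callable

import torch


class LazyTensor:
    """A handle whose computation runs only when materialize() is called."""

    def __init__(self, thunk: Callable[[], torch.Tensor]):
        self._thunk = thunk

    def materialize(self) -> torch.Tensor:
        return self._thunk()


def lazy_wrap(fn: Callable[..., Any], *args: Any) -> Any:
    # Learn the output structure on the "meta" device (shapes only, no data),
    # so a tuple-returning op is never evaluated eagerly just to be unpacked.
    meta_args = tuple(a.to("meta") if isinstance(a, torch.Tensor) else a for a in args)
    meta_out = fn(*meta_args)

    cache: dict[str, Any] = {}

    def compute() -> Any:
        # All outputs of one call share a single deferred evaluation.
        if "out" not in cache:
            cache["out"] = fn(*args)
        return cache["out"]

    if isinstance(meta_out, tuple):
        # Each element gets its own lazy handle over the shared computation.
        return tuple(LazyTensor(lambda i=i: compute()[i]) for i in range(len(meta_out)))
    return LazyTensor(compute)


# Usage: nothing is split until a half is actually needed.
w = torch.arange(32, dtype=torch.float32).reshape(8, 4)
gate, up = lazy_wrap(torch.split, w, 4, 0)  # returns lazy handles, no split yet
print(gate.materialize().shape)  # torch.Size([4, 4])
```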
My sha256sum is `56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2`. I may be able to pull your changes and see if it's different, but from looking at previously uploaded conversions it doesn't look like any folder metadata gets in there, and I don't add any of my own, so it should match up.
sha256 of the conversion with this change: `56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2`, so it matches, woo! The conversion wasn't WAY faster; it still took well over an hour, about 1:30, but that's still faster than before, which was over 1:45 🤷
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
* gguf-py : support lazy tensor splitting

  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint
As explained in #12791 (comment), this will likely help reduce the RAM usage when converting Llama4, since the approach in #12791 uses `torch.split` on the FFN projections.
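As a toy illustration of why that pattern matters here (the tensor name, layout, and shapes below are assumptions for the example, not the exact code from #12791): a packed FFN projection can be split into its gate and up halves with `torch.split`, which returns a tuple, so under lazy conversion the unpacking must not force evaluation.

```python
import torch

n_embd, n_ff = 4, 8                      # toy sizes, not Llama4's real dimensions
gate_up = torch.randn(n_embd, 2 * n_ff)  # hypothetical packed [gate | up] projection

# torch.split returns a *tuple* of views along the chosen dimension;
# with lazy conversion, neither half is materialized until it is written out.
gate, up = torch.split(gate_up, n_ff, dim=-1)
assert gate.shape == up.shape == (n_embd, n_ff)
```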