gguf-py : support lazy tensor splitting #12809
Conversation
Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.
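For context, here is a minimal sketch of the kind of handling this refers to. It is illustrative only: `LazyTensor` and `lazy_wrap` are hypothetical names, not the actual gguf-py classes, and the real lazy module in gguf-py differs. The idea is to determine the output structure cheaply (here via PyTorch meta tensors, which carry shapes but no data), so that a tuple-returning op like `torch.split` yields a tuple of lazy handles sharing one deferred computation, instead of being evaluated eagerly just to be unpacked.

```python
# Illustrative sketch only; LazyTensor/lazy_wrap are hypothetical names,
# not the actual gguf-py API.
from typing import Any, Callable

import torch


class LazyTensor:
    """A handle whose computation runs only when materialize() is called."""

    def __init__(self, thunk: Callable[[], torch.Tensor]):
        self._thunk = thunk

    def materialize(self) -> torch.Tensor:
        return self._thunk()


def lazy_wrap(fn: Callable[..., Any], *args: Any) -> Any:
    # Learn the output structure on the "meta" device (shapes only, no data),
    # so a tuple-returning op is never evaluated eagerly just to be unpacked.
    meta_args = tuple(a.to("meta") if isinstance(a, torch.Tensor) else a for a in args)
    meta_out = fn(*meta_args)

    cache: dict[str, Any] = {}

    def compute() -> Any:
        # All outputs of one call share a single deferred evaluation.
        if "out" not in cache:
            cache["out"] = fn(*args)
        return cache["out"]

    if isinstance(meta_out, tuple):
        # Each element gets its own lazy handle over the shared computation.
        return tuple(LazyTensor(lambda i=i: compute()[i]) for i in range(len(meta_out)))
    return LazyTensor(compute)


# Usage: nothing is split until a half is actually needed.
w = torch.arange(32, dtype=torch.float32).reshape(8, 4)
gate, up = lazy_wrap(torch.split, w, 4, 0)  # returns lazy handles, no split yet
print(gate.materialize().shape)  # torch.Size([4, 4])
```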
My sha256sum is `56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2`. I may be able to pull your changes and see if it's different, but from looking at previously uploaded conversions it doesn't look like any folder metadata gets in there, and I don't add any of my own, so it should match up.
sha256 of the conversion with this change: `56a723c60b94a95a5814c1ac6d5382b3011cb9931763e20f6f14aec264348bf2`, so it matches, woo! The conversion wasn't WAY faster; it still took well over an hour, about 1:30, but that's still faster than before, which was over 1:45 🤷
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
* gguf-py : support lazy tensor splitting

  Splitting usually involves returning tuples of tensors, which need to be handled properly to avoid early eager evaluation.

* gguf-py : fix flake8 lint
As explained in #12791 (comment), this will likely help reduce the RAM usage when converting Llama4, since the approach in #12791 uses `torch.split` on the FFN projections.
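As a toy illustration of why that pattern matters here (the tensor name, layout, and shapes below are assumptions for the example, not the exact code from #12791): a packed FFN projection can be split into its gate and up halves with `torch.split`, which returns a tuple, so under lazy conversion the unpacking must not force evaluation.

```python
import torch

n_embd, n_ff = 4, 8                      # toy sizes, not Llama4's real dimensions
gate_up = torch.randn(n_embd, 2 * n_ff)  # hypothetical packed [gate | up] projection

# torch.split returns a *tuple* of views along the chosen dimension;
# with lazy conversion, neither half is materialized until it is written out.
gate, up = torch.split(gate_up, n_ff, dim=-1)
assert gate.shape == up.shape == (n_embd, n_ff)
```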