Track and free temporary ggml_tensor_extra_gpu struct #2195
Conversation
I think this solution is overengineered. How about this instead: allocate a small buffer in the beginning and re-use it to hold the
That works, but where would this pool live? Global? There are 4 cases here: https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L3335
So how do we know when to allocate from the pool? The begin/end was there because I can't decide which tensors are temporary.
Since right now everything in
The inplace and cpy cases are for tensors that don't actually change the data, so they use the data pointers of the tensors that hold the actual data. The scratch case is for tensors that hold only temporary results that are okay to overwrite later. The last case is needed for the KV cache, whose data must not be overwritten, which is why it is not on the scratch buffer. Currently there is a lot of overlap and confusion between In any case, the pool should be used for the inplace, cpy, and scratch cases; the KV cache data should not be overwritten, and it is already freed by the
In What should be done about those? Edit: Wait, those are temporary?
What I am working on is going to change significantly how resources are managed. I will open a draft PR in the next few days that will clarify some of these things, but it's going to take a while until it is ready; multi-GPU is not even supported in my branch yet. So if you need to fix this now, just do it in whatever way is most convenient to you, and don't worry too much about making the design future-proof.
The loras are merged into the model weights, so whatever resources are needed to apply them, they aren't used afterward.
When a LoRA is applied, a small graph that modifies the weights is executed. The final node is pre-allocated, but there are some temporary tensors in between. There is no practical difference compared to the larger graphs during eval.
Closed in favor of: #2220
Fix #2145.
Temporary allocations during eval are tracked and freed at the end.
I decided to go with the implicit context idea here: #2146 (comment), since the code change is minimal.
Pooling could be added if needed and freed in `llama_backend_free`. Tested on my machine with `--ignore-eos` to keep generation running, and RAM usage does not increase anymore.